Multimodal AI

AI models that natively accept and/or produce more than one modality — text, image, audio, video — in a single model.

A multimodal model is trained jointly on multiple data types so it can reason across them. GPT-4o, Claude with vision, and Gemini 2.5 Pro all accept text and images; Gemini natively handles audio and video, and OpenAI's Realtime API streams audio in and out with sub-second latency.

Multimodal capabilities unlock use cases that text-only models cannot reach: reading a screenshot, transcribing and summarising a meeting in one call, generating an image from a chat turn, or controlling a browser by 'seeing' the page.

Output multimodality (image and audio generation) is usually a separate model — Imagen, DALL·E, gpt-image, Veo, Sora — accessed from the same provider's API.

Related terms

Large Language Model (LLM)
A neural network trained on massive text corpora to predict the next token, used for chat, coding, reasoning and as the brain inside AI agents.
AI Agent
An LLM-based system that can plan, use tools and take multi-step actions toward a goal — not just answer a single prompt.

More to explore

Other wiki entries that touch on Multimodal AI.

Tokens
The atomic units that LLMs read and write — sub-word pieces produced by a tokenizer. Pricing and context limits are measured in tokens, not words.
Context Window
The maximum number of tokens an LLM can read in a single request, including the prompt, retrieved documents and the model's own reply.
Prompt Engineering
The practice of designing inputs to LLMs to reliably produce useful outputs — through structure, examples, role-setting and constraints.
Model Context Protocol (MCP)
An open standard from Anthropic for connecting LLMs to external tools and data sources through a uniform server interface.

Multimodal AI

See also on SoftPerceptron

Related terms

More to explore