Multimodal AI
AI models that natively accept and/or produce more than one modality — text, image, audio, video — in a single model.
A multimodal model is trained jointly on multiple data types so it can reason across them. GPT-4o, Claude with vision, and Gemini 2.5 Pro all accept text and images; Gemini natively handles audio and video, and OpenAI's Realtime API streams audio in and out with sub-second latency.
Multimodal capabilities unlock use cases that text-only models cannot reach: reading a screenshot, transcribing and summarising a meeting in one call, generating an image from a chat turn, or controlling a browser by 'seeing' the page.
Output multimodality (image and audio generation) is usually a separate model — Imagen, DALL·E, gpt-image, Veo, Sora — accessed from the same provider's API.
See also on SoftPerceptron
Related terms
More to explore
Other wiki entries that touch on Multimodal AI.
- Tokens
The atomic units that LLMs read and write — sub-word pieces produced by a tokenizer. Pricing and context limits are measured in tokens, not words.
- Context Window
The maximum number of tokens an LLM can read in a single request, including the prompt, retrieved documents and the model's own reply.
- Prompt Engineering
The practice of designing inputs to LLMs to reliably produce useful outputs — through structure, examples, role-setting and constraints.
- Model Context Protocol (MCP)
An open standard from Anthropic for connecting LLMs to external tools and data sources through a uniform server interface.