Multimodal AI

AI models that natively accept and/or produce more than one modality — text, image, audio, video — in a single model.

A multimodal model is trained jointly on multiple data types so it can reason across them. GPT-4o, Claude with vision, and Gemini 2.5 Pro all accept text and images; Gemini natively handles audio and video, and OpenAI's Realtime API streams audio in and out with sub-second latency.

Multimodal capabilities unlock use cases that text-only models cannot reach: reading a screenshot, transcribing and summarising a meeting in one call, generating an image from a chat turn, or controlling a browser by 'seeing' the page.

Output multimodality (image and audio generation) is usually a separate model — Imagen, DALL·E, gpt-image, Veo, Sora — accessed from the same provider's API.

See also on SoftPerceptron

Related terms

More to explore

Other wiki entries that touch on Multimodal AI.