sourc.dev
#40 of 50

Multimodal

Models that see, hear, and read — and what that costs

What is multimodal

A multimodal model processes more than one type of input — text, images, audio, or video. GPT-4 was text-only. GPT-4V added vision. GPT-4o added audio. Each modality added cost and capability.
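As a rough illustration of what "more than one type of input" looks like in practice, here is a minimal sketch of a request that sends a text question and an image in the same message, assuming the OpenAI Python SDK's chat-completions format; the model name, image URL, and prompt are placeholders, not recommendations.

# Minimal sketch of a multimodal request: one text part and one image part
# in the same user message. Assumes the OpenAI Python SDK (openai >= 1.x)
# and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What product is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/product.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

The same message shape is what separates a multimodal model from a text-only one: the request carries both a text part and an image part, and the model answers about both together.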

The term entered common use in 2023 when OpenAI released GPT-4V and Google released Gemini with native multimodal support. Before this, separate models handled separate modalities — one for text, another for images, a third for speech.

Why it matters

Multimodal capability changes what applications are possible. A text-only model cannot read a screenshot, transcribe a meeting, or describe a product photo. A multimodal model can. But each additional modality carries a token cost: images are tokenised according to their resolution, audio according to its duration, and each provider counts them differently. sourc.dev tracks vision and audio support as capability flags on every model.
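To make that cost concrete, here is a rough Python sketch of one way to estimate image token cost, assuming the tiling scheme OpenAI has documented for its GPT-4-class vision models (85 base tokens plus 170 per 512×512 tile after resizing). Other providers count image tokens differently, so treat the numbers as illustrative.

# Rough estimate of vision token cost, assuming OpenAI's documented tiling
# scheme: 85 base tokens + 170 tokens per 512x512 tile after resizing.
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # low-detail images are a flat cost

    # Scale down to fit within a 2048x2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # Count 512x512 tiles and apply the per-tile and base costs.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

# Example: a 1920x1080 screenshot comes to 6 tiles, so 170 * 6 + 85 = 1105 tokens.
# At an input price of $2.50 per 1M tokens, that is roughly $0.0028 per image.
print(estimate_image_tokens(1920, 1080))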

Verified March 2026 · Source: OpenAI GPT-4V announcement, Google Gemini docs

Related terms
Vision / image input · Token · Context window
← All terms
← Agents · Structured output →