sourc.dev
#40 of 50

Multimodal

Models that see, hear, and read — and what that costs

What is multimodal

A multimodal model processes more than one type of input — text, images, audio, or video. GPT-4 was text-only. GPT-4V added vision. GPT-4o added audio. Each modality added cost and capability.
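As a rough illustration of what "more than one type of input" looks like in practice, here is a minimal sketch of a request that sends a text question and an image in the same message, assuming the OpenAI Python SDK's chat-completions format; the model name, image URL, and prompt are placeholders, not recommendations.

# Minimal sketch of a multimodal request: one text part and one image part
# in the same user message. Assumes the OpenAI Python SDK (openai >= 1.x)
# and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What product is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/product.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

The same message shape is what separates a multimodal model from a text-only one: the request carries both a text part and an image part, and the model answers about both together.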

The term entered common use in 2023 when OpenAI released GPT-4V and Google released Gemini with native multimodal support. Before this, separate models handled separate modalities — one for text, another for images, a third for speech.

Why it matters

Multimodal capability changes what applications are possible. A text-only model cannot read a screenshot, transcribe a meeting, or describe a product photo. A multimodal model can. But each additional modality carries a token cost: images are tokenised according to their resolution, audio according to its duration, and each provider counts them differently. sourc.dev tracks vision and audio support as capability flags on every model.
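To make that cost concrete, here is a rough Python sketch of one way to estimate image token cost, assuming the tiling scheme OpenAI has documented for its GPT-4-class vision models (85 base tokens plus 170 per 512×512 tile after resizing). Other providers count image tokens differently, so treat the numbers as illustrative.

# Rough estimate of vision token cost, assuming OpenAI's documented tiling
# scheme: 85 base tokens + 170 tokens per 512x512 tile after resizing.
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # low-detail images are a flat cost

    # Scale down to fit within a 2048x2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # Count 512x512 tiles and apply the per-tile and base costs.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

# Example: a 1920x1080 screenshot comes to 6 tiles, so 170 * 6 + 85 = 1105 tokens.
# At an input price of $2.50 per 1M tokens, that is roughly $0.0028 per image.
print(estimate_image_tokens(1920, 1080))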

Verified March 2026 · Source: OpenAI GPT-4V announcement, Google Gemini docs

Related terms
Vision / image input · Token · Context window
← All terms
← Agents · Structured output →