#7 of 25

Vision / image input

You can skip writing descriptions entirely: send the image and let the model read it.

What is vision / image input

Before vision, you described things to the model. Now you can show it.

A screenshot of an error message. A photograph of a form. A chart from a presentation. A product photo. A handwritten note. You attach the image, ask your question, and the model looks at it the same way you would — reading text, interpreting layout, identifying objects, describing what it sees.

The model reads images the way it reads text. The image is converted into tokens that fill the same context window; only the input format differs.
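In practice, "attach the image" means encoding it into the request body. A minimal sketch in Python, following the content-block shape of Anthropic's Messages API (the client call itself is omitted, and `image/jpeg` is an assumption about the file):

```python
import base64

def build_image_message(image_bytes: bytes, question: str) -> dict:
    """Build one user message pairing an image with a question.

    Follows the content-block structure of Anthropic's Messages API;
    the JPEG media type is an assumption about the input file.
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }

# The resulting dict is what you would put in the `messages` list of a request.
msg = build_image_message(b"\xff\xd8\xff\xe0 fake jpeg bytes", "What does this error say?")
```

The image block comes first and the question second, so the model reads the image before the instruction, though either order works.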

The number that makes it real

Among the 30 models tracked on sourc.dev, vision input is one of the most common capability flags. Claude 3.5 Sonnet supports it. GPT-4o supports it. Gemini 1.5 Pro supports it. Verified March 2026.

An image consumes far more tokens than a short text prompt: a typical image uses 500–1,500 tokens depending on resolution and model. Factor that into cost estimates for image-heavy applications.
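To see what that range means at scale, a back-of-envelope sketch (the per-million-token price here is a made-up placeholder, not a real quote; check your provider's pricing page):

```python
def estimate_image_cost(num_images: int, tokens_per_image: int = 1000,
                        usd_per_million_tokens: float = 3.0) -> float:
    """Rough input-token cost for an image-heavy workload.

    tokens_per_image uses the midpoint of the 500-1,500 range above;
    the $3 per million input tokens figure is a placeholder assumption.
    """
    return num_images * tokens_per_image * usd_per_million_tokens / 1_000_000

# 10,000 images at ~1,000 tokens each is roughly 10M tokens of input.
cost = estimate_image_cost(10_000)
```

At those placeholder numbers, processing 10,000 images costs about $30 in input tokens alone, before any text prompt or output tokens.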

Why this matters to you

The most immediate use case is not the most obvious one.

Developers use vision to process documents that are images — scanned PDFs, screenshots, photos of forms. Instead of building a separate OCR pipeline to extract text before sending it to the model, you send the image directly. One step instead of three. The model reads it.

Support teams use it to let users attach screenshots of errors. Instead of asking "what does the error message say?", you just see it.

Data analysts use it to send charts and ask for interpretation. The model describes the trend, names the outlier, reads the axis labels.

Any time you would otherwise write a description of something visual — consider whether you can just send the visual.
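The support-screenshot flow above reduces to forwarding the user's upload. A sketch in the OpenAI Chat Completions content format (the request call is omitted; the PNG media type and the helper name are assumptions):

```python
import base64

def screenshot_to_content(image_bytes: bytes, question: str) -> list:
    """Package a user-attached screenshot for an OpenAI-style chat request.

    Uses an image_url content part carrying a base64 data URL, as the
    Chat Completions API documents; PNG is an assumption about the upload.
    """
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]

# This list becomes the `content` of a user message in the request.
content = screenshot_to_content(b"\x89PNG fake bytes", "What error is shown here?")
```

No OCR step, no "please transcribe the error" round trip: the screenshot goes into the request as-is.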

The edge case worth knowing

Vision models describe what they see, but they can misread small text, complex tables, and low-resolution images. For critical data extraction — financial figures, medical records, legal text — always verify the model's reading against the source. It is a capable reader. It is not infallible.
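One cheap way to verify numeric extraction is an internal consistency check before anything reaches a human: do the extracted parts agree with each other? A sketch (the function and tolerance are illustrative, and no check replaces comparing against the source document):

```python
def totals_consistent(line_items: list[float], stated_total: float,
                      tolerance: float = 0.01) -> bool:
    """Sanity-check model-extracted financial figures: do the extracted
    line items actually sum to the extracted total? A mismatch flags a
    likely misread for human review. Names and tolerance are illustrative.
    """
    return abs(sum(line_items) - stated_total) <= tolerance

# A correctly read invoice passes; a misread digit (120.00 seen as 720.00) fails.
ok = totals_consistent([19.99, 120.00, 5.01], 145.00)
bad = totals_consistent([19.99, 720.00, 5.01], 145.00)
```

This catches only self-inconsistent misreads; a figure that is wrong but internally consistent still needs checking against the original image.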

Verified March 2026 · Source: Anthropic, OpenAI model capability documentation
