#24 of 25

Latency

Your users do not measure tokens — they measure seconds. This is the number that determines whether your product feels good.

What the model does:

Receives your request → processes tokens → generates a response → sends it.

What your user experiences:

Clicks a button → waits → reads.

Latency is the time between sending a request and receiving the first byte of a response. It is measured in milliseconds. It is what the user experiences as speed, whether or not they have ever heard the term.

Time to First Token vs Total Time

Two latency numbers matter for AI applications.

Time to First Token (TTFT): how long until the first word of the response starts arriving. With streaming enabled, this is what determines whether your interface feels alive or frozen. A TTFT under 800ms feels responsive. Over 2 seconds starts to feel broken.

Total generation time: how long until the complete response arrives. For long responses, this can be 10–30 seconds. With streaming, users are reading while the model is still writing — total time matters less than it seems.
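The difference between the two numbers is easy to see with a small timing sketch. This is a simulation, not a real API call: `stream_tokens` is a hypothetical stand-in for a streaming response, with made-up startup and per-token delays. The point is where the two measurements are taken, not the specific numbers.

```python
import time

def stream_tokens(n_tokens=50, per_token_s=0.02, startup_s=0.3):
    """Simulated streaming response: a startup delay, then tokens.

    Hypothetical stand-in for a real streaming API call.
    """
    time.sleep(startup_s)           # stand-in for network + model startup
    for i in range(n_tokens):
        time.sleep(per_token_s)     # stand-in for per-token generation
        yield f"token{i} "

def measure(stream):
    """Return (TTFT, total generation time) in seconds."""
    start = time.monotonic()
    ttft = None
    for _chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
    total = time.monotonic() - start         # last token arrived
    return ttft, total

ttft, total = measure(stream_tokens())
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

In this toy run, TTFT lands around the startup delay while total time includes every token after it, which is why a streaming interface can feel responsive even when the full response takes many times longer to finish.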

Why this matters to you

Latency varies significantly by model, provider, time of day, and your location relative to the server. A model that is cheap and capable but slow is a different product decision than a model that costs more but responds quickly.

For user-facing applications where someone is waiting for a response, latency is a product quality metric. A 500ms response feels instant. A 5-second response feels slow. A 15-second response — without streaming — loses users.

For background processing — batch jobs, document analysis, pipelines that run overnight — latency is irrelevant. Cost and quality are what matter.

How to use this

Match your latency requirements to your use case before you pick a model. If users are waiting in real time, test TTFT on your target models before you commit. Enable streaming. Measure from your deployment region, not from wherever the benchmark was run.

The fastest model for your users is not necessarily the model with the lowest published latency. Network distance, server load, and your infrastructure all contribute. Test it yourself.
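One way to test it yourself is to time repeated calls from your own infrastructure and look at percentiles rather than a single run, since latency varies call to call. A minimal sketch, where `call_model` is a hypothetical placeholder you would replace with your actual client call:

```python
import statistics
import time

def call_model(prompt):
    # Hypothetical placeholder: swap in your real API client call.
    time.sleep(0.05)
    return "ok"

def latency_percentiles(fn, prompt, runs=20):
    """Time `fn(prompt)` repeatedly; return p50 and p95 latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        fn(prompt)
        samples.append((time.monotonic() - start) * 1000)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50": statistics.median(samples), "p95": cuts[94]}

print(latency_percentiles(call_model, "hello"))
```

Reporting p95 alongside the median matters because users remember the slow outliers, and averages hide them.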

Verified March 2026 · Source: Provider API documentation, independent latency benchmarks

← All terms