sourc.dev
Home LLMs Tools SaaS APIs
Claude 3.5 Sonnet input $3.00/1M ↓ -50%
GPT-4o input $2.50/1M
Gemini 1.5 Pro input $1.25/1M
Mistral Large input $2.00/1M ↓ -33%
DeepSeek V3 input $0.27/1M
synced 2026-04-05

AI benchmarks

The numbers everyone cites and almost nobody understands

What are AI benchmarks

AI benchmarks are standardised tests that measure model performance on defined tasks. MMLU tests general knowledge across 57 subjects. HumanEval tests code generation. MATH tests mathematical reasoning. Each produces a score that enables comparison across models.
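Mechanically, most knowledge benchmarks like MMLU reduce to plain accuracy: ask each question, compare the model's answer to the reference answer, and report the fraction correct. A minimal sketch, using hypothetical stand-in data and a toy answer function rather than a real evaluation harness:

```python
def score(items, model_answer):
    """Return the fraction of benchmark items answered correctly."""
    correct = sum(
        1 for item in items
        if model_answer(item["question"]) == item["answer"]
    )
    return correct / len(items)

# Hypothetical items standing in for real benchmark questions.
items = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# A toy "model" that knows one of the two answers.
toy_model = lambda q: "4" if "2 + 2" in q else "London"

print(score(items, toy_model))  # 0.5
```

Real harnesses add prompt templates, answer extraction, and per-subject aggregation, but the final number is still this ratio.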

The case for benchmarks: they are reproducible, comparable, and version-controlled. The same test applied to different models gives a direct signal about relative capability. Without benchmarks, model comparison would be purely anecdotal.

The case against: benchmarks can be gamed. Models can be trained on benchmark data (contamination). High scores on HumanEval do not guarantee good real-world code. And as models approach ceiling scores, benchmarks lose discriminative power.
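Code benchmarks illustrate why a single score hides detail. HumanEval results are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The HumanEval paper gives an unbiased estimator for this, sketched below:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: completions sampled per problem
    c: completions that passed the unit tests
    k: evaluation budget
    Returns the estimated probability that at least one of k
    randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a pass is certain
    # 1 - P(all k draws come from the n - c failing samples)
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples passed; with a budget of 1 draw, pass@1 is 3/10.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

Note that pass@1 and pass@100 for the same model can differ widely, which is one reason headline numbers from different reports are not always directly comparable.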

Why it matters

sourc.dev publishes benchmark scores as verified attributes — not as rankings. The scores are data points, not verdicts. A high MMLU score is useful context. It is not a recommendation. The methodology page at [/methodology](/methodology) documents how each benchmark is sourced and verified.

Verified March 2026 · Source: MMLU paper, HumanEval paper, sourc.dev methodology

Related terms
MMLU · HumanEval · LLM
← All terms
← Reasoning models · Model family →