sourc.dev
Home LLMs Tools SaaS APIs
Claude 3.5 Sonnet input $3.00/1M ↓ -50%
GPT-4o input $2.50/1M
Gemini 1.5 Pro input $1.25/1M
Mistral Large input $2.00/1M ↓ -33%
DeepSeek V3 input $0.27/1M
synced 2026-04-05

AI benchmarks

The numbers everyone cites and almost nobody understands

What are AI benchmarks

AI benchmarks are standardised tests that measure model performance on defined tasks. MMLU tests general knowledge across 57 subjects. HumanEval tests code generation. MATH tests mathematical reasoning. Each produces a score that enables comparison across models.
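Mechanically, most knowledge benchmarks like MMLU reduce to plain accuracy: ask each question, compare the model's answer to the reference answer, and report the fraction correct. A minimal sketch, using hypothetical stand-in data and a toy answer function rather than a real evaluation harness:

```python
def score(items, model_answer):
    """Return the fraction of benchmark items answered correctly."""
    correct = sum(
        1 for item in items
        if model_answer(item["question"]) == item["answer"]
    )
    return correct / len(items)

# Hypothetical items standing in for real benchmark questions.
items = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# A toy "model" that knows one of the two answers.
toy_model = lambda q: "4" if "2 + 2" in q else "London"

print(score(items, toy_model))  # 0.5
```

Real harnesses add prompt templates, answer extraction, and per-subject aggregation, but the final number is still this ratio.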

The case for benchmarks: they are reproducible, comparable, and version-controlled. The same test applied to different models gives a direct signal about relative capability. Without benchmarks, model comparison would be purely anecdotal.

The case against: benchmarks can be gamed. Models can be trained on benchmark data (contamination). High scores on HumanEval do not guarantee good real-world code. And as models approach ceiling scores, benchmarks lose discriminative power.
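Code benchmarks illustrate why a single score hides detail. HumanEval results are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The HumanEval paper gives an unbiased estimator for this, sketched below:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: completions sampled per problem
    c: completions that passed the unit tests
    k: evaluation budget
    Returns the estimated probability that at least one of k
    randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a pass is certain
    # 1 - P(all k draws come from the n - c failing samples)
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples passed; with a budget of 1 draw, pass@1 is 3/10.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

Note that pass@1 and pass@100 for the same model can differ widely, which is one reason headline numbers from different reports are not always directly comparable.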

Why it matters

sourc.dev publishes benchmark scores as verified attributes — not as rankings. The scores are data points, not verdicts. A high MMLU score is useful context. It is not a recommendation. The methodology page at [/methodology](/methodology) documents how each benchmark is sourced and verified.

Verified March 2026 · Source: MMLU paper, HumanEval paper, sourc.dev methodology

Related terms
MMLU · HumanEval · LLM
← All terms
← Reasoning models · Model family →