
Benchmark gaming

Why the highest score might not be the best model

What is benchmark gaming?

Benchmark gaming is the practice of optimising a model's training or evaluation process to achieve higher benchmark scores without proportionally improving real-world capability. This can happen through data contamination (training on benchmark questions), overfitting to test formats, or selecting evaluation conditions that favour the model.
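
As a rough illustration of the first mechanism, contamination checks often look for verbatim overlap between training text and benchmark items. The sketch below is a minimal, hypothetical Python example: the n-gram size, the simple flag-on-any-overlap rule, and the sample data are assumptions for illustration, not sourc.dev methodology or any specific published contamination test.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    # Word-level n-grams of lowercased text.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(training_docs: list[str], benchmark_items: list[str], n: int = 8) -> float:
    # Fraction of benchmark items that share at least one n-gram with the training corpus.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0

if __name__ == "__main__":
    # Toy data: one benchmark question overlaps the training corpus, one does not.
    corpus = ["paris is the capital of france and its largest city"]
    questions = ["what is the capital of france", "what is the boiling point of water"]
    print(f"{contamination_rate(corpus, questions, n=4):.0%} of benchmark items flagged")

In practice, published contamination studies vary the n-gram length and require more than a single match before flagging an item; the point here is only the shape of the check.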

The honest case for benchmarks: even with gaming, benchmarks provide a consistent comparison framework. A model that scores 90% on MMLU — even if the score is slightly inflated — is likely more capable than one scoring 60%.

The honest case for scepticism: a 2-point improvement in MMLU does not mean a 2-point improvement in your use case. Models are increasingly optimised for benchmark performance specifically, and the gap between benchmark scores and real-world utility is widening. GPT-4 scored 86.4% on MMLU at launch. Some newer models score higher but perform worse on practical tasks.

Why it matters

sourc.dev publishes benchmark scores as data, not as rankings. The methodology page documents how each score is sourced. The editorial position is explicit: benchmarks are one signal among many. No entity page on sourc.dev recommends a model based on benchmark scores alone.
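
To make "scores as data" concrete, an entry of this kind might carry its provenance alongside the number. The record below is a hypothetical Python sketch; the field names, the placeholder URL, and the "5-shot" condition are assumptions for illustration, not the actual sourc.dev schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkScore:
    model: str        # e.g. "GPT-4"
    benchmark: str    # e.g. "MMLU"
    score: float      # reported score, in percent
    source_url: str   # where the score was reported (placeholder below)
    conditions: str   # shot count, prompt format, other evaluation conditions
    verified: str     # date the entry was last checked

entry = BenchmarkScore(
    model="GPT-4",
    benchmark="MMLU",
    score=86.4,                         # the launch figure quoted above
    source_url="https://example.com/",  # placeholder, not a real source
    conditions="5-shot (assumed for illustration)",
    verified="2026-03",
)
print(entry)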

Verified March 2026 · Source: MMLU contamination research, sourc.dev methodology

Related terms
MMLU · HumanEval · LLM