MMLU benchmark


Massive Multitask Language Understanding — 57-subject knowledge test, 0-100%

What is the MMLU benchmark?

MMLU (Massive Multitask Language Understanding) is a multiple-choice test of knowledge and reasoning across 57 academic subjects, including maths, history, law, medicine, and computer science. The average human expert scores approximately 89.8%.
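
For readers who want the mechanics, the sketch below shows how an MMLU-style item is commonly represented and how accuracy is computed. The field names and the example question are illustrative assumptions, not the official dataset schema.

```python
from dataclasses import dataclass

@dataclass
class MMLUItem:
    question: str
    choices: list[str]  # four answer options
    answer: int         # index (0-3) of the correct choice

def accuracy(items: list[MMLUItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice matches the answer key."""
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)

# Illustrative item; the wording is hypothetical.
item = MMLUItem(
    question="What is the degree of the polynomial x^3 + 2x + 1?",
    choices=["1", "2", "3", "4"],
    answer=2,  # "3"
)
print(accuracy([item], [2]))  # 1.0
```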

Why it matters

MMLU provides a standardised way to compare reasoning ability across models. While no single benchmark tells the full story, MMLU covers enough subjects to give a useful signal about general knowledge and reasoning capability. It is the most widely cited benchmark in model announcements.

Where models stand

  1. o1: 92.3%
  2. 90.8%
  3. 88.7%
  4. 88.7%
  5. 88.6%

Data available for 24 of 30 tracked models.

How sourc.dev tracks this

sourc.dev verifies MMLU scores manually against official provider documentation, API responses, and published specifications. Every data point includes a source URL and verification date. When a value changes, the old value is preserved in the history table and the new value is recorded alongside it. Nothing is overwritten; the full timeline is always available.
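
A minimal sketch of that append-only record keeping, assuming a simple in-memory list; the type and field names are hypothetical, chosen only to show that each new value is appended with its source URL and verification date while earlier rows stay untouched.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BenchmarkRecord:
    model: str
    benchmark: str     # e.g. "mmlu"
    value: float       # score in percent
    source_url: str
    verified_on: date

history: list[BenchmarkRecord] = []

def record_value(rec: BenchmarkRecord) -> None:
    """Append the new value; nothing is ever overwritten."""
    history.append(rec)

# An updated score is appended alongside the old one, preserving the timeline.
record_value(BenchmarkRecord("example-model", "mmlu", 88.7,
                             "https://example.com/model-card", date(2024, 5, 1)))
record_value(BenchmarkRecord("example-model", "mmlu", 90.1,
                             "https://example.com/updated-card", date(2024, 9, 1)))
```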


Frequently asked questions

What subjects does MMLU cover?

57 subjects spanning STEM, humanities, social sciences, and professional domains — including abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, computer security, econometrics, jurisprudence, and virology, among others.

What is a good MMLU score?

Human expert performance averages approximately 89.8%. Leading models now score above 85%, with some exceeding 90%. A score above 70% indicates strong general knowledge. Random guessing on the four-choice questions yields roughly 25%, so scores near that level carry no real signal.
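
The arithmetic behind those baselines, as a quick check; the 89.8% expert figure is the one quoted above.

```python
# With four answer choices, uniform random guessing scores about 25% in expectation.
n_choices = 4
random_baseline = 1 / n_choices   # 0.25
expert_baseline = 0.898           # average human expert accuracy cited above
print(f"random guessing: {random_baseline:.0%}, human expert: {expert_baseline:.1%}")
```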

Is MMLU still a useful benchmark?

MMLU remains the most widely cited benchmark for general reasoning. However, as top models approach and exceed human expert scores, its discriminative power is decreasing. Newer benchmarks like GPQA and MATH target harder problems.
