Highest Benchmark Score (MMLU)

Models ranked by MMLU (Massive Multitask Language Understanding) score. MMLU tests knowledge and reasoning across 57 academic subjects; scores above 90% exceed average human expert performance.

Methodology: models are sorted by benchmark_mmlu in descending order. The human expert average is approximately 89.8%. Not all models have published MMLU scores; models without one are omitted from the ranking.
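
A minimal sketch of this ranking logic in Python, assuming each model is a record whose benchmark_mmlu field holds a float score or None when no score has been published. The record shape and the sample entries are illustrative assumptions, not the full dataset:

```python
# Approximate human expert average on MMLU, per the methodology note above.
HUMAN_EXPERT_AVG = 89.8

# Hypothetical record shape: only benchmark_mmlu is named in the methodology.
models = [
    {"model": "o1", "benchmark_mmlu": 92.3},
    {"model": "GPT-4o", "benchmark_mmlu": 88.7},
    {"model": "Unscored Example", "benchmark_mmlu": None},  # no published score
]

# Drop models without a published score, then sort descending by MMLU.
ranked = sorted(
    (m for m in models if m["benchmark_mmlu"] is not None),
    key=lambda m: m["benchmark_mmlu"],
    reverse=True,
)

for rank, m in enumerate(ranked, start=1):
    flag = " (above human expert avg)" if m["benchmark_mmlu"] > HUMAN_EXPERT_AVG else ""
    print(f'{rank}. {m["model"]}: {m["benchmark_mmlu"]:.1f}% MMLU{flag}')
```

Filtering before sorting mirrors the note above: models without a published score never appear in the ranking rather than sorting to the bottom.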

| Rank | Model | MMLU Score |
|------|-------|------------|
| 1 | o1 | 92.3% |
| 2 | DeepSeek R1 | 90.8% |
| 3 | GPT-4o | 88.7% |
| 4 | Claude 3.5 Sonnet | 88.7% |
| 5 | Llama 3.1 405B | 88.6% |
| 6 | DeepSeek V3 | 88.5% |
| 7 | Claude 3 Opus | 86.8% |
| 8 | GPT-4 | 86.4% |
| 9 | GPT-4 Turbo | 86.4% |
| 10 | Qwen 2.5 72B | 86.1% |
| 11 | Llama 3.3 70B | 86.0% |
| 12 | Gemini 1.5 Pro | 85.9% |
| 13 | Mistral Large 2 | 84.0% |
| 14 | GPT-4o mini | 82.0% |
| 15 | Llama 3 70B | 82.0% |
| 16 | Gemini 1.0 Pro | 79.1% |
| 17 | Claude 3 Sonnet | 79.0% |
| 18 | Command R+ | 75.7% |
| 19 | Claude 3 Haiku | 75.2% |
| 20 | Mixtral 8x7B | 70.6% |
| 21 | GPT-3.5 Turbo | 70.0% |
| 22 | Llama 2 70B | 68.9% |
| 23 | Mistral 7B | 62.5% |
| 24 | GPT-3 (davinci-002) | 43.9% |