Highest Benchmark
Models are ranked by MMLU (Massive Multitask Language Understanding) score. MMLU tests knowledge and reasoning across 57 academic subjects; scores above 90% exceed average human expert performance.
Methodology: Models are sorted by benchmark_mmlu in descending order. The human expert average is approximately 89.8%. Models without a published MMLU score are omitted from the ranking.
| # | Model | Provider | MMLU Score |
|---|---|---|---|
| 1 | o1 | OpenAI | 92.3% |
| 2 | DeepSeek R1 | DeepSeek | 90.8% |
| 3 | GPT-4o | OpenAI | 88.7% |
| 4 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 5 | Llama 3.1 405B | Meta | 88.6% |
| 6 | DeepSeek V3 | DeepSeek | 88.5% |
| 7 | Claude 3 Opus | Anthropic | 86.8% |
| 8 | GPT-4 | OpenAI | 86.4% |
| 9 | GPT-4 Turbo | OpenAI | 86.4% |
| 10 | Qwen 2.5 72B | Alibaba Cloud (Qwen) | 86.1% |
| 11 | Llama 3.3 70B | Meta | 86.0% |
| 12 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 13 | Mistral Large 2 | Mistral AI | 84.0% |
| 14 | GPT-4o mini | OpenAI | 82.0% |
| 15 | Llama 3 70B | Meta | 82.0% |
| 16 | Gemini 1.0 Pro | Google DeepMind | 79.1% |
| 17 | Claude 3 Sonnet | Anthropic | 79.0% |
| 18 | Command R+ | Cohere | 75.7% |
| 19 | Claude 3 Haiku | Anthropic | 75.2% |
| 20 | Mixtral 8x7B | Mistral AI | 70.6% |
| 21 | GPT-3.5 Turbo | OpenAI | 70.0% |
| 22 | Llama 2 70B | Meta | 68.9% |
| 23 | Mistral 7B | Mistral AI | 62.5% |
| 24 | GPT-3 (davinci-002) | OpenAI | 43.9% |
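The sorting described in the methodology note can be sketched as follows. This is a minimal illustration, not the actual pipeline: the record structure is an assumption (only the `benchmark_mmlu` field name comes from the methodology note), and "Example-LM" is a hypothetical model standing in for one with no published score.

```python
# Hypothetical model records; only "benchmark_mmlu" as a field name
# is taken from the methodology note above. Scores are from the table.
models = [
    {"name": "GPT-4o", "provider": "OpenAI", "benchmark_mmlu": 88.7},
    {"name": "o1", "provider": "OpenAI", "benchmark_mmlu": 92.3},
    {"name": "Example-LM", "provider": "Acme", "benchmark_mmlu": None},  # no published score
]

# Drop models without a published MMLU score, then sort descending,
# as the methodology note describes.
ranked = sorted(
    (m for m in models if m["benchmark_mmlu"] is not None),
    key=lambda m: m["benchmark_mmlu"],
    reverse=True,
)

for rank, m in enumerate(ranked, start=1):
    print(f'{rank}. {m["name"]} ({m["provider"]}): {m["benchmark_mmlu"]:.1f}% MMLU')
```

Filtering out unscored models before sorting (rather than treating a missing score as zero) matches the note that such models are simply absent from the table.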