MMLU
Massive Multitask Language Understanding benchmark score. Tests knowledge across 57 academic subjects. Scores above 90 indicate frontier-level capability. Not directly comparable across model architectures.
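The score described above can be illustrated with a short sketch: an MMLU-style result is the fraction of multiple-choice questions answered correctly, commonly macro-averaged across subjects. The subject names and counts below are illustrative placeholders, not real benchmark data, and macro-averaging is an assumption about the aggregation.

```python
# Hypothetical per-subject results: (correct, total). Illustrative only.
subject_results = {
    "abstract_algebra": (71, 100),
    "college_physics": (88, 102),
    "world_history": (94, 100),
}

def mmlu_score(results):
    """Macro-average: mean of per-subject accuracies, as a percentage."""
    accs = [correct / total for correct, total in results.values()]
    return 100 * sum(accs) / len(accs)

score = mmlu_score(subject_results)
print(f"{score:.1f}")  # prints 83.8 for the sample data above
```

Micro-averaging (pooling all questions before dividing) would weight large subjects more heavily; which aggregation a given leaderboard uses affects comparability.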
33 entities
| Entity | MMLU | Integrations |
|---|---|---|
| Claude 3 Sonnet | 79 | 48 |
| o1 | 92.3 | 48 |
| GPT-4 Turbo | 86.4 | 47 |
| Kimi K2.5 | 85.1 | 44 |
| Mixtral 8x7B | 70.6 | 43 |
| GPT-4o | 88.7 | 37 |
| Gemini 1.0 Pro | 79.1 | 34 |
| Gemini 1.5 Flash | 78.9 | 34 |
| Mistral 7B | 62.5 | 33 |
| Llama 2 70B | 68.9 | 32 |
| DeepSeek R1 | 90.8 | 32 |
| Mistral Large 2 | 84 | 30 |
| Qwen 2.5 72B | 86.1 | 30 |
| Llama 3 70B | 82 | 29 |
| Claude 3.5 Sonnet | 88.7 | 29 |
| Llama 3.1 405B | 88.6 | 28 |
| Grok 2 | 87.5 | 28 |
| GPT-4 | 86.4 | 27 |
| GPT-3.5 Turbo | 70 | 26 |
| Gemini 2.0 Flash | 87.9 | 24 |
| Qwen3 235B A22B | 88.9 | 24 |
| Claude Sonnet 4.6 | 88.3 | 21 |
| DeepSeek V3.2 | 88.5 | 21 |
| Command R+ | 75.7 | 21 |
| DeepSeek V3 | 88.5 | 18 |
| Llama 3.3 70B | 86 | 17 |
| Claude 3.5 Haiku | 75.2 | 17 |
| GPT-4o mini | 82 | 17 |
| ERNIE 4.0 | 81.5 | 17 |
| Claude 3 Haiku | 75.2 | 16 |
| GPT-3 (davinci-002) | 43.9 | 16 |
| Claude 3 Opus | 86.8 | 15 |
| Gemini 1.5 Pro | 85.9 | 14 |
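The page's filter-by-value behaviour can be sketched in a few lines: keep rows at or above a chosen MMLU threshold, then sort by integration count descending, matching the table's ordering. The rows and field layout here are a small sample copied from the table above; the threshold of 90 is an arbitrary example.

```python
# A few (entity, mmlu, integrations) rows sampled from the table above.
rows = [
    ("o1", 92.3, 48),
    ("GPT-4o", 88.7, 37),
    ("DeepSeek R1", 90.8, 32),
    ("Mistral 7B", 62.5, 33),
]

# Filter to MMLU >= 90 (an example threshold), then sort by
# integrations in descending order, as the table is ordered.
frontier = sorted(
    (r for r in rows if r[1] >= 90),
    key=lambda r: r[2],
    reverse=True,
)
for name, mmlu, integrations in frontier:
    print(name, mmlu, integrations)
# o1 92.3 48
# DeepSeek R1 90.8 32
```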