Highest Benchmark
Models are ranked by MMLU (Massive Multitask Language Understanding) score. MMLU tests knowledge and reasoning across 57 academic subjects; scores above 90% exceed average human expert performance.
Methodology: Models are sorted by benchmark_mmlu in descending order. The human expert average is approximately 89.8%. Models without a published MMLU score are omitted from the ranking.
| # | Model | Provider | MMLU Score |
|---|---|---|---|
| 1 | o1 | OpenAI | 92.3% |
| 2 | DeepSeek R1 | DeepSeek | 90.8% |
| 3 | GPT-4o | OpenAI | 88.7% |
| 4 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 5 | Llama 3.1 405B | Meta | 88.6% |
| 6 | DeepSeek V3 | DeepSeek | 88.5% |
| 7 | Claude 3 Opus | Anthropic | 86.8% |
| 8 | GPT-4 | OpenAI | 86.4% |
| 9 | GPT-4 Turbo | OpenAI | 86.4% |
| 10 | Qwen 2.5 72B | Alibaba Cloud (Qwen) | 86.1% |
| 11 | Llama 3.3 70B | Meta | 86.0% |
| 12 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 13 | Mistral Large 2 | Mistral AI | 84.0% |
| 14 | GPT-4o mini | OpenAI | 82.0% |
| 15 | Llama 3 70B | Meta | 82.0% |
| 16 | Gemini 1.0 Pro | Google DeepMind | 79.1% |
| 17 | Claude 3 Sonnet | Anthropic | 79.0% |
| 18 | Command R+ | Cohere | 75.7% |
| 19 | Claude 3 Haiku | Anthropic | 75.2% |
| 20 | Mixtral 8x7B | Mistral AI | 70.6% |
| 21 | GPT-3.5 Turbo | OpenAI | 70.0% |
| 22 | Llama 2 70B | Meta | 68.9% |
| 23 | Mistral 7B | Mistral AI | 62.5% |
| 24 | GPT-3 (davinci-002) | OpenAI | 43.9% |
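The sorting described in the methodology note can be sketched as follows. This is a minimal illustration, not the actual pipeline: the record structure is an assumption (only the `benchmark_mmlu` field name comes from the methodology note), and "Example-LM" is a hypothetical model standing in for one with no published score.

```python
# Hypothetical model records; only "benchmark_mmlu" as a field name
# is taken from the methodology note above. Scores are from the table.
models = [
    {"name": "GPT-4o", "provider": "OpenAI", "benchmark_mmlu": 88.7},
    {"name": "o1", "provider": "OpenAI", "benchmark_mmlu": 92.3},
    {"name": "Example-LM", "provider": "Acme", "benchmark_mmlu": None},  # no published score
]

# Drop models without a published MMLU score, then sort descending,
# as the methodology note describes.
ranked = sorted(
    (m for m in models if m["benchmark_mmlu"] is not None),
    key=lambda m: m["benchmark_mmlu"],
    reverse=True,
)

for rank, m in enumerate(ranked, start=1):
    print(f'{rank}. {m["name"]} ({m["provider"]}): {m["benchmark_mmlu"]:.1f}% MMLU')
```

Filtering out unscored models before sorting (rather than treating a missing score as zero) matches the note that such models are simply absent from the table.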