Best Value

Models ranked by MMLU benchmark performance per dollar of input cost. A higher score means more capability for less money.

Methodology: Computed as benchmark_mmlu / input_price_per_1m, i.e. the model's MMLU score divided by its input price in USD per 1M tokens. Only models with both values are included. Higher is better.
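As a minimal sketch of this computation (hypothetical Python; the model entries and prices below are illustrative placeholders, not the live leaderboard data):

```python
# Sketch of the value ranking, assuming each model record carries an MMLU
# score (0-100) and an input price in USD per 1M tokens. Entries here are
# illustrative examples only.
models = [
    {"name": "Llama 3.3 70B", "mmlu": 86.0, "input_price_per_1m": 0.10},
    {"name": "GPT-4o mini",   "mmlu": 82.0, "input_price_per_1m": 0.15},
    {"name": "GPT-4",         "mmlu": 86.4, "input_price_per_1m": 30.00},
]

def value_score(model: dict) -> float | None:
    """Return benchmark_mmlu / input_price_per_1m, or None if a value is missing."""
    mmlu = model.get("mmlu")
    price = model.get("input_price_per_1m")
    if mmlu is None or price is None or price <= 0:
        return None  # models lacking either value are excluded from the ranking
    return mmlu / price

# Keep only models with both values, then rank by value score, highest first.
ranked = sorted(
    (m for m in models if value_score(m) is not None),
    key=value_score,
    reverse=True,
)
for rank, m in enumerate(ranked, start=1):
    print(f"{rank:>2}  {m['name']:<15} {value_score(m):.1f} score/$1")
```

Running this reproduces the figures in the table below for the three sample models (860.0, 546.7, and 2.9 score/$1 respectively).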

| # | Model | Score/$1 |
|---:|---|---:|
| 1 | Llama 3.3 70B | 860.0 |
| 2 | GPT-4o mini | 546.7 |
| 3 | DeepSeek V3 | 327.8 |
| 4 | Claude 3 Haiku | 300.8 |
| 5 | Mistral 7B | 250.0 |
| 6 | Qwen 2.5 72B | 215.2 |
| 7 | DeepSeek R1 | 165.1 |
| 8 | Gemini 1.0 Pro | 158.2 |
| 9 | GPT-3.5 Turbo | 140.0 |
| 10 | Mixtral 8x7B | 100.9 |
| 11 | Llama 3 70B | 91.1 |
| 12 | Llama 2 70B | 76.6 |
| 13 | Gemini 1.5 Pro | 68.7 |
| 14 | GPT-4o | 35.5 |
| 15 | Claude 3.5 Sonnet | 29.6 |
| 16 | Mistral Large 2 | 28.0 |
| 17 | Claude 3 Sonnet | 26.3 |
| 18 | Command R+ | 25.2 |
| 19 | Llama 3.1 405B | 17.7 |
| 20 | GPT-4 Turbo | 8.6 |
| 21 | o1 | 6.2 |
| 22 | Claude 3 Opus | 5.8 |
| 23 | GPT-4 | 2.9 |
| 24 | GPT-3 (davinci-002) | 0.7 |