Best Value
Models ranked by benchmark performance per dollar of input cost. Higher scores mean more capability for less money.
Methodology: computed as `benchmark_mmlu / input_price_per_1m`, i.e. the model's MMLU score divided by its input price in USD per 1M tokens. Only models with both values are included; higher is better.
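
The metric is easy to reproduce. Here is a minimal Python sketch of the computation; the three score/price pairs are illustrative figures consistent with the rankings below, not values taken from the underlying dataset:

```python
# Sketch of the value metric: MMLU score per $1 of input cost.
# The entries below are illustrative (assumed) score/price pairs chosen to
# match the table; the real ranking is computed over the full dataset.

models = [
    # (name, mmlu_score, input_price_usd_per_1m_tokens)
    ("Llama 3.3 70B", 86.0, 0.10),
    ("GPT-4o mini", 82.0, 0.15),
    ("Claude 3 Haiku", 75.2, 0.25),
]

def value_score(mmlu: float, input_price_per_1m: float) -> float:
    """MMLU points per dollar of input cost (price in USD per 1M tokens)."""
    return mmlu / input_price_per_1m

# Rank models from best to worst value.
ranked = sorted(models, key=lambda m: value_score(m[1], m[2]), reverse=True)
for rank, (name, mmlu, price) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {value_score(mmlu, price):.1f} score/$1")
```

Running this prints `1. Llama 3.3 70B: 860.0 score/$1`, matching the top row of the table.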
| # | Model | Provider | MMLU score per $1 of input |
|---|---|---|---|
| 1 | Llama 3.3 70B | Meta | 860.0 score/$1 |
| 2 | GPT-4o mini | OpenAI | 546.7 score/$1 |
| 3 | DeepSeek V3 | DeepSeek | 327.8 score/$1 |
| 4 | Claude 3 Haiku | Anthropic | 300.8 score/$1 |
| 5 | Mistral 7B | Mistral AI | 250.0 score/$1 |
| 6 | Qwen 2.5 72B | Alibaba Cloud (Qwen) | 215.2 score/$1 |
| 7 | DeepSeek R1 | DeepSeek | 165.1 score/$1 |
| 8 | Gemini 1.0 Pro | Google DeepMind | 158.2 score/$1 |
| 9 | GPT-3.5 Turbo | OpenAI | 140.0 score/$1 |
| 10 | Mixtral 8x7B | Mistral AI | 100.9 score/$1 |
| 11 | Llama 3 70B | Meta | 91.1 score/$1 |
| 12 | Llama 2 70B | Meta | 76.6 score/$1 |
| 13 | Gemini 1.5 Pro | Google DeepMind | 68.7 score/$1 |
| 14 | GPT-4o | OpenAI | 35.5 score/$1 |
| 15 | Claude 3.5 Sonnet | Anthropic | 29.6 score/$1 |
| 16 | Mistral Large 2 | Mistral AI | 28.0 score/$1 |
| 17 | Claude 3 Sonnet | Anthropic | 26.3 score/$1 |
| 18 | Command R+ | Cohere | 25.2 score/$1 |
| 19 | Llama 3.1 405B | Meta | 17.7 score/$1 |
| 20 | GPT-4 Turbo | OpenAI | 8.6 score/$1 |
| 21 | o1 | OpenAI | 6.2 score/$1 |
| 22 | Claude 3 Opus | Anthropic | 5.8 score/$1 |
| 23 | GPT-4 | OpenAI | 2.9 score/$1 |
| 24 | GPT-3 (davinci-002) | OpenAI | 0.7 score/$1 |