Benchmarks
Compare MMLU, HumanEval, SWE-bench, GPQA, and MATH scores, along with Arena Elo ratings, across major AI models.
Scores are reported values from provider papers and public leaderboards · Updated as new results are published
| Model | MMLU | HumanEval | SWE-bench | GPQA | MATH | Arena Elo | Details |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 (Anthropic) | 88.3% | 90.1% | 79.6% | 68% | 85.1% | 1,340 | Full review → |
| GPT-5.4 (OpenAI) | 91% | 91.5% | 74.9% | 75.4% | 91% | 1,355 | Full review → |
| Claude Opus 4.6 (Anthropic) | 90.4% | 92% | 72.5% | 74.9% | 89.3% | 1,360 | Full review → |
| Gemini 2.5 Pro (Google) | 90% | 92% | 63.2% | 84% | 91.6% | 1,380 | Full review → |
| Grok 4 (xAI) | 87.5% | 88% | 54% | 72% | 87% | 1,305 | Full review → |
| DeepSeek R1 (DeepSeek) | 90.8% | 92% | 49.2% | 71.5% | 97.3% | 1,320 | Full review → |
| GPT-4o (OpenAI) | 88.7% | 90.2% | 46% | 53.6% | 76.6% | 1,295 | Full review → |
| Claude Haiku 4 (Anthropic) | 80% | 84% | 43% | 41.5% | 71% | 1,210 | Full review → |
| DeepSeek V3 (DeepSeek) | 88.5% | 90.2% | 42% | 59.1% | 90.2% | 1,305 | Full review → |
| Gemini 2.0 Flash (Google) | 84% | 86.5% | 35% | 51% | 78.4% | 1,265 | Full review → |
| Llama 4 Maverick (Meta) | 85.5% | 87.5% | 32% | 52% | 80.5% | 1,250 | Full review → |
| Mistral Large 2 (Mistral) | 84% | 92% | 28% | 49.6% | 72% | 1,225 | Full review → |
| GPT-4o Mini (OpenAI) | 82% | 87.2% | 23.6% | 40.2% | 70.2% | 1,235 | Full review → |
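
If you want to re-sort these results yourself, the sketch below shows one way to do it in Python. It uses a few rows hand-copied from the table above; the record layout and field names (`rows`, `sort_by`, `swe_bench`, and so on) are our own illustrative labels, not a schema or export format from this site.

```python
# Minimal sketch: represent a few rows of the table above and
# sort them by any benchmark column. Field names are our own
# illustrative labels, not an official schema.

rows = [
    {"model": "Claude Sonnet 4.6", "mmlu": 88.3, "humaneval": 90.1,
     "swe_bench": 79.6, "gpqa": 68.0, "math": 85.1},
    {"model": "GPT-5.4", "mmlu": 91.0, "humaneval": 91.5,
     "swe_bench": 74.9, "gpqa": 75.4, "math": 91.0},
    {"model": "Gemini 2.5 Pro", "mmlu": 90.0, "humaneval": 92.0,
     "swe_bench": 63.2, "gpqa": 84.0, "math": 91.6},
]

def sort_by(benchmark: str, descending: bool = True) -> list[dict]:
    """Return the rows ordered by one benchmark column."""
    return sorted(rows, key=lambda r: r[benchmark], reverse=descending)

# Example: rank the sample rows by SWE-bench score.
for r in sort_by("swe_bench"):
    print(f'{r["model"]:<20} SWE-bench: {r["swe_bench"]:.1f}%')
```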
Benchmarks measure specific, testable capabilities, not overall "intelligence" or real-world usefulness. A model that tops SWE-bench may still frustrate developers with high API latency or weak context handling. Treat these scores as one signal, not the final word. Our model reviews combine benchmark results with hands-on, practical verdicts for a fuller picture.
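
To make the "one signal" point concrete, the short snippet below (scores hand-copied from the table above; the `scores` structure is our own illustration) shows that the leader flips depending on which benchmark you rank by: Claude Sonnet 4.6 tops SWE-bench in this table, while Gemini 2.5 Pro tops GPQA.

```python
# Sketch: the "best" model depends on which benchmark you rank by.
# Scores hand-copied from the table above; names are illustrative.

scores = {
    "Claude Sonnet 4.6": {"swe_bench": 79.6, "gpqa": 68.0},
    "GPT-5.4":           {"swe_bench": 74.9, "gpqa": 75.4},
    "Gemini 2.5 Pro":    {"swe_bench": 63.2, "gpqa": 84.0},
}

for benchmark in ("swe_bench", "gpqa"):
    # Pick the model with the highest score on this benchmark.
    leader = max(scores, key=lambda m: scores[m][benchmark])
    print(f"{benchmark}: {leader} ({scores[leader][benchmark]}%)")
```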