SWE-bench Leaderboard

Every major AI model ranked by SWE-bench — the benchmark that measures whether a model can resolve real GitHub issues, not toy problems. Pricing and context columns are included because a leaderboard you can't act on is just trivia.

Scores from provider publications and public leaderboards · Pricing verified daily

Current leader

Claude Fable 5 · 93.4% · Anthropic · $10/1M input · 1M tokens context

Best value in the top 10: GPT-5.4, with 74.9% at $0.20/1M input tokens.

#	Model	Provider	SWE-bench	Input $/1M	Context
1	Claude Fable 5	Anthropic	93.4%	$10	1M tokens
2	Claude Mythos 5	Anthropic	93.4%	$10	1M tokens
3	Claude Opus 4.8	Anthropic	88.6%	$10	1M tokens
4	Claude Opus 4.7	Anthropic	80%	$5	1M tokens
5	Claude Sonnet 4.6	Anthropic	79.6%	$3	1M tokens
6	GPT-5.4	OpenAI	74.9%	$0.20	272k tokens
7	Claude Opus 4.6	Anthropic	72.5%	$15	1M tokens
8	Gemini 3.1 Pro	Google	63.2%	$2	2M tokens
9	Grok 4	xAI	54%	$1.25	2M tokens
10	DeepSeek R1	DeepSeek	49.2%	$0.55	128k tokens
11	GPT-4o	OpenAI	46%	$0.15	128k tokens
12	Claude Haiku 4	Anthropic	43%	$0.80	200k tokens
13	DeepSeek V3	DeepSeek	42%	$0.27	128k tokens
14	Gemini 3.1 Flash	Google	35%	$0.25	1M tokens
15	Llama 4 Maverick	Meta	32%	$0.15	256k tokens
16	Mistral Large 2	Mistral	28%	$2	128k tokens
17	GPT-4o Mini	OpenAI	23.6%	$0.15	128k tokens


FAQ

What is SWE-bench?

SWE-bench is a benchmark that tests whether an AI model can resolve real GitHub issues from real open-source repositories — write a patch, pass the tests. Unlike puzzle-style benchmarks, it measures the messy, multi-file work software engineers actually do, which is why it has become the standard for comparing coding models.

Which AI model has the highest SWE-bench score?

Claude Fable 5 (Anthropic) is currently listed first at 93.4%, tied with Claude Mythos 5, which also scores 93.4%.

What is a good SWE-bench score?

Anything above 70% is frontier-class in 2026: at that level the model resolves most of the benchmark's real GitHub issues without human help. The current leader, Claude Fable 5, is at 93.4%. Two years ago the best models scored under 20%, which shows how quickly scores on this benchmark have risen.

Does the highest SWE-bench score mean the best coding model for me?

Not always. Score-per-dollar matters for daily work: GPT-5.4 delivers 74.9% at $0.20/1M input tokens, which is the best value in the top 10. Reserve the outright leader for the hardest tasks and route volume work to the value pick.
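
The score-per-dollar comparison and the routing advice above can be sketched in a few lines of Python. This is an illustrative sketch, not the method this page uses: it assumes "value" means SWE-bench points divided by the input price per 1M tokens, it ignores output-token pricing, and the function names (points_per_dollar, best_value, leader, pick_model) are hypothetical. The rows are the top 10 from the table.

```python
# Top-10 rows copied from the leaderboard table:
# (model, SWE-bench score in %, input price in $ per 1M tokens).
TOP_10 = [
    ("Claude Fable 5", 93.4, 10.00),
    ("Claude Mythos 5", 93.4, 10.00),
    ("Claude Opus 4.8", 88.6, 10.00),
    ("Claude Opus 4.7", 80.0, 5.00),
    ("Claude Sonnet 4.6", 79.6, 3.00),
    ("GPT-5.4", 74.9, 0.20),
    ("Claude Opus 4.6", 72.5, 15.00),
    ("Gemini 3.1 Pro", 63.2, 2.00),
    ("Grok 4", 54.0, 1.25),
    ("DeepSeek R1", 49.2, 0.55),
]


def points_per_dollar(row):
    """Assumed value metric: SWE-bench points per input dollar (per 1M tokens)."""
    _, score, price = row
    return score / price


def best_value(rows):
    """Name of the model with the highest assumed value metric."""
    return max(rows, key=points_per_dollar)[0]


def leader(rows):
    """Name of the top-scoring model; on a tie, the first row listed wins."""
    return max(rows, key=lambda row: row[1])[0]


def pick_model(is_hard_task, rows=TOP_10):
    """Hypothetical routing rule: leader for hard tasks, value pick otherwise."""
    return leader(rows) if is_hard_task else best_value(rows)


if __name__ == "__main__":
    # Print the top 10 ordered by the assumed value metric.
    for row in sorted(TOP_10, key=points_per_dollar, reverse=True):
        print(f"{row[0]:<18} {points_per_dollar(row):7.1f} points per $")
    print("Hard task   ->", pick_model(True))
    print("Volume work ->", pick_model(False))
```

Under that assumed metric GPT-5.4 comes out well ahead of the rest of the top 10, which matches the value pick named above. A real router would also need to weigh output-token cost, latency, and context length, none of which this sketch models.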