SWE-Bench Verified Benchmark Leaderboard

SWE-Bench Verified is a human-validated subset of 500 software engineering problems drawn from real GitHub issues. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.

Leaderboard

The top 50 models on SWE-Bench Verified, ranked by score (percentage of problems resolved, taken from public evaluations). Models with equal scores share a rank.

Rank	Model	Score	Lab
1	Claude Mythos Preview	93.9%	—
2	Claude Opus 4.7	87.6%	—
3	Claude Opus 4.5	80.9%	—
4	Claude Opus 4.6	80.8%	—
5	Gemini 3.1 Pro	80.6%	—
5	DeepSeek-V4-Pro-Max	80.6%	—
7	Kimi K2.6	80.2%	—
7	MiniMax M2.5	80.2%	—
9	GPT-5.2	80.0%	—
10	Claude Sonnet 4.6	79.6%	—
11	DeepSeek-V4-Flash-Max	79.0%	—
12	Qwen3.6 Plus	78.8%	—
13	MiMo-V2-Pro	78.0%	—
13	Gemini 3 Flash	78.0%	—
15	GLM-5	77.8%	—
16	Mistral Medium 3.5	77.6%	—
17	Muse Spark	77.4%	—
18	Qwen3.6-27B	77.2%	—
19	Kimi K2.5	76.8%	—
20	Seed 2.0 Pro	76.5%	—
21	Qwen3.5-397B-A17B	76.4%	—
22	GPT-5.1 Instant	76.3%	—
22	GPT-5.1 Thinking	76.3%	—
22	GPT-5.1	76.3%	—
25	Gemini 3 Pro	76.2%	—
26	GPT-5	74.9%	—
27	MiMo-V2-Omni	74.8%	—
28	Claude Opus 4.1	74.5%	—
28	GPT-5 Codex	74.5%	—
30	Step-3.5-Flash	74.4%	—
31	GLM-4.7	73.8%	—
32	GPT-5.1 Codex	73.7%	—
33	Seed 2.0 Lite	73.5%	—
34	Qwen3.6-35B-A3B	73.4%	—
34	MiMo-V2-Flash	73.4%	—
36	Claude Haiku 4.5	73.3%	—
37	DeepSeek-V3.2-Speciale	73.1%	—
37	DeepSeek-V3.2 (Thinking)	73.1%	—
37	DeepSeek-V3.2	73.1%	—
40	Claude Sonnet 4	72.7%	—
41	Claude Opus 4	72.5%	—
42	Qwen3.5-27B	72.4%	—
43	Qwen3.5-122B-A10B	72.0%	—
44	Kimi K2-Thinking-0905	71.3%	—
45	Grok Code Fast 1	70.8%	—
46	Claude 3.7 Sonnet	70.3%	—
47	LongCat-Flash-Thinking-2601	70.0%	—
48	Qwen3-Coder 480B A35B Instruct	69.6%	—
48	Qwen3 Max	69.6%	—
50	MiniMax M2	69.4%	—
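
The Rank column above appears to follow standard competition ranking: tied scores share a rank, and the next distinct score takes its position in the list (5, 5, 7 rather than 5, 5, 6). The sketch below reproduces that numbering for the first nine scores in the table. The function name competition_ranks is hypothetical and chosen for illustration; it is not the leaderboard's own code.

```python
def competition_ranks(scores):
    """Assign standard competition ranks (1, 2, 2, 4, ...) to scores.

    Assumes `scores` is already sorted from highest to lowest.
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            # Same score as the previous row: share its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank equals the 1-based position in the list.
            ranks.append(position)
    return ranks


# First nine scores from the table above, in percent.
top_scores = [93.9, 87.6, 80.9, 80.8, 80.6, 80.6, 80.2, 80.2, 80.0]
ranks = competition_ranks(top_scores)
print(ranks)  # [1, 2, 3, 4, 5, 5, 7, 7, 9]
```

Comparing the scores with == is safe here only because tied rows carry the identical published value; scores computed from raw results would need rounding to one decimal place first.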

Models tracked

Models with swe-bench-verified in their evaluation profile.
