SWE-Bench Verified Benchmark Leaderboard
SWE-Bench Verified is a subset of 500 software engineering problems drawn from real GitHub issues and validated by human annotators. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.
Leaderboard
Top 50 models on SWE-Bench Verified Benchmark Leaderboard (scores from public evaluations).
- 1. Claude Mythos Preview: 93.9%
- 2. Claude Opus 4.7: 87.6%
- 3. Claude Opus 4.5: 80.9%
- 4. Claude Opus 4.6: 80.8%
- 5. Gemini 3.1 Pro: 80.6%
- 5. DeepSeek-V4-Pro-Max: 80.6%
- 7. Kimi K2.6: 80.2%
- 7. MiniMax M2.5: 80.2%
- 9. GPT-5.2: 80.0%
- 10. Claude Sonnet 4.6: 79.6%
- 11. DeepSeek-V4-Flash-Max: 79.0%
- 12. Qwen3.6 Plus: 78.8%
- 13. MiMo-V2-Pro: 78.0%
- 13. Gemini 3 Flash: 78.0%
- 15. GLM-5: 77.8%
- 16. Mistral Medium 3.5: 77.6%
- 17. Muse Spark: 77.4%
- 18. Qwen3.6-27B: 77.2%
- 19. Kimi K2.5: 76.8%
- 20. Seed 2.0 Pro: 76.5%
- 21. Qwen3.5-397B-A17B: 76.4%
- 22. GPT-5.1 Instant: 76.3%
- 22. GPT-5.1 Thinking: 76.3%
- 22. GPT-5.1: 76.3%
- 25. Gemini 3 Pro: 76.2%
- 26. GPT-5: 74.9%
- 27. MiMo-V2-Omni: 74.8%
- 28. Claude Opus 4.1: 74.5%
- 28. GPT-5 Codex: 74.5%
- 30. Step-3.5-Flash: 74.4%
- 31. GLM-4.7: 73.8%
- 32. GPT-5.1 Codex: 73.7%
- 33. Seed 2.0 Lite: 73.5%
- 34. Qwen3.6-35B-A3B: 73.4%
- 34. MiMo-V2-Flash: 73.4%
- 36. Claude Haiku 4.5: 73.3%
- 37. DeepSeek-V3.2-Speciale: 73.1%
- 37. DeepSeek-V3.2 (Thinking): 73.1%
- 37. DeepSeek-V3.2: 73.1%
- 40. Claude Sonnet 4: 72.7%
- 41. Claude Opus 4: 72.5%
- 42. Qwen3.5-27B: 72.4%
- 43. Qwen3.5-122B-A10B: 72.0%
- 44. Kimi K2-Thinking-0905: 71.3%
- 45. Grok Code Fast 1: 70.8%
- 46. Claude 3.7 Sonnet: 70.3%
- 47. LongCat-Flash-Thinking-2601: 70.0%
- 48. Qwen3-Coder 480B A35B Instruct: 69.6%
- 48. Qwen3 Max: 69.6%
- 50. MiniMax M2: 69.4%
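The rank numbers in this list appear to follow standard competition ranking: models with equal scores share a rank, and the next distinct score skips ahead (5, 5, 7, 7, 9). The sketch below reproduces that pattern for the top nine scores. The function name `competition_rank` is a hypothetical helper for illustration, not something the leaderboard publishes, and the tie rule is inferred from the list rather than documented by the site.

```python
def competition_rank(scores):
    """Return 1-based ranks for scores already sorted in descending order.

    Assumption: ties share the rank of the first tied entry, and the
    following distinct score takes its positional rank (1, 2, 2, 4, ...).
    """
    ranks = []
    for position, score in enumerate(scores):
        if position > 0 and score == scores[position - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: the rank equals the 1-based position.
            ranks.append(position + 1)
    return ranks


# Top nine scores copied from the leaderboard above.
top_scores = [93.9, 87.6, 80.9, 80.8, 80.6, 80.6, 80.2, 80.2, 80.0]
print(competition_rank(top_scores))
```

Comparing floats with `==` is safe here only because tied scores are identical literals; scores computed from raw counts would be better compared after rounding.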
Models tracked
Models with swe-bench-verified in their evaluation profile.
- ChatGPT-4o Latest
- Claude 3.5 Haiku (Anthropic)
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude 3 Haiku
- Claude 3 Opus (Anthropic)
- Claude 3 Sonnet
- Claude Haiku 4.5 (Anthropic)
- Claude Mythos Preview (Anthropic)
- Claude Opus 4.1
- Claude Opus 4 (Anthropic)
- Claude Opus 4.5
- Claude Opus 4.6 (Anthropic)
- Claude Opus 4.7 (Anthropic)
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Sonnet 4.6
- Codestral-22B
- Command R+
- DeepSeek-V3.2 (Non-thinking) (DeepSeek)
- DeepSeek-R1-0528 (DeepSeek)
- DeepSeek R1 Distill Llama 70B (DeepSeek)
- DeepSeek R1 Distill Llama 8B (DeepSeek)
- DeepSeek R1 Distill Qwen 14B (DeepSeek)
- DeepSeek R1 Distill Qwen 32B (DeepSeek)
- DeepSeek R1 Distill Qwen 7B (DeepSeek)
- DeepSeek R1 Zero (DeepSeek)
- DeepSeek-V3.2 (Thinking) (DeepSeek)
- DeepSeek-V2.5 (DeepSeek)
- DeepSeek-V3 0324
- DeepSeek-V3.1 (DeepSeek)
- DeepSeek-V3.2-Exp (DeepSeek)
- DeepSeek-V3.2-Speciale (DeepSeek)
- DeepSeek-V3.2 (DeepSeek)
- DeepSeek-V3
- DeepSeek-V4-Flash-Max (DeepSeek)
- DeepSeek-V4-Pro-Max (DeepSeek)
- DeepSeek VL2 Small (DeepSeek)
- DeepSeek VL2 Tiny (DeepSeek)
- DeepSeek VL2 (DeepSeek)
- ERNIE 4.5
- ERNIE 5.0
- Gemini 1.0 Pro
- Gemini 1.5 Flash 8B