SWE-Bench Pro Benchmark Leaderboard
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
Leaderboard
Top 21 models on SWE-Bench Pro, ranked by score (scores from public evaluations). Models with equal scores share a rank.
- 1. Claude Mythos Preview: 77.8%
- 2. Claude Opus 4.7: 64.3%
- 3. GPT-5.5: 58.6%
- 3. Kimi K2.6: 58.6%
- 5. GLM-5.1: 58.4%
- 6. GPT-5.4: 57.7%
- 7. GPT-5.3 Codex: 56.8%
- 8. Qwen3.6 Plus: 56.6%
- 9. GPT-5.2 Codex: 56.4%
- 10. MiniMax M2.7: 56.2%
- 11. MiniMax M2.5: 55.4%
- 11. DeepSeek-V4-Pro-Max: 55.4%
- 13. Gemini 3.5 Flash: 55.1%
- 14. GPT-5.4 mini: 54.4%
- 15. Gemini 3.1 Pro: 54.2%
- 16. Qwen3.6-27B: 53.5%
- 17. DeepSeek-V4-Flash-Max: 52.6%
- 18. GPT-5.4 nano: 52.4%
- 18. Muse Spark: 52.4%
- 20. Kimi K2.5: 50.7%
- 21. Qwen3.6-35B-A3B: 49.5%