SWE-Bench Pro Benchmark Leaderboard

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

Leaderboard

The top 21 models on SWE-Bench Pro, ranked by score (scores from public evaluations). Models with identical scores share a rank.

  1. Claude Mythos Preview: 77.8%
  2. Claude Opus 4.7: 64.3%
  3. GPT-5.5: 58.6%
  3. Kimi K2.6: 58.6%
  5. GLM-5.1: 58.4%
  6. GPT-5.4: 57.7%
  7. GPT-5.3 Codex: 56.8%
  8. Qwen3.6 Plus: 56.6%
  9. GPT-5.2 Codex: 56.4%
  10. MiniMax M2.7: 56.2%
  11. MiniMax M2.5: 55.4%
  11. DeepSeek-V4-Pro-Max: 55.4%
  13. Gemini 3.5 Flash: 55.1%
  14. GPT-5.4 mini: 54.4%
  15. Gemini 3.1 Pro: 54.2%
  16. Qwen3.6-27B: 53.5%
  17. DeepSeek-V4-Flash-Max: 52.6%
  18. GPT-5.4 nano: 52.4%
  18. Muse Spark: 52.4%
  20. Kimi K2.5: 50.7%
  21. Qwen3.6-35B-A3B: 49.5%
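
The rank numbers in the list (3, 3, 5 and 11, 11, 13) look consistent with standard competition ranking, where tied entries share a rank and the next rank is skipped. The page does not state its tie rule, so this is an assumption. A minimal Python sketch of that scheme, using the first five entries above and a hypothetical helper named `competition_rank`:

```python
from typing import List, Tuple


def competition_rank(scores: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign ranks so that equal scores share a rank and the next rank is skipped.

    Hypothetical helper for illustration; not taken from the leaderboard's own code.
    """
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous entry's rank
        else:
            rank = position  # no tie: rank equals the 1-based position
        ranked.append((rank, name, score))
    return ranked


# First five entries from the list above (scores in percent).
top = [
    ("Claude Mythos Preview", 77.8),
    ("Claude Opus 4.7", 64.3),
    ("GPT-5.5", 58.6),
    ("Kimi K2.6", 58.6),
    ("GLM-5.1", 58.4),
]

for rank, name, score in competition_rank(top):
    print(f"{rank}. {name}: {score}%")
```

Under this scheme GPT-5.5 and Kimi K2.6 both receive rank 3 and GLM-5.1 receives rank 5, which matches the list.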

Models tracked

Models with swe-bench-pro in their evaluation profile.

  • No models linked yet.
