SWE-Bench Verified Benchmark Leaderboard

SWE-Bench Verified is a human-validated subset of 500 software engineering problems drawn from real GitHub issues. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.

Leaderboard

Top 50 models on SWE-Bench Verified, ranked by score (scores from public evaluations).

  1. Claude Mythos Preview: 93.9%
  2. Claude Opus 4.7: 87.6%
  3. Claude Opus 4.5: 80.9%
  4. Claude Opus 4.6: 80.8%
  5. Gemini 3.1 Pro: 80.6% (tied for 5th)
  6. DeepSeek-V4-Pro-Max: 80.6% (tied for 5th)
  7. Kimi K2.6: 80.2% (tied for 7th)
  8. MiniMax M2.5: 80.2% (tied for 7th)
  9. GPT-5.2: 80.0%
  10. Claude Sonnet 4.6: 79.6%
  11. DeepSeek-V4-Flash-Max: 79.0%
  12. Qwen3.6 Plus: 78.8%
  13. MiMo-V2-Pro: 78.0% (tied for 13th)
  14. Gemini 3 Flash: 78.0% (tied for 13th)
  15. GLM-5: 77.8%
  16. Mistral Medium 3.5: 77.6%
  17. Muse Spark: 77.4%
  18. Qwen3.6-27B: 77.2%
  19. Kimi K2.5: 76.8%
  20. Seed 2.0 Pro: 76.5%
  21. Qwen3.5-397B-A17B: 76.4%
  22. GPT-5.1 Instant: 76.3% (tied for 22nd)
  23. GPT-5.1 Thinking: 76.3% (tied for 22nd)
  24. GPT-5.1: 76.3% (tied for 22nd)
  25. Gemini 3 Pro: 76.2%
  26. GPT-5: 74.9%
  27. MiMo-V2-Omni: 74.8%
  28. Claude Opus 4.1: 74.5% (tied for 28th)
  29. GPT-5 Codex: 74.5% (tied for 28th)
  30. Step-3.5-Flash: 74.4%
  31. GLM-4.7: 73.8%
  32. GPT-5.1 Codex: 73.7%
  33. Seed 2.0 Lite: 73.5%
  34. Qwen3.6-35B-A3B: 73.4% (tied for 34th)
  35. MiMo-V2-Flash: 73.4% (tied for 34th)
  36. Claude Haiku 4.5: 73.3%
  37. DeepSeek-V3.2-Speciale: 73.1% (tied for 37th)
  38. DeepSeek-V3.2 (Thinking): 73.1% (tied for 37th)
  39. DeepSeek-V3.2: 73.1% (tied for 37th)
  40. Claude Sonnet 4: 72.7%
  41. Claude Opus 4: 72.5%
  42. Qwen3.5-27B: 72.4%
  43. Qwen3.5-122B-A10B: 72.0%
  44. Kimi K2-Thinking-0905: 71.3%
  45. Grok Code Fast 1: 70.8%
  46. Claude 3.7 Sonnet: 70.3%
  47. LongCat-Flash-Thinking-2601: 70.0%
  48. Qwen3-Coder 480B A35B Instruct: 69.6% (tied for 48th)
  49. Qwen3 Max: 69.6% (tied for 48th)
  50. MiniMax M2: 69.4%
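
Models with equal scores share a rank in the list above, and the rank after a tie skips ahead (for example, two models at 80.6% are both ranked 5 and the next model is ranked 7). The page does not say how ranks are computed; the pattern is consistent with standard competition ranking, which could be sketched as follows. The function name `competition_ranks` is hypothetical and used only for illustration.

```python
def competition_ranks(scores):
    """Return 1-based ranks for scores already sorted in descending order.

    Tied scores share the rank of the first tied entry, and the next
    distinct score takes its position in the list as its rank.
    This is an assumed reconstruction, not the leaderboard's own code.
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank equals the 1-based position.
            ranks.append(position)
    return ranks


# Scores of the first nine entries in the list above.
top_nine = [93.9, 87.6, 80.9, 80.8, 80.6, 80.6, 80.2, 80.2, 80.0]
print(competition_ranks(top_nine))
```

Applied to the first nine scores, this yields ranks 1, 2, 3, 4, 5, 5, 7, 7, 9, matching the ranks shown for those entries.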

Models tracked

Models with swe-bench-verified in their evaluation profile.
