SWE-bench Multilingual Benchmark Leaderboard

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, curated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across software ecosystems beyond Python.

Leaderboard

Top 27 models on SWE-bench Multilingual Benchmark Leaderboard (scores from public evaluations).

  1. Claude Mythos Preview: 87.3%
  2. Claude Opus 4.6: 77.8%
  3. Kimi K2.6: 76.7%
  4. MiniMax M2.7: 76.5%
  5. DeepSeek-V4-Pro-Max: 76.2%
  6. Qwen3.6 Plus: 73.8%
  7. DeepSeek-V4-Flash-Max: 73.3%
  8. Kimi K2.5: 73.0%
  9. MiniMax M2.1: 72.5%
  10. MiMo-V2-Pro: 71.7% (tied)
  10. MiMo-V2-Flash: 71.7% (tied)
  12. Qwen3.6-27B: 71.3%
  13. DeepSeek-V3.2 (Thinking): 70.2% (tied)
  13. DeepSeek-V3.2: 70.2% (tied)
  15. Qwen3.5-397B-A17B: 69.3%
  16. Qwen3.6-35B-A3B: 67.2%
  17. GLM-4.7: 66.7%
  18. Kimi K2-Thinking-0905: 61.1%
  19. DeepSeek-V3.2-Exp: 57.9%
  20. MiniMax M2: 56.5%
  21. Qwen3-Coder 480B A35B Instruct: 54.7%
  22. DeepSeek-V3.1: 54.5%
  23. Kimi K2-Instruct-0905: 47.3% (tied)
  23. Kimi K2 Instruct: 47.3% (tied)
  25. Nemotron 3 Super (120B A12B): 45.8%
  26. LongCat-Flash-Lite: 38.1%
  27. DeepSeek-R1-0528: 30.5%
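
Tied scores in the list share a rank and the following rank is skipped: the two models at 71.7% both hold rank 10, and the next model is rank 12. This matches what is usually called standard competition ranking. The sketch below shows one way such ranks could be computed from scores sorted in descending order. The function name competition_ranks is hypothetical and is not taken from the leaderboard's own tooling.

```python
def competition_ranks(scores):
    """Assign standard competition ranks to scores sorted in descending order.

    Equal adjacent scores share a rank; the rank after a tie is the
    1-based position, so ranks are skipped (e.g. 1, 2, 2, 4).
    Hypothetical helper for illustration only.
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(position)    # no tie: rank is the 1-based position
    return ranks


# Four consecutive scores from the list, including one tied pair.
example_scores = [72.5, 71.7, 71.7, 71.3]
print(competition_ranks(example_scores))
```

Applied to the full list of 27 scores, the same logic yields the shared ranks at 10, 13, and 23.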
