SWE-bench Multilingual Benchmark Leaderboard
A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, drawn from 2,456 candidates and annotated by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.
Leaderboard
Top 27 models on SWE-bench Multilingual Benchmark Leaderboard (scores from public evaluations).
- 1. Claude Mythos Preview: 87.3%
- 2. Claude Opus 4.6: 77.8%
- 3. Kimi K2.6: 76.7%
- 4. MiniMax M2.7: 76.5%
- 5. DeepSeek-V4-Pro-Max: 76.2%
- 6. Qwen3.6 Plus: 73.8%
- 7. DeepSeek-V4-Flash-Max: 73.3%
- 8. Kimi K2.5: 73.0%
- 9. MiniMax M2.1: 72.5%
- 10. MiMo-V2-Pro: 71.7%
- 10. MiMo-V2-Flash: 71.7%
- 12. Qwen3.6-27B: 71.3%
- 13. DeepSeek-V3.2 (Thinking): 70.2%
- 13. DeepSeek-V3.2: 70.2%
- 15. Qwen3.5-397B-A17B: 69.3%
- 16. Qwen3.6-35B-A3B: 67.2%
- 17. GLM-4.7: 66.7%
- 18. Kimi K2-Thinking-0905: 61.1%
- 19. DeepSeek-V3.2-Exp: 57.9%
- 20. MiniMax M2: 56.5%
- 21. Qwen3-Coder 480B A35B Instruct: 54.7%
- 22. DeepSeek-V3.1: 54.5%
- 23. Kimi K2-Instruct-0905: 47.3%
- 23. Kimi K2 Instruct: 47.3%
- 25. Nemotron 3 Super (120B A12B): 45.8%
- 26. LongCat-Flash-Lite: 38.1%
- 27. DeepSeek-R1-0528: 30.5%
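The tied entries in the list above (two models at rank 10 followed by rank 12, and likewise at 13 and 23) are consistent with standard competition ranking, where tied scores share a rank and the following rank is skipped. The page does not state its ranking rule, so the sketch below is an assumption; `competition_ranks` and `first_rank` are hypothetical names used only for illustration.

```python
def competition_ranks(scores, first_rank=1):
    """Rank scores that are already sorted in descending order.

    Tied scores share the rank of the first tied entry, and the next
    distinct score takes the rank its position would normally have.
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank follows from the position in the list.
            ranks.append(first_rank + i)
    return ranks


# Scores of the entries listed at ranks 9 through 12 above.
sample = [72.5, 71.7, 71.7, 71.3]
print(competition_ranks(sample, first_rank=9))
```

Under this assumed rule, the four sample scores receive ranks 9, 10, 10, and 12, matching the numbering shown in the list.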