SWE-bench Multilingual Benchmark Leaderboard

A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, drawn from 2,456 candidates and annotated by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.

Leaderboard

Top 27 models on the SWE-bench Multilingual Benchmark leaderboard (scores taken from public evaluations).

Rank	Model	Score	Lab
1	Claude Mythos Preview	87.3%	—
2	Claude Opus 4.6	77.8%	—
3	Kimi K2.6	76.7%	—
4	MiniMax M2.7	76.5%	—
5	DeepSeek-V4-Pro-Max	76.2%	—
6	Qwen3.6 Plus	73.8%	—
7	DeepSeek-V4-Flash-Max	73.3%	—
8	Kimi K2.5	73.0%	—
9	MiniMax M2.1	72.5%	—
10	MiMo-V2-Pro	71.7%	—
10	MiMo-V2-Flash	71.7%	—
12	Qwen3.6-27B	71.3%	—
13	DeepSeek-V3.2 (Thinking)	70.2%	—
13	DeepSeek-V3.2	70.2%	—
15	Qwen3.5-397B-A17B	69.3%	—
16	Qwen3.6-35B-A3B	67.2%	—
17	GLM-4.7	66.7%	—
18	Kimi K2-Thinking-0905	61.1%	—
19	DeepSeek-V3.2-Exp	57.9%	—
20	MiniMax M2	56.5%	—
21	Qwen3-Coder 480B A35B Instruct	54.7%	—
22	DeepSeek-V3.1	54.5%	—
23	Kimi K2-Instruct-0905	47.3%	—
23	Kimi K2 Instruct	47.3%	—
25	Nemotron 3 Super (120B A12B)	45.8%	—
26	LongCat-Flash-Lite	38.1%	—
27	DeepSeek-R1-0528	30.5%	—
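
Tied scores in the table share a rank, and the next rank is skipped (10, 10, 12 and 13, 13, 15), which matches standard competition ranking. The snippet below is a minimal sketch of that scheme, not the leaderboard's actual code: the function name `competition_rank` and its `first_rank` parameter are hypothetical, and the four rows are copied from the table above.

```python
def competition_rank(scores, first_rank=1):
    """Assign standard competition ranks to (name, score) pairs.

    Tied scores share the rank of the first tied entry, and the ranks
    that the tie occupies are skipped afterwards (e.g. 10, 10, 12).
    `first_rank` is a hypothetical convenience for ranking a slice of
    a larger table.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=first_rank):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position       # otherwise the rank is the row position
        ranked.append((rank, name, score))
    return ranked


# Rows 9 through 12 of the table above.
rows = [
    ("MiniMax M2.1", 72.5),
    ("MiMo-V2-Pro", 71.7),
    ("MiMo-V2-Flash", 71.7),
    ("Qwen3.6-27B", 71.3),
]

for rank, name, score in competition_rank(rows, first_rank=9):
    print(f"{rank}\t{name}\t{score:.1f}%")
```

Comparing scores for exact equality works here because the published values are rounded to one decimal place; unrounded scores would need to be rounded before checking for ties.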

