Best AI for Coding

Build, debug, and ship software

50 models · 6 benchmarks · Ranked by normalized public benchmark scores (SWE-Bench, HumanEval, and related evaluations). Arena live-vote rankings require llm-stats live data, which is not yet part of our static export.

#	Model	Score	Input price / 1M tokens	Context	SWE	HumanEval	LCB	SWE Pro	Terminal	SWE-i18n
1	MiniCPM-SALA	95.1	—	256K	—	95.1%	—	—	—	—
2	Kimi K2 0905 Moonshot	94.5	—	256K	—	94.5%	—	—	—	—
3	Claude 3.5 Sonnet	93.7	—	—	—	93.7%	—	—	—	—
4	Qwen2.5-Coder 32B Instruct Qwen	92.7	—	128K	—	92.7%	—	—	—	—
5	o1-mini OpenAI	92.4	—	—	—	92.4%	—	—	—	—
6	Sarvam-30B	92.1	—	—	—	92.1%	—	—	—	—
7	Claude 3.5 Sonnet	92.0	—	—	—	92.0%	—	—	—	—
8	Mistral Large 2	92.0	—	128K	—	92.0%	—	—	—	—
9	Qwen2.5 VL 32B Instruct Qwen	91.5	—	—	—	91.5%	—	—	—	—
10	GPT-4o	90.2	—	—	—	90.2%	—	—	—	—
11	Granite 3.3 8B Base	89.7	—	128K	—	89.7%	—	—	—	—
12	Granite 3.3 8B Instruct	89.7	—	128K	—	89.7%	—	—	—	—
13	Gemini Diffusion Google	89.6	—	—	—	89.6%	—	—	—	—
14	DeepSeek-V2.5 DeepSeek	89.0	—	—	—	89.0%	—	—	—	—
15	Llama 3.1 405B Instruct	89.0	—	128K	—	89.0%	—	—	—	—
16	Nova Pro Amazon	89.0	—	—	—	89.0%	—	—	—	—
17	Mistral Small 3.1 24B Instruct Mistral	88.4	—	128K	—	88.4%	—	—	—	—
18	Grok-2	88.4	—	—	—	88.4%	—	—	—	—
19	Llama 3.3 70B Instruct	88.4	—	128K	—	88.4%	—	—	—	—
20	Qwen2.5 32B Instruct Qwen	88.4	—	8K	—	88.4%	—	—	—	—
21	Qwen2.5-Coder 7B Instruct Qwen	88.4	—	128K	—	88.4%	—	—	—	—
22	Claude 3.5 Haiku Anthropic	88.1	—	—	—	88.1%	—	—	—	—
23	o1	88.1	—	—	—	88.1%	—	—	—	—
24	GPT-4.5 OpenAI	88.0	—	128K	—	88.0%	—	—	—	—
25	Gemma 3 27B Google	87.8	—	128K	—	87.8%	—	—	—	—
26	GPT-4o mini OpenAI	87.2	—	128K	—	87.2%	—	—	—	—
27	GPT-4 Turbo	87.1	—	—	—	87.1%	—	—	—	—
28	Qwen2 72B Instruct Qwen	86.0	—	131K	—	86.0%	—	—	—	—
29	Grok-2 mini	85.7	—	—	—	85.7%	—	—	—	—
30	Gemma 3 12B Google	85.4	—	128K	—	85.4%	—	—	—	—
31	Nova Lite	85.4	—	—	—	85.4%	—	—	—	—
32	Claude Mythos Preview Anthropic	85.3	—	128K	93.9%	—	—	77.8%	82.0%	87.3%
33	Claude 3 Opus Anthropic	84.9	—	—	—	84.9%	—	—	—	—
34	Mistral Small 3 24B Instruct Mistral	84.8	—	—	—	84.8%	—	—	—	—
35	Qwen2.5 7B Instruct Qwen	84.8	—	8K	—	84.8%	—	—	—	—
36	GPT-5 OpenAI	84.2	—	128K	74.9%	93.4%	—	—	—	—
37	Gemini 1.5 Pro	84.1	—	—	—	84.1%	—	—	—	—
38	Qwen2.5 14B Instruct Qwen	83.5	—	128K	—	83.5%	—	—	—	—
39	Phi 4	82.6	—	—	—	82.6%	—	—	—	—
40	IBM Granite 4.0 Tiny Preview	82.4	—	128K	—	82.4%	—	—	—	—
41	Codestral-22B	81.1	—	—	—	81.1%	—	—	—	—
42	Nova Micro	81.1	—	—	—	81.1%	—	—	—	—
43	Llama 3.1 70B Instruct	80.5	—	—	—	80.5%	—	—	—	—
44	Grok-3 Mini xAI	80.4	—	—	—	—	80.4%	—	—	—
45	GPT-5.2	80.0	$1.75/1M	256K	80.0%	—	—	—	—	—
46	Grok 4 Fast	80.0	—	—	—	—	80.0%	—	—	—
47	Qwen2 7B Instruct Qwen	79.9	—	131K	—	79.9%	—	—	—	—
48	Grok-3 xAI	79.4	—	—	—	—	79.4%	—	—	—
49	Grok-4 Heavy	79.4	—	—	—	—	79.4%	—	—	—
50	LongCat-Flash-Thinking	79.4	—	—	—	—	79.4%	—	—	—

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of the normalized benchmark results available for that model in this category (0–100 scale). Benchmarks with no published result (shown as —) are left out of the average, so a model with a single result is ranked on that result alone. This is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
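
The Score calculation can be sketched as follows. This is a minimal sketch, not the site's actual code: it assumes that benchmarks without a result (—) are skipped and that the mean is rounded half up to one decimal place. Both assumptions match the Claude Mythos Preview and GPT-5 rows in the table, but neither is confirmed. The function name `category_score` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Sequence


def category_score(results: Sequence[Optional[str]]) -> Optional[Decimal]:
    """Average the available benchmark results (hypothetical helper).

    Each entry is a percentage string such as "93.9", or None when the
    table shows a dash for that benchmark.
    """
    available = [Decimal(r) for r in results if r is not None]
    if not available:
        return None  # no benchmark data, so no score
    mean = sum(available) / len(available)
    # Assumption: one decimal place, rounded half up.
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Column order: SWE, HumanEval, LCB, SWE Pro, Terminal, SWE-i18n
mythos = category_score(["93.9", None, None, "77.8", "82.0", "87.3"])
gpt5 = category_score(["74.9", "93.4", None, None, None, None])
print(mythos, gpt5)  # 85.3 84.2
```

Decimal arithmetic is used here so that a mean such as 85.25 rounds to 85.3 rather than depending on binary floating-point representation.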

The coding arena pages on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.
