Best AI for Coding
Build, debug, and ship software
50 models · 6 benchmarks · Ranked by normalized public benchmark scores (SWE-Bench, HumanEval, and related evaluations). Arena live-vote rankings require llm-stats live data, which is not yet included in our static export.
- 1. MiniCPM-SALA · Score 95.1 · — input · 256K context · HumanEval 95.1%
- 2. Kimi K2 0905 · Score 94.5 · — input · 256K context · HumanEval 94.5%
- 3. Claude 3.5 Sonnet · Score 93.7 · — input · — context · HumanEval 93.7%
- 4. Qwen2.5-Coder 32B Instruct · Score 92.7 · — input · 128K context · HumanEval 92.7%
- 5. o1-mini · Score 92.4 · — input · — context · HumanEval 92.4%
- 6. Sarvam-30B · Score 92.1 · — input · — context · HumanEval 92.1%
- 7. Claude 3.5 Sonnet · Score 92.0 · — input · — context · HumanEval 92.0%
- 8. Mistral Large 2 · Score 92.0 · — input · 128K context · HumanEval 92.0%
- 9. Qwen2.5 VL 32B Instruct · Score 91.5 · — input · — context · HumanEval 91.5%
- 10. GPT-4o · Score 90.2 · — input · — context · HumanEval 90.2%
- 11. Granite 3.3 8B Base · Score 89.7 · — input · 128K context · HumanEval 89.7%
- 12. Granite 3.3 8B Instruct · Score 89.7 · — input · 128K context · HumanEval 89.7%
- 13. Gemini Diffusion · Score 89.6 · — input · — context · HumanEval 89.6%
- 14. DeepSeek-V2.5 · Score 89.0 · — input · — context · HumanEval 89.0%
- 15. Llama 3.1 405B Instruct · Score 89.0 · — input · 128K context · HumanEval 89.0%
- 16. Nova Pro · Score 89.0 · — input · — context · HumanEval 89.0%
- 17. Mistral Small 3.1 24B Instruct · Score 88.4 · — input · 128K context · HumanEval 88.4%
- 18. Grok-2 · Score 88.4 · — input · — context · HumanEval 88.4%
- 19. Llama 3.3 70B Instruct · Score 88.4 · — input · 128K context · HumanEval 88.4%
- 20. Qwen2.5 32B Instruct · Score 88.4 · — input · 8K context · HumanEval 88.4%
- 21. Qwen2.5-Coder 7B Instruct · Score 88.4 · — input · 128K context · HumanEval 88.4%
- 22. Claude 3.5 Haiku · Score 88.1 · — input · — context · HumanEval 88.1%
- 23. o1 · Score 88.1 · — input · — context · HumanEval 88.1%
- 24. GPT-4.5 · Score 88.0 · — input · 128K context · HumanEval 88.0%
- 25. Gemma 3 27B · Score 87.8 · — input · 128K context · HumanEval 87.8%
- 26. GPT-4o mini · Score 87.2 · — input · 128K context · HumanEval 87.2%
- 27. GPT-4 Turbo · Score 87.1 · — input · — context · HumanEval 87.1%
- 28. Qwen2 72B Instruct · Score 86.0 · — input · 131K context · HumanEval 86.0%
- 29. Grok-2 mini · Score 85.7 · — input · — context · HumanEval 85.7%
- 30. Gemma 3 12B · Score 85.4 · — input · 128K context · HumanEval 85.4%
- 31. Nova Lite · Score 85.4 · — input · — context · HumanEval 85.4%
- 32. Claude Mythos Preview · Score 85.3 · — input · 128K context · SWE 93.9% · SWE Pro 77.8% · Terminal 82.0% · SWE-i18n 87.3%
- 33. Claude 3 Opus · Score 84.9 · — input · — context · HumanEval 84.9%
- 34. Mistral Small 3 24B Instruct · Score 84.8 · — input · — context · HumanEval 84.8%
- 35. Qwen2.5 7B Instruct · Score 84.8 · — input · 8K context · HumanEval 84.8%
- 36. GPT-5 · Score 84.2 · — input · 128K context · SWE 74.9% · HumanEval 93.4%
- 37. Gemini 1.5 Pro · Score 84.1 · — input · — context · HumanEval 84.1%
- 38. Qwen2.5 14B Instruct · Score 83.5 · — input · 128K context · HumanEval 83.5%
- 39. Phi 4 · Score 82.6 · — input · — context · HumanEval 82.6%
- 40. IBM Granite 4.0 Tiny Preview · Score 82.4 · — input · 128K context · HumanEval 82.4%
- 41. Codestral-22B · Score 81.1 · — input · — context · HumanEval 81.1%
- 42. Nova Micro · Score 81.1 · — input · — context · HumanEval 81.1%
- 43. Llama 3.1 70B Instruct · Score 80.5 · — input · — context · HumanEval 80.5%
- 44. Grok-3 Mini · Score 80.4 · — input · — context · LCB 80.4%
- 45. GPT-5.2 · Score 80.0 · $1.75/1M input · 256K context · SWE 80.0%
- 46. Grok 4 Fast · Score 80.0 · — input · — context · LCB 80.0%
- 47. Qwen2 7B Instruct · Score 79.9 · — input · 131K context · HumanEval 79.9%
- 48. Grok-3 · Score 79.4 · — input · — context · LCB 79.4%
- 49. Grok-4 Heavy · Score 79.4 · — input · — context · LCB 79.4%
- 50. LongCat-Flash-Thinking · Score 79.4 · — input · — context · LCB 79.4%
How this table works
Each column links to a public benchmark leaderboard. The Score column is the average of a model's normalized benchmark results in this category, on a 0–100 scale. For example, GPT-5's score of 84.2 is the mean of its SWE (74.9%) and HumanEval (93.4%) results, 84.15, shown rounded to one decimal place. Models ranked higher appear on more coding-related evaluations with stronger scores. This is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
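The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual pipeline: the function name `category_score` and the `models` dictionary are hypothetical, and it assumes the percentages shown in the table are already the normalized 0–100 values, since the page does not spell out its normalization step. The benchmark numbers are copied from three rows of the table.

```python
from statistics import mean


def category_score(results: dict[str, float]) -> float:
    """Average a model's benchmark results.

    Assumes each value is already normalized to a 0-100 scale
    (here: the percentages exactly as displayed in the table).
    """
    if not results:
        raise ValueError("model has no benchmark results in this category")
    return mean(results.values())


# Hypothetical input structure; values copied from the table rows.
models = {
    "GPT-5.2": {"SWE": 80.0},
    "GPT-5": {"SWE": 74.9, "HumanEval": 93.4},
    "Claude Mythos Preview": {
        "SWE": 93.9,
        "SWE Pro": 77.8,
        "Terminal": 82.0,
        "SWE-i18n": 87.3,
    },
}

# Score every model, then rank from highest to lowest score.
scores = {name: category_score(results) for name, results in models.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(f"{position}. {name}: {scores[name]:.2f}")
```

Under these assumptions the three models come out in the same relative order as in the table. Note that a plain average does not reward coverage by itself: a model with a single strong result can outrank a model with several mixed ones.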
Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect Supabase or an llm-stats API refresh.