Best AI Overall
Top models across every benchmark
50 models · 5 benchmarks · Ranked by normalized public benchmark scores (SWE-Bench, HumanEval, and related evaluations). Arena live-vote rankings require llm-stats live data, which is not yet in our static export.
- 1
MiniCPM-SALA
Score 95.1 · — input · 256K context
HumanEval 95.1%
- 2
Claude Mythos Preview
Score 94.2 · — input · 128K context
SWE 93.9% · GPQA 94.6%
- 3
Claude 3.5 Sonnet
Score 93.7 · — input · — context
HumanEval 93.7%
- 4
GPT-5.5
Score 93.6 · — input · 128K context
GPQA 93.6%
- 5
GPT-5.2 Pro
Score 93.2 · $1.75/1M input · — context
GPQA 93.2%
- 6
GPT-5.4
Score 92.8 · — input · 1M context
GPQA 92.8%
- 7
Qwen2.5-Coder 32B Instruct
Score 92.7 · — input · 128K context
HumanEval 92.7%
- 8
o1-mini
Score 92.4 · — input · — context
HumanEval 92.4%
- 9
Claude 3.5 Sonnet
Score 92.0 · — input · — context
HumanEval 92.0%
- 10
Mistral Large 2
Score 92.0 · — input · 128K context
HumanEval 92.0%
- 11
Qwen2.5 VL 32B Instruct
Score 91.5 · — input · — context
HumanEval 91.5%
- 12
Claude Opus 4.7
Score 90.9 · — input · — context
SWE 87.6% · GPQA 94.2%
- 13
GPT-4o
Score 90.2 · — input · — context
HumanEval 90.2%
- 14
Granite 3.3 8B Base
Score 89.7 · — input · 128K context
HumanEval 89.7%
- 15
Granite 3.3 8B Instruct
Score 89.7 · — input · 128K context
HumanEval 89.7%
- 16
Gemini Diffusion
Score 89.6 · — input · — context
HumanEval 89.6%
- 17
DeepSeek-V2.5
Score 89.0 · — input · — context
HumanEval 89.0%
- 18
Llama 3.1 405B Instruct
Score 89.0 · — input · 128K context
HumanEval 89.0%
- 19
Nova Pro
Score 89.0 · — input · — context
HumanEval 89.0%
- 20
Kimi K2 0905
Score 88.5 · — input · 256K context
MMLU-Pro 82.5% · HumanEval 94.5%
- 21
Mistral Small 3.1 24B Instruct
Score 88.4 · — input · 128K context
HumanEval 88.4%
- 22
Grok-2
Score 88.4 · — input · — context
HumanEval 88.4%
- 23
Llama 3.3 70B Instruct
Score 88.4 · — input · 128K context
HumanEval 88.4%
- 24
Qwen2.5 32B Instruct
Score 88.4 · — input · 8K context
HumanEval 88.4%
- 25
Qwen2.5-Coder 7B Instruct
Score 88.4 · — input · 128K context
HumanEval 88.4%
- 26
Claude 3.5 Haiku
Score 88.1 · — input · — context
HumanEval 88.1%
- 27
GPT-5 Medium
Score 88.1 · — input · — context
GPQA 88.1%
- 28
GPT-5.1 High
Score 88.1 · — input · — context
GPQA 88.1%
- 29
o1
Score 88.1 · — input · — context
HumanEval 88.1%
- 30
GPT-4.5
Score 88.0 · — input · 128K context
HumanEval 88.0%
- 31
GPT-5.4 mini
Score 88.0 · — input · 128K context
GPQA 88.0%
- 32
DeepSeek-V4-Pro-Max
Score 87.9 · — input · — context
SWE 80.6% · GPQA 90.1% · MMLU-Pro 87.5% · LCB 93.5%
- 33
Gemma 3 27B
Score 87.8 · — input · 128K context
HumanEval 87.8%
- 34
Gemini 3.1 Pro
Score 87.5 · — input · — context
SWE 80.6% · GPQA 94.3%
- 35
GPT-5 High
Score 87.3 · — input · — context
GPQA 87.3%
- 36
Kimi K2 Instruct
Score 87.2 · — input · — context
MMLU-Pro 81.1% · HumanEval 93.3%
- 37
GPT-4o mini
Score 87.2 · — input · 128K context
HumanEval 87.2%
- 38
GPT-4 Turbo
Score 87.1 · — input · — context
HumanEval 87.1%
- 39
Gemini 3.1 Flash-Lite
Score 86.9 · — input · 32K context
GPQA 86.9%
- 40
DeepSeek-V4-Flash-Max
Score 86.2 · — input · — context
SWE 79.0% · GPQA 88.1% · MMLU-Pro 86.2% · LCB 91.6%
- 41
GPT-5.2
Score 86.2 · $1.75/1M input · 256K context
SWE 80.0% · GPQA 92.4%
- 42
GLM-5.1
Score 86.2 · — input · 200K context
GPQA 86.2%
- 43
Claude Opus 4.6
Score 86.1 · — input · 1M context
SWE 80.8% · GPQA 91.3%
- 44
Sarvam-30B
Score 86.1 · — input · — context
MMLU-Pro 80.0% · HumanEval 92.1%
- 45
ERNIE 5.0
Score 86.0 · — input · — context
GPQA 85.0% · MMLU-Pro 87.0%
- 46
Qwen2 72B Instruct
Score 86.0 · — input · 131K context
HumanEval 86.0%
- 47
Qwen3.6 Plus
Score 85.9 · — input · — context
SWE 78.8% · GPQA 90.4% · MMLU-Pro 88.5%
- 48
Grok-2 mini
Score 85.7 · — input · — context
HumanEval 85.7%
- 49
GPT-5.5 Instant
Score 85.6 · — input · — context
GPQA 85.6%
- 50
Gemma 3 12B
Score 85.4 · — input · 128K context
HumanEval 85.4%
How this table works
Each column links to a public benchmark leaderboard. The Score column is the average of the normalized benchmark results listed for that model in this category (0–100 scale). Models are ranked by that average, so a model with a single strong result can place above a model evaluated on several benchmarks. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
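The Score calculation described above can be sketched in a few lines of Python. This is an illustration only, not the site's actual code: the function names `category_score` and `rank_models` are hypothetical, and it assumes that normalization leaves each benchmark percentage unchanged and that the average is unweighted, which matches the multi-benchmark rows in the table (for example, SWE 87.6% and GPQA 94.2% give 90.9).

```python
from statistics import fmean


def category_score(benchmarks: dict[str, float]) -> float:
    """Unweighted mean of a model's benchmark percentages (0-100 scale).

    Assumption: each percentage is already its normalized value.
    """
    if not benchmarks:
        raise ValueError("a model needs at least one benchmark result")
    return fmean(benchmarks.values())


def rank_models(models: dict[str, dict[str, float]]) -> list[tuple[int, str, float]]:
    """Return (rank, name, score) tuples, highest score first.

    Scores are rounded to one decimal for display, as in the table.
    """
    scored = [(name, category_score(results)) for name, results in models.items()]
    # Python's sort is stable, so tied models keep their input order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (rank, name, round(score, 1))
        for rank, (name, score) in enumerate(scored, start=1)
    ]


# Three rows taken from the table above, deliberately out of order.
models = {
    "ERNIE 5.0": {"GPQA": 85.0, "MMLU-Pro": 87.0},
    "Claude Opus 4.7": {"SWE": 87.6, "GPQA": 94.2},
    "Kimi K2 0905": {"MMLU-Pro": 82.5, "HumanEval": 94.5},
}

for rank, name, score in rank_models(models):
    print(rank, name, score)
```

Because the average runs only over the benchmarks a model actually reports, a model with a single result is scored on that result alone.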
Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.