Best AI for Reasoning
Logic, planning, and hard problems
50 models · 4 benchmarks · Ranked by normalized public benchmark scores (GPQA, ARC, ARC-2, and HLE). Arena live-vote rankings require llm-stats live data, which is not yet in our static export.
- 1 · Seed 2.0 Pro · Score 88.9 · n/a input · n/a context · GPQA 88.9%
- 2 · GPT-5 Medium · Score 88.1 · n/a input · n/a context · GPQA 88.1%
- 3 · GPT-5.1 · Score 88.1 · n/a input · n/a context · GPQA 88.1%
- 4 · GPT-5.1 High · Score 88.1 · n/a input · n/a context · GPQA 88.1%
- 5 · GPT-5.1 Instant · Score 88.1 · n/a input · n/a context · GPQA 88.1%
- 6 · GPT-5.1 Thinking · Score 88.1 · n/a input · n/a context · GPQA 88.1%
- 7 · GPT-5 High · Score 87.3 · n/a input · n/a context · GPQA 87.3%
- 8 · Gemini 3.1 Flash-Lite · Score 86.9 · n/a input · 32K context · GPQA 86.9%
- 9 · GPT-5.5 Instant · Score 85.6 · n/a input · n/a context · GPQA 85.6%
- 10 · Seed 2.0 Lite · Score 85.1 · n/a input · n/a context · GPQA 85.1%
- 11 · Claude 3.7 Sonnet · Score 84.8 · n/a input · n/a context · GPQA 84.8%
- 12 · Grok-3 · Score 84.6 · n/a input · n/a context · GPQA 84.6%
- 13 · ChatGPT-4o Latest · Score 84.0 · n/a input · n/a context · GPQA 84.0%
- 14 · Grok-3 Mini · Score 84.0 · n/a input · n/a context · GPQA 84.0%
- 15 · GPT-5.5 · Score 81.5 · n/a input · 128K context · GPQA 93.6% · ARC-2 85.0% · ARC 95.0% · HLE 52.2%
- 16 · Claude Mythos Preview · Score 79.7 · n/a input · 128K context · GPQA 94.6% · HLE 64.7%
- 17 · GPT-5.4 · Score 74.9 · n/a input · 1M context · GPQA 92.8% · ARC-2 73.3% · ARC 93.7% · HLE 39.8%
- 18 · Claude Opus 4.7 · Score 74.4 · n/a input · n/a context · GPQA 94.2% · HLE 54.7%
- 19 · Gemini 3.1 Pro · Score 74.3 · n/a input · n/a context · GPQA 94.3% · ARC-2 77.1% · HLE 51.4%
- 20 · Claude Opus 4.6 · Score 71.1 · n/a input · 1M context · GPQA 91.3% · ARC-2 68.8% · HLE 53.1%
- 21 · Grok-4 Heavy · Score 69.5 · n/a input · n/a context · GPQA 88.4% · HLE 50.7%
- 22 · GLM-5.1 · Score 69.3 · n/a input · 200K context · GPQA 86.2% · HLE 52.3%
- 23 · DeepSeek-V4-Pro-Max · Score 69.2 · n/a input · n/a context · GPQA 90.1% · HLE 48.2%
- 24 · Kimi K2.5 · Score 68.9 · n/a input · n/a context · GPQA 87.6% · HLE 50.2%
- 25 · GPT-5.2 Pro · Score 68.6 · $1.75/1M input · n/a context · GPQA 93.2% · ARC-2 54.2% · ARC 90.5% · HLE 36.6%
- 26 · Kimi K2-Thinking-0905 · Score 67.8 · n/a input · 256K context · GPQA 84.5% · HLE 51.0%
- 27 · Qwen3.5-122B-A10B · Score 67.0 · n/a input · 262K context · GPQA 86.6% · HLE 47.5%
- 28 · Qwen3.5-27B · Score 67.0 · n/a input · 262K context · GPQA 85.5% · HLE 48.5%
- 29 · DeepSeek-V4-Flash-Max · Score 66.6 · n/a input · n/a context · GPQA 88.1% · HLE 45.1%
- 30 · GPT-5.2 · Score 66.5 · $1.75/1M input · 256K context · GPQA 92.4% · ARC-2 52.9% · ARC 86.2% · HLE 34.5%
- 31 · Qwen3.5-35B-A3B · Score 65.8 · n/a input · 262K context · GPQA 84.2% · HLE 47.4%
- 32 · Claude Sonnet 4.6 · Score 65.7 · n/a input · 1M context · GPQA 89.9% · ARC-2 58.3% · HLE 49.0%
- 33 · GLM-4.7 · Score 64.3 · n/a input · n/a context · GPQA 85.7% · HLE 42.8%
- 34 · Muse Spark · Score 63.5 · n/a input · n/a context · GPQA 89.5% · ARC-2 42.5% · HLE 58.4%
- 35 · Kimi K2.6 · Score 63.5 · n/a input · 256K context · GPQA 90.5% · HLE 36.4%
- 36 · Claude Opus 4.5 · Score 62.3 · n/a input · 200K context · GPQA 87.0% · ARC-2 37.6%
- 37 · ERNIE 5.0 · Score 62.0 · n/a input · n/a context · GPQA 85.0% · HLE 39.0%
- 38 · Qwen3.6 Plus · Score 59.6 · n/a input · n/a context · GPQA 90.4% · HLE 28.8%
- 39 · Qwen3.5-397B-A17B · Score 58.6 · n/a input · n/a context · GPQA 88.4% · HLE 28.7%
- 40 · GPT-5.4 mini · Score 58.1 · n/a input · 128K context · GPQA 88.0% · HLE 28.2%
- 41 · GPT-5.5 Pro · Score 57.2 · $30/1M input · 1M context · HLE 57.2%
- 42 · Gemini 3 Pro · Score 56.3 · n/a input · 32K context · GPQA 91.9% · ARC-2 31.1% · HLE 45.8%
- 43 · Gemini 3.5 Flash · Score 56.1 · n/a input · n/a context · ARC-2 72.1% · HLE 40.2%
- 44 · Qwen3.6-27B · Score 55.9 · n/a input · n/a context · GPQA 87.8% · HLE 24.0%
- 45 · Gemini 3 Flash · Score 55.8 · n/a input · 32K context · GPQA 90.4% · ARC-2 33.6% · HLE 43.5%
- 46 · Gemma 4 31B · Score 55.4 · n/a input · 256K context · GPQA 84.3% · HLE 26.5%
- 47 · GPT-5 · Score 55.3 · n/a input · 128K context · GPQA 85.7% · HLE 24.8%
- 48 · Gemini 2.5 Pro Preview 06-05 · Score 54.0 · n/a input · 1M context · GPQA 86.4% · HLE 21.6%
- 49 · Qwen3.6-35B-A3B · Score 53.7 · n/a input · 262K context · GPQA 86.0% · HLE 21.4%
- 50 · MiMo-V2-Flash · Score 52.9 · n/a input · 256K context · GPQA 83.7% · HLE 22.1%
How this table works
Each column links to a public benchmark leaderboard. The Score column is the average of the normalized benchmark results available for that model in this category (0–100 scale). Higher-ranked models have stronger average scores on the reasoning evaluations they appear on. The approach is similar in spirit to llm-stats, but we do not yet include the live arena TrueSkill or API latency columns from their live product.
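As a rough illustration of the averaging described above, here is a minimal sketch. It assumes the listed percentages are already the normalized values and that the Score is an unweighted mean over whichever benchmarks a model reports. The function name `category_score` is hypothetical, and the site's actual normalization and rounding rules are not documented here.

```python
def category_score(results):
    """Unweighted mean of the benchmark results a model actually has.

    `results` maps benchmark name -> percentage (0-100), or None when the
    model has no published result. Missing benchmarks are skipped, not
    counted as zero (an assumption that is consistent with the listed rows).
    """
    values = [v for v in results.values() if v is not None]
    if not values:
        raise ValueError("model has no benchmark results in this category")
    return sum(values) / len(values)


# Rows copied from the list above.
gpt_5_4 = {"GPQA": 92.8, "ARC-2": 73.3, "ARC": 93.7, "HLE": 39.8}
seed_2_pro = {"GPQA": 88.9, "ARC-2": None, "ARC": None, "HLE": None}

print(f"{category_score(gpt_5_4):.1f}")     # 74.9, the listed Score for GPT-5.4
print(f"{category_score(seed_2_pro):.1f}")  # 88.9, the listed Score for Seed 2.0 Pro
```

Under this reading, a model with a single reported benchmark gets a Score equal to that one result, which is why GPQA-only entries can sit above models that also report lower HLE or ARC-2 numbers.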
Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect Supabase or the llm-stats API refresh.