Best AI for Reasoning

Logic, planning, and hard problems

50 models · 4 benchmarks · Ranked by normalized public benchmark scores (GPQA, ARC-2, ARC, and HLE). Arena live-vote rankings require llm-stats live data, which is not yet in our static export.

#	Model	Score	Input / 1M	Context	GPQA	ARC-2	ARC	HLE
1	Seed 2.0 Pro	88.9	—	—	88.9%	—	—	—
2	GPT-5 Medium	88.1	—	—	88.1%	—	—	—
3	GPT-5.1	88.1	—	—	88.1%	—	—	—
4	GPT-5.1 High	88.1	—	—	88.1%	—	—	—
5	GPT-5.1 Instant	88.1	—	—	88.1%	—	—	—
6	GPT-5.1 Thinking	88.1	—	—	88.1%	—	—	—
7	GPT-5 High	87.3	—	—	87.3%	—	—	—
8	Gemini 3.1 Flash-Lite	86.9	—	32K	86.9%	—	—	—
9	GPT-5.5 Instant OpenAI	85.6	—	—	85.6%	—	—	—
10	Seed 2.0 Lite	85.1	—	—	85.1%	—	—	—
11	Claude 3.7 Sonnet	84.8	—	—	84.8%	—	—	—
12	Grok-3 xAI	84.6	—	—	84.6%	—	—	—
13	ChatGPT-4o Latest	84.0	—	—	84.0%	—	—	—
14	Grok-3 Mini xAI	84.0	—	—	84.0%	—	—	—
15	GPT-5.5 OpenAI	81.5	—	128K	93.6%	85.0%	95.0%	52.2%
16	Claude Mythos Preview Anthropic	79.7	—	128K	94.6%	—	—	64.7%
17	GPT-5.4 OpenAI	74.9	—	1M	92.8%	73.3%	93.7%	39.8%
18	Claude Opus 4.7 Anthropic	74.4	—	—	94.2%	—	—	54.7%
19	Gemini 3.1 Pro	74.3	—	—	94.3%	77.1%	—	51.4%
20	Claude Opus 4.6 Anthropic	71.1	—	1M	91.3%	68.8%	—	53.1%
21	Grok-4 Heavy	69.5	—	—	88.4%	—	—	50.7%
22	GLM-5.1	69.3	—	200K	86.2%	—	—	52.3%
23	DeepSeek-V4-Pro-Max DeepSeek	69.2	—	—	90.1%	—	—	48.2%
24	Kimi K2.5 Moonshot	68.9	—	—	87.6%	—	—	50.2%
25	GPT-5.2 Pro	68.6	$1.75/1M	—	93.2%	54.2%	90.5%	36.6%
26	Kimi K2-Thinking-0905 Moonshot	67.8	—	256K	84.5%	—	—	51.0%
27	Qwen3.5-122B-A10B Qwen	67.0	—	262K	86.6%	—	—	47.5%
28	Qwen3.5-27B Qwen	67.0	—	262K	85.5%	—	—	48.5%
29	DeepSeek-V4-Flash-Max DeepSeek	66.6	—	—	88.1%	—	—	45.1%
30	GPT-5.2	66.5	$1.75/1M	256K	92.4%	52.9%	86.2%	34.5%
31	Qwen3.5-35B-A3B Qwen	65.8	—	262K	84.2%	—	—	47.4%
32	Claude Sonnet 4.6	65.7	—	1M	89.9%	58.3%	—	49.0%
33	GLM-4.7	64.3	—	—	85.7%	—	—	42.8%
34	Muse Spark Meta	63.5	—	—	89.5%	42.5%	—	58.4%
35	Kimi K2.6 Moonshot	63.5	—	256K	90.5%	—	—	36.4%
36	Claude Opus 4.5	62.3	—	200K	87.0%	37.6%	—	—
37	ERNIE 5.0	62.0	—	—	85.0%	—	—	39.0%
38	Qwen3.6 Plus Alibaba	59.6	—	—	90.4%	—	—	28.8%
39	Qwen3.5-397B-A17B Qwen	58.6	—	—	88.4%	—	—	28.7%
40	GPT-5.4 mini OpenAI	58.1	—	128K	88.0%	—	—	28.2%
41	GPT-5.5 Pro	57.2	$30/1M	1M	—	—	—	57.2%
42	Gemini 3 Pro	56.3	—	32K	91.9%	31.1%	—	45.8%
43	Gemini 3.5 Flash Google	56.1	—	—	—	72.1%	—	40.2%
44	Qwen3.6-27B Alibaba	55.9	—	—	87.8%	—	—	24.0%
45	Gemini 3 Flash	55.8	—	32K	90.4%	33.6%	—	43.5%
46	Gemma 4 31B Google	55.4	—	256K	84.3%	—	—	26.5%
47	GPT-5 OpenAI	55.3	—	128K	85.7%	—	—	24.8%
48	Gemini 2.5 Pro Preview 06-05 Google	54.0	—	1M	86.4%	—	—	21.6%
49	Qwen3.6-35B-A3B Qwen	53.7	—	262K	86.0%	—	—	21.4%
50	MiMo-V2-Flash	52.9	—	256K	83.7%	—	—	22.1%

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of the normalized benchmark results available for that model in this category (0–100 scale); benchmarks with no reported result are left out of the average. As a result, a model with a single strong GPQA result can rank above a model evaluated on all four benchmarks with a lower average. The approach is similar in spirit to llm-stats, but we do not yet include live coding-arena TrueSkill or API latency columns from their live product.
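The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function names `category_score` and `rank` are hypothetical, and it assumes the percentages in the table are already on the 0–100 scale, so no further normalization is applied. Missing results are represented as `None` and skipped.

```python
from statistics import mean
from typing import Optional

# Benchmark columns as they appear in the table header.
BENCHMARKS = ("GPQA", "ARC-2", "ARC", "HLE")


def category_score(results: dict[str, Optional[float]]) -> Optional[float]:
    """Average the benchmarks a model has results for (assumed rule).

    Returns None when the model has no results in this category.
    """
    available = [results[b] for b in BENCHMARKS if results.get(b) is not None]
    if not available:
        return None
    return round(mean(available), 1)


def rank(models: dict[str, dict[str, Optional[float]]]) -> list[tuple[str, float]]:
    """Sort models by category score, highest first, dropping unscored models."""
    scored = [(name, category_score(res)) for name, res in models.items()]
    kept = [(name, score) for name, score in scored if score is not None]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Three rows copied from the table above; None marks a missing result.
models = {
    "GPT-5.4": {"GPQA": 92.8, "ARC-2": 73.3, "ARC": 93.7, "HLE": 39.8},
    "Gemini 3.1 Pro": {"GPQA": 94.3, "ARC-2": 77.1, "ARC": None, "HLE": 51.4},
    "Seed 2.0 Pro": {"GPQA": 88.9, "ARC-2": None, "ARC": None, "HLE": None},
}

for name, score in rank(models):
    print(f"{name}: {score}")
```

With these three rows the sketch reproduces the table's Score values (88.9, 74.9, 74.3) and shows why a model with only a GPQA result can sit above models that report all four benchmarks.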

Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.
