Best AI for Math

Proofs, equations, and STEM homework

50 models · 4 benchmarks · Ranked by normalized public benchmark scores (AIME, GPQA, MATH, and Frontier). Arena live-vote rankings require llm-stats live data, which is not yet in our static export.

#	Model	Score	Input price / 1M tokens	Context	AIME	GPQA	MATH	Frontier
1	LongCat-Flash-Thinking-2601	99.6	—	—	99.6%	—	—	—
2	Nemotron 3 Nano (30B A3B)	99.2	—	—	99.2%	—	—	—
3	GPT OSS 20B High OpenAI	98.7	—	—	98.7%	—	—	—
4	GPT-5.1 Medium	98.4	—	—	98.4%	—	—	—
5	Step-3.5-Flash	97.3	—	—	97.3%	—	—	—
6	GPT-5.1 Codex High	96.7	—	—	96.7%	—	—	—
7	Sarvam-105B	96.7	—	—	96.7%	—	—	—
8	Sarvam-30B	96.7	—	—	96.7%	—	—	—
9	GPT-5.2 Pro	96.6	$1.75	—	100.0%	93.2%	—	—
10	DeepSeek-V3.2-Speciale DeepSeek	96.0	—	—	96.0%	—	—	—
11	Gemini 3 Pro	96.0	—	32K	100.0%	91.9%	—	—
12	Claude Opus 4.6 Anthropic	95.5	—	1M	99.8%	91.3%	—	—
13	Gemini 3 Flash	95.0	—	32K	99.7%	90.4%	—	—
14	Claude Mythos Preview Anthropic	94.6	—	128K	—	94.6%	—	—
15	Gemini 3.1 Pro	94.3	—	—	—	94.3%	—	—
16	Grok-4 Heavy	94.2	—	—	100.0%	88.4%	—	—
17	Claude Opus 4.7 Anthropic	94.2	—	—	—	94.2%	—	—
18	GLM-4.6 Zhipu AI	93.9	—	200K	93.9%	—	—	—
19	GPT-5.1 High	93.8	—	—	99.6%	88.1%	—	—
20	Seed 2.0 Pro	93.6	—	—	98.3%	88.9%	—	—
21	DeepSeek-V3.2 DeepSeek	93.1	—	—	93.1%	—	—	—
22	DeepSeek-V3.2 (Thinking) DeepSeek	93.1	—	—	93.1%	—	—	—
23	K-EXAONE-236B-A23B	92.8	—	—	92.8%	—	—	—
24	o4-mini OpenAI	92.7	—	—	92.7%	—	—	—
25	GPT OSS 120B High	92.5	—	—	92.5%	—	—	—
26	Qwen3-235B-A22B-Thinking-2507 Qwen	92.3	—	—	92.3%	—	—	—
27	Kimi K2-Thinking-0905 Moonshot	92.3	—	256K	100.0%	84.5%	—	—
28	Kimi K2.5 Moonshot	91.8	—	—	96.1%	87.6%	—	—
29	GLM-4.7-Flash	91.6	—	—	91.6%	—	—	—
30	Mercury 2	91.1	—	128K	91.1%	—	—	—
31	GPT-5 High	91.0	—	—	94.6%	87.3%	—	—
32	GLM-4.7	90.7	—	—	95.7%	85.7%	—	—
33	LongCat-Flash-Thinking	90.6	—	—	90.6%	—	—	—
34	Kimi K2.6 Moonshot	90.5	—	256K	—	90.5%	—	—
35	MiniStral 3 (14B Instruct 2512)	90.4	—	—	—	—	90.4%	—
36	Mistral Large 3 Mistral	90.4	—	—	—	—	90.4%	—
37	Qwen3.6 Plus Alibaba	90.4	—	—	—	90.4%	—	—
38	Nemotron 3 Super (120B A12B)	90.2	—	—	90.2%	—	—	—
39	DeepSeek-V4-Pro-Max DeepSeek	90.1	—	—	—	90.1%	—	—
40	Claude Sonnet 4.6	89.9	—	1M	—	89.9%	—	—
41	Gemini 2.0 Flash	89.7	—	1M	—	—	89.7%	—
42	Qwen3 VL 235B A22B Thinking Qwen	89.7	—	256K	89.7%	—	—	—
43	Grok-4 xAI	89.6	—	—	91.7%	87.5%	—	—
44	Muse Spark Meta	89.5	—	—	—	89.5%	—	—
45	DeepSeek-V3.2-Exp DeepSeek	89.3	—	—	89.3%	—	—	—
46	Kimi K2 0905 Moonshot	89.1	—	256K	—	—	89.1%	—
47	Seed 2.0 Lite	89.1	—	—	93.0%	85.1%	—	—
48	Gemma 3 27B Google	89.0	—	128K	—	—	89.0%	—
49	Grok-3 xAI	88.9	—	—	93.3%	84.6%	—	—
50	MiMo-V2-Flash	88.9	—	256K	94.1%	83.7%	—	—

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of normalized benchmark results for that model in this category (0–100 scale), taken over the benchmarks for which the model has a published result. A model with a single strong result can therefore rank above a model with results on several benchmarks. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
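As a rough illustration of the Score column, the sketch below averages whichever benchmark values a model has and sorts by the result. It is not the site's actual pipeline: the names `ROWS`, `category_score`, and `rank` are hypothetical, and it assumes that percentage scores already count as normalized to the 0–100 scale and that missing benchmarks are skipped rather than counted as zero. The three sample rows are copied from the table above.

```python
from statistics import mean
from typing import Optional

# Benchmark values copied from three rows of the table (percent).
# None marks a benchmark with no published result (a dash in the table).
ROWS: dict[str, dict[str, Optional[float]]] = {
    "GPT-5.2 Pro": {"AIME": 100.0, "GPQA": 93.2, "MATH": None, "Frontier": None},
    "Grok-4 Heavy": {"AIME": 100.0, "GPQA": 88.4, "MATH": None, "Frontier": None},
    "LongCat-Flash-Thinking-2601": {"AIME": 99.6, "GPQA": None, "MATH": None, "Frontier": None},
}


def category_score(results: dict[str, Optional[float]]) -> Optional[float]:
    """Average the benchmarks that have a value (assumption: missing ones are skipped)."""
    available = [value for value in results.values() if value is not None]
    if not available:
        return None  # a model with no results gets no score
    return round(mean(available), 1)


def rank(rows: dict[str, dict[str, Optional[float]]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    scored = [(name, category_score(results)) for name, results in rows.items()]
    kept = [(name, score) for name, score in scored if score is not None]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for position, (name, score) in enumerate(rank(ROWS), start=1):
        print(position, name, score)
```

Under these assumptions, the single-benchmark LongCat row sorts ahead of GPT-5.2 Pro even though the latter has the higher AIME result, which matches the ordering in the table.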

Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect Supabase or an llm-stats API refresh.
