Best AI for Math
Proofs, equations, and STEM homework
50 models · 4 benchmarks · Ranked by normalized public benchmark scores (AIME, GPQA, MATH, and related evaluations). Arena live-vote rankings require llm-stats live data, which is not in our static export yet.
- 1
LongCat-Flash-Thinking-2601
Score 99.6 · — input · — context
AIME 99.6%
- 2
Nemotron 3 Nano (30B A3B)
Score 99.2 · — input · — context
AIME 99.2%
- 3
GPT OSS 20B High
Score 98.7 · — input · — context
AIME 98.7%
- 4
GPT-5.1 Medium
Score 98.4 · — input · — context
AIME 98.4%
- 5
Step-3.5-Flash
Score 97.3 · — input · — context
AIME 97.3%
- 6
GPT-5.1 Codex High
Score 96.7 · — input · — context
AIME 96.7%
- 7
Sarvam-105B
Score 96.7 · — input · — context
AIME 96.7%
- 8
Sarvam-30B
Score 96.7 · — input · — context
AIME 96.7%
- 9
GPT-5.2 Pro
Score 96.6 · $1.75/1M input · — context
AIME 100.0% · GPQA 93.2%
- 10
DeepSeek-V3.2-Speciale
Score 96.0 · — input · — context
AIME 96.0%
- 11
Gemini 3 Pro
Score 96.0 · — input · 32K context
AIME 100.0% · GPQA 91.9%
- 12
Claude Opus 4.6
Score 95.5 · — input · 1M context
AIME 99.8% · GPQA 91.3%
- 13
Gemini 3 Flash
Score 95.0 · — input · 32K context
AIME 99.7% · GPQA 90.4%
- 14
Claude Mythos Preview
Score 94.6 · — input · 128K context
GPQA 94.6%
- 15
Gemini 3.1 Pro
Score 94.3 · — input · — context
GPQA 94.3%
- 16
Grok-4 Heavy
Score 94.2 · — input · — context
AIME 100.0% · GPQA 88.4%
- 17
Claude Opus 4.7
Score 94.2 · — input · — context
GPQA 94.2%
- 18
GLM-4.6
Score 93.9 · — input · 200K context
AIME 93.9%
- 19
GPT-5.1 High
Score 93.8 · — input · — context
AIME 99.6% · GPQA 88.1%
- 20
Seed 2.0 Pro
Score 93.6 · — input · — context
AIME 98.3% · GPQA 88.9%
- 21
DeepSeek-V3.2
Score 93.1 · — input · — context
AIME 93.1%
- 22
DeepSeek-V3.2 (Thinking)
Score 93.1 · — input · — context
AIME 93.1%
- 23
K-EXAONE-236B-A23B
Score 92.8 · — input · — context
AIME 92.8%
- 24
o4-mini
Score 92.7 · — input · — context
AIME 92.7%
- 25
GPT OSS 120B High
Score 92.5 · — input · — context
AIME 92.5%
- 26
Qwen3-235B-A22B-Thinking-2507
Score 92.3 · — input · — context
AIME 92.3%
- 27
Kimi K2-Thinking-0905
Score 92.3 · — input · 256K context
AIME 100.0% · GPQA 84.5%
- 28
Kimi K2.5
Score 91.8 · — input · — context
AIME 96.1% · GPQA 87.6%
- 29
GLM-4.7-Flash
Score 91.6 · — input · — context
AIME 91.6%
- 30
Mercury 2
Score 91.1 · — input · 128K context
AIME 91.1%
- 31
GPT-5 High
Score 91.0 · — input · — context
AIME 94.6% · GPQA 87.3%
- 32
GLM-4.7
Score 90.7 · — input · — context
AIME 95.7% · GPQA 85.7%
- 33
LongCat-Flash-Thinking
Score 90.6 · — input · — context
AIME 90.6%
- 34
Kimi K2.6
Score 90.5 · — input · 256K context
GPQA 90.5%
- 35
MiniStral 3 (14B Instruct 2512)
Score 90.4 · — input · — context
MATH 90.4%
- 36
Mistral Large 3
Score 90.4 · — input · — context
MATH 90.4%
- 37
Qwen3.6 Plus
Score 90.4 · — input · — context
GPQA 90.4%
- 38
Nemotron 3 Super (120B A12B)
Score 90.2 · — input · — context
AIME 90.2%
- 39
DeepSeek-V4-Pro-Max
Score 90.1 · — input · — context
GPQA 90.1%
- 40
Claude Sonnet 4.6
Score 89.9 · — input · 1M context
GPQA 89.9%
- 41
Gemini 2.0 Flash
Score 89.7 · — input · 1M context
MATH 89.7%
- 42
Qwen3 VL 235B A22B Thinking
Score 89.7 · — input · 256K context
AIME 89.7%
- 43
Grok-4
Score 89.6 · — input · — context
AIME 91.7% · GPQA 87.5%
- 44
Muse Spark
Score 89.5 · — input · — context
GPQA 89.5%
- 45
DeepSeek-V3.2-Exp
Score 89.3 · — input · — context
AIME 89.3%
- 46
Kimi K2 0905
Score 89.1 · — input · 256K context
MATH 89.1%
- 47
Seed 2.0 Lite
Score 89.1 · — input · — context
AIME 93.0% · GPQA 85.1%
- 48
Gemma 3 27B
Score 89.0 · — input · 128K context
MATH 89.0%
- 49
Grok-3
Score 88.9 · — input · — context
AIME 93.3% · GPQA 84.6%
- 50
MiMo-V2-Flash
Score 88.9 · — input · 256K context
AIME 94.1% · GPQA 83.7%
How this table works
Each column links to a public benchmark leaderboard. The Score column is the average of a model's normalized benchmark results in this category, on a 0–100 scale. Models ranked higher appear on more math-related evaluations with stronger scores. This is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
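As a rough illustration of how a Score like the one described above could be computed, here is a minimal Python sketch. It assumes that each benchmark percentage is already on a 0–100 scale and that the Score is a plain mean of whichever benchmarks a model has, rounded to one decimal. The function names `category_score` and `rank` are hypothetical and are not taken from the site's code; the sample values are copied from the list above.

```python
from statistics import mean


def category_score(benchmarks: dict[str, float]) -> float:
    """Average the available benchmark percentages (assumed 0-100 each)."""
    if not benchmarks:
        raise ValueError("at least one benchmark result is required")
    return round(mean(benchmarks.values()), 1)


def rank(models: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    scored = [(name, category_score(results)) for name, results in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Sample rows taken from the ranking above.
models = {
    "GLM-4.6": {"AIME": 93.9},
    "Grok-4 Heavy": {"AIME": 100.0, "GPQA": 88.4},
    "GPT-5.2 Pro": {"AIME": 100.0, "GPQA": 93.2},
}

ranking = rank(models)
for name, score in ranking:
    print(f"{name}: {score}")
```

Under this assumption a model with a single benchmark result is scored on that result alone, which would explain why several models listing only AIME sit above models that list both AIME and GPQA.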
Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.