MATH Benchmark Leaderboard

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution, is assigned a difficulty level from 1 to 5, and belongs to one of seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Leaderboard

Top 50 models on the MATH benchmark, with scores taken from public evaluations.

Rank	Model	Score	Lab
1	o3-mini	97.9%	—
2	o1	96.4%	—
3	Ministral 3 (14B Instruct 2512)	90.4%	—
3	Mistral Large 3	90.4%	—
5	Gemini 2.0 Flash	89.7%	—
6	Kimi K2 0905	89.1%	—
7	Gemma 3 27B	89.0%	—
8	Ministral 3 (8B Instruct 2512)	87.6%	—
9	Gemini 2.0 Flash-Lite	86.8%	—
10	Gemini 1.5 Pro	86.5%	—
11	o1-preview	85.5%	—
12	GPT-5	84.7%	—
13	Gemma 3 12B	83.8%	—
14	Qwen2.5 32B Instruct	83.1%	—
14	Qwen2.5 72B Instruct	83.1%	—
16	Ministral 3 (3B Instruct 2512)	83.0%	—
17	Qwen2.5 VL 32B Instruct	82.2%	—
18	Phi 4	80.4%	—
19	Qwen2.5 14B Instruct	80.0%	—
20	Claude 3.5 Sonnet	78.3%	—
21	Gemini 1.5 Flash	77.9%	—
22	Llama 3.3 70B Instruct	77.0%	—
23	Nova Pro	76.6%	—
23	GPT-4o	76.6%	—
25	Grok-2	76.1%	—
26	Gemma 3 4B	75.6%	—
27	Qwen2.5 7B Instruct	75.5%	—
28	DeepSeek-V2.5	74.7%	—
29	Llama 3.1 405B Instruct	73.8%	—
30	Nova Lite	73.3%	—
31	Grok-2 mini	73.0%	—
32	GPT-4 Turbo	72.6%	—
33	Qwen3 235B A22B	71.8%	—
34	Qwen2.5-Omni-7B	71.5%	—
35	Claude 3.5 Sonnet	71.1%	—
36	Mistral Small 3 24B Instruct	70.6%	—
37	Kimi K2 Base	70.2%	—
37	GPT-4o mini	70.2%	—
39	Mistral Small 3.2 24B Instruct	69.4%	—
40	Claude 3.5 Haiku	69.4%	—
41	Nova Micro	69.3%	—
41	Mistral Small 3.1 24B Instruct	69.3%	—
43	Llama 3.2 90B Instruct	68.0%	—
44	Phi 4 Mini	64.0%	—
45	Llama 4 Maverick	61.2%	—
46	Claude 3 Opus	60.1%	—
47	Qwen2 72B Instruct	59.7%	—
48	Phi-3.5-MoE-instruct	59.5%	—
49	Gemini 1.5 Flash 8B	58.7%	—
50	Qwen2.5-Coder 32B Instruct	57.2%	—
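
The Rank column above appears to follow standard competition ranking: models with equal scores share a rank, and the next rank skips ahead (for example 3, 3, 5). That reading is an assumption drawn from the table, not a documented rule of this leaderboard. A minimal sketch of the convention, using the first five rows and a hypothetical helper named `competition_ranks`:

```python
def competition_ranks(scores):
    """Assign standard competition ranks to scores sorted in descending order.

    Equal adjacent scores share a rank; the next distinct score takes its
    1-based position, so ranks can skip (e.g. 1, 2, 3, 3, 5).
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # otherwise rank equals position
    return ranks


# First five rows of the table above: (model, score in percent).
rows = [
    ("o3-mini", 97.9),
    ("o1", 96.4),
    ("Ministral 3 (14B Instruct 2512)", 90.4),
    ("Mistral Large 3", 90.4),
    ("Gemini 2.0 Flash", 89.7),
]

# Sort by score, highest first; list.sort is stable, so ties keep their order.
rows.sort(key=lambda row: row[1], reverse=True)
ranks = competition_ranks([score for _, score in rows])

for rank, (model, score) in zip(ranks, rows):
    print(f"{rank}\t{model}\t{score:.1f}%")
```

The comparison uses the displayed one-decimal scores. A leaderboard that ranks on unrounded scores could separate models that look tied after rounding.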

