AIME 2025 Benchmark Leaderboard

This benchmark consists of all 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II). The problems test olympiad-level mathematical reasoning, and every answer is an integer from 000 to 999. The set is used as an AI benchmark to evaluate how well large language models solve complex mathematical problems that require multi-step logical deduction and structured symbolic reasoning.
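Because every answer is an integer from 000 to 999, a response can be graded by exact match. The sketch below shows one way such a score could be computed. It is an illustration under assumptions, not the method used by any evaluation on this page: the function names (`parse_answer`, `score_run`, `mean_score`) and the toy answers are hypothetical, and averaging over several runs is only a guess at how scores that are not multiples of 1/30 (such as 99.8%) might arise.

```python
def parse_answer(text):
    """Return an AIME-style answer as an int in [0, 999], or None if invalid."""
    try:
        value = int(str(text).strip())  # "042" and "42" both parse to 42
    except ValueError:
        return None
    return value if 0 <= value <= 999 else None


def score_run(predictions, answers):
    """Fraction of problems answered correctly in one run (exact integer match)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = 0
    for predicted, reference in zip(predictions, answers):
        parsed = parse_answer(predicted)
        if parsed is not None and parsed == parse_answer(reference):
            correct += 1
    return correct / len(answers)


def mean_score(runs, answers):
    """Average accuracy over several runs (assumed aggregation, see lead-in)."""
    return sum(score_run(run, answers) for run in runs) / len(runs)


# Toy data for illustration only; these are NOT real AIME 2025 answers.
answers = ["042", "007", "999"]
run_a = ["42", "7", "1000"]     # third prediction is out of range, so it is wrong
run_b = ["042", "007", "999"]   # all correct
print(f"{mean_score([run_a, run_b], answers):.1%}")  # prints 83.3%
```

Parsing both sides to integers means a prediction of `42` matches a reference of `042`, which avoids penalising a model for dropping leading zeros.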

Leaderboard

Top 50 models on the AIME 2025 benchmark, with scores taken from public evaluations. Models with equal scores share a rank.

Rank	Model	Score	Lab
1	Grok-4 Heavy	100.0%	—
1	GPT-5.2	100.0%	—
1	Kimi K2-Thinking-0905	100.0%	—
1	GPT-5.2 Pro	100.0%	—
1	Gemini 3 Pro	100.0%	—
6	Claude Opus 4.6	99.8%	—
7	Gemini 3 Flash	99.7%	—
8	LongCat-Flash-Thinking-2601	99.6%	—
8	GPT-5.1 High	99.6%	—
10	Nemotron 3 Nano (30B A3B)	99.2%	—
11	GPT OSS 20B High	98.7%	—
12	GPT-5.1 Medium	98.4%	—
13	Seed 2.0 Pro	98.3%	—
14	Step-3.5-Flash	97.3%	—
15	Sarvam-30B	96.7%	—
15	GPT-5.1 Codex High	96.7%	—
15	Sarvam-105B	96.7%	—
18	Kimi K2.5	96.1%	—
19	DeepSeek-V3.2-Speciale	96.0%	—
20	GLM-4.7	95.7%	—
21	GPT-5	94.6%	—
21	GPT-5 High	94.6%	—
23	MiMo-V2-Flash	94.1%	—
24	GPT-5.1 Thinking	94.0%	—
24	GPT-5.1	94.0%	—
24	GPT-5.1 Instant	94.0%	—
27	GLM-4.6	93.9%	—
28	Grok-3	93.3%	—
29	DeepSeek-V3.2 (Thinking)	93.1%	—
29	DeepSeek-V3.2	93.1%	—
31	Seed 2.0 Lite	93.0%	—
32	K-EXAONE-236B-A23B	92.8%	—
33	o4-mini	92.7%	—
34	GPT OSS 120B High	92.5%	—
35	Qwen3-235B-A22B-Thinking-2507	92.3%	—
36	Grok 4 Fast	92.0%	—
37	Grok-4	91.7%	—
38	GLM-4.7-Flash	91.6%	—
39	Mercury 2	91.1%	—
39	GPT-5 mini	91.1%	—
41	Grok-3 Mini	90.8%	—
42	LongCat-Flash-Thinking	90.6%	—
43	Nemotron 3 Super (120B A12B)	90.2%	—
44	Qwen3 VL 235B A22B Thinking	89.7%	—
45	DeepSeek-V3.2-Exp	89.3%	—
46	GPT-5 Medium	88.9%	—
47	Gemini 2.5 Pro Preview 06-05	88.0%	—
48	Qwen3-Next-80B-A3B-Thinking	87.8%	—
49	Step3-VL-10B	87.7%	—
50	DeepSeek-R1-0528	87.5%	—
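The Rank column follows standard competition ranking: tied models share a rank, and the next rank skips ahead by the number of tied entries (five models at rank 1, then rank 6). A minimal sketch of that scheme is shown here, using the first seven rows of the table as input. The function name `competition_rank` is hypothetical, and this is not necessarily how the page itself is generated.

```python
def competition_rank(rows):
    """Assign ranks to (model, score) pairs; equal scores share a rank."""
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position       # no tie: rank equals 1-based position
        ranked.append((rank, model, score))
    return ranked


# First seven rows of the leaderboard above.
top_rows = [
    ("Grok-4 Heavy", 100.0),
    ("GPT-5.2", 100.0),
    ("Kimi K2-Thinking-0905", 100.0),
    ("GPT-5.2 Pro", 100.0),
    ("Gemini 3 Pro", 100.0),
    ("Claude Opus 4.6", 99.8),
    ("Gemini 3 Flash", 99.7),
]

for rank, model, score in competition_rank(top_rows):
    print(f"{rank}\t{model}\t{score:.1f}%")
```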
