MATH Benchmark Leaderboard
The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned a difficulty level from 1 to 5. The problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
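As a rough illustration of the structure described above, the sketch below models one problem record and computes a per-subject accuracy breakdown. The class name `MathProblem`, its field names, and the helper `accuracy_by_subject` are hypothetical choices made for this example; they are not taken from the dataset's official loader or from any evaluation harness.

```python
from collections import Counter
from dataclasses import dataclass

# The seven subjects named in the dataset description.
SUBJECTS = (
    "Prealgebra",
    "Algebra",
    "Number Theory",
    "Counting and Probability",
    "Geometry",
    "Intermediate Algebra",
    "Precalculus",
)


@dataclass(frozen=True)
class MathProblem:
    """Hypothetical record for one problem (field names are assumptions)."""

    problem: str   # problem statement
    solution: str  # full step-by-step solution
    subject: str   # one of SUBJECTS
    level: int     # difficulty level, 1 (easiest) to 5 (hardest)

    def __post_init__(self):
        if self.subject not in SUBJECTS:
            raise ValueError(f"unknown subject: {self.subject!r}")
        if not 1 <= self.level <= 5:
            raise ValueError(f"level must be in 1..5, got {self.level}")


def accuracy_by_subject(results):
    """Map each subject to its percentage of correct answers.

    `results` is an iterable of (MathProblem, bool) pairs, where the bool
    says whether the model's final answer was judged correct.
    """
    total = Counter()
    correct = Counter()
    for problem, ok in results:
        total[problem.subject] += 1
        correct[problem.subject] += int(ok)
    return {s: round(100.0 * correct[s] / total[s], 1) for s in total}
```

A single headline percentage like those on a leaderboard would then be the same calculation taken over all graded problems rather than per subject, assuming plain correct-answer accuracy is the metric.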
Leaderboard
Top 50 models on MATH Benchmark Leaderboard (scores from public evaluations).
- 1. o3-mini: 97.9%
- 2. o1: 96.4%
- 3. Ministral 3 (14B Instruct 2512): 90.4%
- 3. Mistral Large 3: 90.4%
- 5. Gemini 2.0 Flash: 89.7%
- 6. Kimi K2 0905: 89.1%
- 7. Gemma 3 27B: 89.0%
- 8. Ministral 3 (8B Instruct 2512): 87.6%
- 9. Gemini 2.0 Flash-Lite: 86.8%
- 10. Gemini 1.5 Pro: 86.5%
- 11. o1-preview: 85.5%
- 12. GPT-5: 84.7%
- 13. Gemma 3 12B: 83.8%
- 14. Qwen2.5 32B Instruct: 83.1%
- 14. Qwen2.5 72B Instruct: 83.1%
- 16. Ministral 3 (3B Instruct 2512): 83.0%
- 17. Qwen2.5 VL 32B Instruct: 82.2%
- 18. Phi 4: 80.4%
- 19. Qwen2.5 14B Instruct: 80.0%
- 20. Claude 3.5 Sonnet: 78.3%
- 21. Gemini 1.5 Flash: 77.9%
- 22. Llama 3.3 70B Instruct: 77.0%
- 23. Nova Pro: 76.6%
- 23. GPT-4o: 76.6%
- 25. Grok-2: 76.1%
- 26. Gemma 3 4B: 75.6%
- 27. Qwen2.5 7B Instruct: 75.5%
- 28. DeepSeek-V2.5: 74.7%
- 29. Llama 3.1 405B Instruct: 73.8%
- 30. Nova Lite: 73.3%
- 31. Grok-2 mini: 73.0%
- 32. GPT-4 Turbo: 72.6%
- 33. Qwen3 235B A22B: 71.8%
- 34. Qwen2.5-Omni-7B: 71.5%
- 35. Claude 3.5 Sonnet: 71.1%
- 36. Mistral Small 3 24B Instruct: 70.6%
- 37. Kimi K2 Base: 70.2%
- 37. GPT-4o mini: 70.2%
- 39. Mistral Small 3.2 24B Instruct: 69.4%
- 40. Claude 3.5 Haiku: 69.4%
- 41. Nova Micro: 69.3%
- 41. Mistral Small 3.1 24B Instruct: 69.3%
- 43. Llama 3.2 90B Instruct: 68.0%
- 44. Phi 4 Mini: 64.0%
- 45. Llama 4 Maverick: 61.2%
- 46. Claude 3 Opus: 60.1%
- 47. Qwen2 72B Instruct: 59.7%
- 48. Phi-3.5-MoE-instruct: 59.5%
- 49. Gemini 1.5 Flash 8B: 58.7%
- 50. Qwen2.5-Coder 32B Instruct: 57.2%
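The rank numbers in the list above appear to follow standard competition ranking: the two models at 90.4% share rank 3 and the next model is rank 5. The sketch below shows that scheme under the assumption that ties are decided on the displayed scores; the function name `competition_ranks` is hypothetical, and the site's actual ranking logic is not documented here. The pair at 69.4% carries distinct ranks (39 and 40), which may mean the site ranks on unrounded values.

```python
def competition_ranks(scores):
    """Pair each score with its 1-based rank, highest score first.

    Equal scores share a rank, and the ranks that the tied entries
    would otherwise occupy are skipped (1, 2, 3, 3, 5, ...).
    """
    ordered = sorted(scores, reverse=True)
    ranks = []
    for i, score in enumerate(ordered):
        if i > 0 and score == ordered[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # otherwise rank is position + 1
    return list(zip(ranks, ordered))
```

Applied to the first five scores in the list, this assigns ranks 1, 2, 3, 3, and 5, matching the numbering shown.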
Models tracked
Models with math in their evaluation profile.
- ChatGPT-4o Latest
- Claude 3.5 Haiku (Anthropic)
- Claude 3.5 Sonnet
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude 3 Haiku
- Claude 3 Opus (Anthropic)
- Claude 3 Sonnet
- Claude Haiku 4.5 (Anthropic)
- Claude Mythos Preview (Anthropic)
- Claude Opus 4.1
- Claude Opus 4 (Anthropic)
- Claude Opus 4.5
- Claude Opus 4.6 (Anthropic)
- Claude Opus 4.7 (Anthropic)
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Sonnet 4.6
- Codestral-22B
- Command R+
- DeepSeek-V3.2 (Non-thinking) (DeepSeek)
- DeepSeek-R1-0528 (DeepSeek)
- DeepSeek R1 Distill Llama 70B (DeepSeek)
- DeepSeek R1 Distill Llama 8B (DeepSeek)
- DeepSeek R1 Distill Qwen 14B (DeepSeek)
- DeepSeek R1 Distill Qwen 32B (DeepSeek)
- DeepSeek R1 Distill Qwen 7B (DeepSeek)
- DeepSeek R1 Zero (DeepSeek)
- DeepSeek-V3.2 (Thinking) (DeepSeek)
- DeepSeek-V2.5 (DeepSeek)
- DeepSeek-V3 0324
- DeepSeek-V3.1 (DeepSeek)
- DeepSeek-V3.2-Exp (DeepSeek)
- DeepSeek-V3.2-Speciale (DeepSeek)
- DeepSeek-V3.2 (DeepSeek)
- DeepSeek-V3
- DeepSeek-V4-Flash-Max (DeepSeek)
- DeepSeek-V4-Pro-Max (DeepSeek)
- DeepSeek VL2 Small (DeepSeek)
- DeepSeek VL2 Tiny (DeepSeek)
- DeepSeek VL2 (DeepSeek)
- ERNIE 4.5
- ERNIE 5.0
- Gemini 1.0 Pro
- Gemini 1.5 Flash 8B