MATH Benchmark Leaderboard

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned a difficulty level from 1 to 5. The problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Leaderboard

Top 50 models on the MATH benchmark, ranked by score (scores from public evaluations). Models with the same rank number are tied.

  1. o3-mini: 97.9%
  2. o1: 96.4%
  3. Ministral 3 (14B Instruct 2512): 90.4%
  3. Mistral Large 3: 90.4%
  5. Gemini 2.0 Flash: 89.7%
  6. Kimi K2 0905: 89.1%
  7. Gemma 3 27B: 89.0%
  8. Ministral 3 (8B Instruct 2512): 87.6%
  9. Gemini 2.0 Flash-Lite: 86.8%
  10. Gemini 1.5 Pro: 86.5%
  11. o1-preview: 85.5%
  12. GPT-5: 84.7%
  13. Gemma 3 12B: 83.8%
  14. Qwen2.5 32B Instruct: 83.1%
  14. Qwen2.5 72B Instruct: 83.1%
  16. Ministral 3 (3B Instruct 2512): 83.0%
  17. Qwen2.5 VL 32B Instruct: 82.2%
  18. Phi 4: 80.4%
  19. Qwen2.5 14B Instruct: 80.0%
  20. Claude 3.5 Sonnet: 78.3%
  21. Gemini 1.5 Flash: 77.9%
  22. Llama 3.3 70B Instruct: 77.0%
  23. Nova Pro: 76.6%
  23. GPT-4o: 76.6%
  25. Grok-2: 76.1%
  26. Gemma 3 4B: 75.6%
  27. Qwen2.5 7B Instruct: 75.5%
  28. DeepSeek-V2.5: 74.7%
  29. Llama 3.1 405B Instruct: 73.8%
  30. Nova Lite: 73.3%
  31. Grok-2 mini: 73.0%
  32. GPT-4 Turbo: 72.6%
  33. Qwen3 235B A22B: 71.8%
  34. Qwen2.5-Omni-7B: 71.5%
  35. Claude 3.5 Sonnet: 71.1%
  36. Mistral Small 3 24B Instruct: 70.6%
  37. Kimi K2 Base: 70.2%
  37. GPT-4o mini: 70.2%
  39. Mistral Small 3.2 24B Instruct: 69.4%
  40. Claude 3.5 Haiku: 69.4%
  41. Nova Micro: 69.3%
  41. Mistral Small 3.1 24B Instruct: 69.3%
  43. Llama 3.2 90B Instruct: 68.0%
  44. Phi 4 Mini: 64.0%
  45. Llama 4 Maverick: 61.2%
  46. Claude 3 Opus: 60.1%
  47. Qwen2 72B Instruct: 59.7%
  48. Phi-3.5-MoE-instruct: 59.5%
  49. Gemini 1.5 Flash 8B: 58.7%
  50. Qwen2.5-Coder 32B Instruct: 57.2%
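
Several entries in the list share a rank (for example, two models at rank 3, after which the next rank is 5). This pattern matches standard competition ranking, where tied scores share a rank and the following rank is skipped. The sketch below illustrates that scheme on the top five entries. The `Entry` type and `competition_rank` function are hypothetical names introduced for illustration, and comparing the displayed one-decimal scores is an assumption; the leaderboard may break ties on unrounded values.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float  # accuracy in percent, as displayed in the list


def competition_rank(entries: list[Entry]) -> list[tuple[int, Entry]]:
    """Assign competition ranks: tied scores share a rank, the next rank is skipped."""
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    ranked: list[tuple[int, Entry]] = []
    for position, entry in enumerate(ordered, start=1):
        if ranked and ranked[-1][1].score == entry.score:
            rank = ranked[-1][0]  # same score as the previous entry: reuse its rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, entry))
    return ranked


# Top five entries copied from the list above.
top_five = [
    Entry("o3-mini", 97.9),
    Entry("o1", 96.4),
    Entry("Ministral 3 (14B Instruct 2512)", 90.4),
    Entry("Mistral Large 3", 90.4),
    Entry("Gemini 2.0 Flash", 89.7),
]

for rank, entry in competition_rank(top_five):
    print(f"{rank:>3}. {entry.model}: {entry.score:.1f}%")
```

On this sample the two 90.4% models both receive rank 3 and Gemini 2.0 Flash receives rank 5, as in the list.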

