GSM8k Benchmark Leaderboard

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.
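
As a rough illustration of how a score on a dataset like this is commonly computed, the sketch below assumes reference solutions that end with a `#### <number>` line (the convention used in the published GSM8K data) and counts a model completion as correct when the last number it contains matches that reference. The function names and the sample problem are hypothetical, and real evaluation harnesses differ in prompting and answer extraction, so treat this as a sketch rather than the method behind any score listed here.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_float(num: str) -> float:
    """Drop thousands separators and convert to float ('1,000' -> 1000.0)."""
    return float(num.replace(",", ""))


def reference_answer(solution: str) -> float:
    """Return the number that follows the final '####' marker."""
    tail = solution.rsplit("####", 1)[-1]
    match = NUMBER.search(tail)
    if match is None:
        raise ValueError("no numeric answer after '####'")
    return to_float(match.group())


def predicted_answer(completion: str) -> Optional[float]:
    """Assumption: the last number in the completion is the model's answer."""
    matches = NUMBER.findall(completion)
    return to_float(matches[-1]) if matches else None


def accuracy(completions: Sequence[str], solutions: Sequence[str]) -> float:
    """Percentage of completions whose final number equals the reference."""
    correct = sum(
        predicted_answer(c) == reference_answer(s)
        for c, s in zip(completions, solutions)
    )
    return 100.0 * correct / len(solutions)


if __name__ == "__main__":
    # Hypothetical problem, written in the '####' answer style.
    solution = "One box holds 6 pens, so 3 boxes hold 6 * 3 = 18 pens.\n#### 18"
    completions = ["3 boxes hold 6 * 3 = 18 pens.", "The answer is 20."]
    print(accuracy(completions, [solution, solution]))
```

Taking the last number in the completion is a simple heuristic; stricter harnesses require a fixed answer format, which can shift scores by a few points between evaluations.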

Leaderboard

The top 47 models on the GSM8k benchmark, ranked by score (scores from public evaluations).

  1. Kimi K2 Instruct: 97.3%
  2. o1: 97.1%
  3. GPT-4.5: 97.0%
  4. Llama 3.1 405B Instruct: 96.8%
  5. Claude 3.5 Sonnet: 96.4% (tied, rank 5)
  6. Claude 3.5 Sonnet: 96.4% (tied, rank 5)
  7. Gemma 3 27B: 95.9% (tied, rank 7)
  8. Qwen2.5 32B Instruct: 95.9% (tied, rank 7)
  9. Qwen2.5 72B Instruct: 95.8%
  10. DeepSeek-V2.5: 95.1%
  11. Claude 3 Opus: 95.0%
  12. Nova Pro: 94.8% (tied, rank 12)
  13. Qwen2.5 14B Instruct: 94.8% (tied, rank 12)
  14. Nova Lite: 94.5%
  15. Gemma 3 12B: 94.4%
  16. Qwen3 235B A22B: 94.4%
  17. Mistral Large 2: 93.0%
  18. Claude 3 Sonnet: 92.3% (tied, rank 18)
  19. Nova Micro: 92.3% (tied, rank 18)
  20. Kimi K2 Base: 92.1%
  21. Qwen2.5 7B Instruct: 91.6%
  22. Llama 3.1 Nemotron 70B Instruct: 91.4%
  23. Qwen2.5-Coder 32B Instruct: 91.1% (tied, rank 23)
  24. Qwen2 72B Instruct: 91.1% (tied, rank 23)
  25. Gemini 1.5 Pro: 90.8%
  26. Grok-1.5: 90.0%
  27. Gemma 3 4B: 89.2%
  28. Claude 3 Haiku: 88.9%
  29. Phi-3.5-MoE-instruct: 88.7% (tied, rank 29)
  30. Qwen2.5-Omni-7B: 88.7% (tied, rank 29)
  31. Phi 4 Mini: 88.6%
  32. Jamba 1.5 Large: 87.0%
  33. Phi-3.5-mini-instruct: 86.2% (tied, rank 33)
  34. Gemini 1.5 Flash: 86.2% (tied, rank 33)
  35. Qwen2.5-Coder 7B Instruct: 83.9%
  36. Qwen2 7B Instruct: 82.3%
  37. Granite 3.3 8B Instruct: 80.9%
  38. Mistral Small 3 24B Base: 80.7%
  39. Llama 3.2 3B Instruct: 77.7%
  40. Jamba 1.5 Mini: 75.8%
  41. Gemma 2 27B: 74.0%
  42. Command R+: 70.7%
  43. IBM Granite 4.0 Tiny Preview: 70.1%
  44. Gemma 2 9B: 68.6%
  45. Gemma 3 1B: 62.8%
  46. Granite 3.3 8B Base: 59.0%
  47. ERNIE 4.5: 25.2%
