Best AI for Math

Proofs, equations, and STEM homework

50 models · 4 benchmarks · Ranked by normalized public benchmark scores (AIME, GPQA, MATH, and related evaluations). Arena live-vote rankings require llm-stats live data, which is not yet in our static export.

  1. LongCat-Flash-Thinking-2601
    Score 99.6
    AIME 99.6%
  2. Nemotron 3 Nano (30B A3B)
    Score 99.2
    AIME 99.2%
  3. GPT OSS 20B High
    Score 98.7
    AIME 98.7%
  4. GPT-5.1 Medium
    Score 98.4
    AIME 98.4%
  5. Step-3.5-Flash
    Score 97.3
    AIME 97.3%
  6. GPT-5.1 Codex High
    Score 96.7
    AIME 96.7%
  7. Sarvam-105B
    Score 96.7
    AIME 96.7%
  8. Sarvam-30B
    Score 96.7
    AIME 96.7%
  9. GPT-5.2 Pro
    Score 96.6 · $1.75/1M input
    AIME 100.0%
    GPQA 93.2%
  10. DeepSeek-V3.2-Speciale
    Score 96.0
    AIME 96.0%
  11. Gemini 3 Pro
    Score 96.0 · 32K context
    AIME 100.0%
    GPQA 91.9%
  12. Claude Opus 4.6
    Score 95.5 · 1M context
    AIME 99.8%
    GPQA 91.3%
  13. Gemini 3 Flash
    Score 95.0 · 32K context
    AIME 99.7%
    GPQA 90.4%
  14. Claude Mythos Preview
    Score 94.6 · 128K context
    GPQA 94.6%
  15. Gemini 3.1 Pro
    Score 94.3
    GPQA 94.3%
  16. Grok-4 Heavy
    Score 94.2
    AIME 100.0%
    GPQA 88.4%
  17. Claude Opus 4.7
    Score 94.2
    GPQA 94.2%
  18. GLM-4.6
    Score 93.9 · 200K context
    AIME 93.9%
  19. GPT-5.1 High
    Score 93.8
    AIME 99.6%
    GPQA 88.1%
  20. Seed 2.0 Pro
    Score 93.6
    AIME 98.3%
    GPQA 88.9%
  21. DeepSeek-V3.2
    Score 93.1
    AIME 93.1%
  22. DeepSeek-V3.2 (Thinking)
    Score 93.1
    AIME 93.1%
  23. K-EXAONE-236B-A23B
    Score 92.8
    AIME 92.8%
  24. o4-mini
    Score 92.7
    AIME 92.7%
  25. GPT OSS 120B High
    Score 92.5
    AIME 92.5%
  26. Qwen3-235B-A22B-Thinking-2507
    Score 92.3
    AIME 92.3%
  27. Kimi K2-Thinking-0905
    Score 92.3 · 256K context
    AIME 100.0%
    GPQA 84.5%
  28. Kimi K2.5
    Score 91.8
    AIME 96.1%
    GPQA 87.6%
  29. GLM-4.7-Flash
    Score 91.6
    AIME 91.6%
  30. Mercury 2
    Score 91.1 · 128K context
    AIME 91.1%
  31. GPT-5 High
    Score 91.0
    AIME 94.6%
    GPQA 87.3%
  32. GLM-4.7
    Score 90.7
    AIME 95.7%
    GPQA 85.7%
  33. LongCat-Flash-Thinking
    Score 90.6
    AIME 90.6%
  34. Kimi K2.6
    Score 90.5 · 256K context
    GPQA 90.5%
  35. MiniStral 3 (14B Instruct 2512)
    Score 90.4
    MATH 90.4%
  36. Mistral Large 3
    Score 90.4
    MATH 90.4%
  37. Qwen3.6 Plus
    Score 90.4
    GPQA 90.4%
  38. Nemotron 3 Super (120B A12B)
    Score 90.2
    AIME 90.2%
  39. DeepSeek-V4-Pro-Max
    Score 90.1
    GPQA 90.1%
  40. Claude Sonnet 4.6
    Score 89.9 · 1M context
    GPQA 89.9%
  41. Gemini 2.0 Flash
    Score 89.7 · 1M context
    MATH 89.7%
  42. Qwen3 VL 235B A22B Thinking
    Score 89.7 · 256K context
    AIME 89.7%
  43. Grok-4
    Score 89.6
    AIME 91.7%
    GPQA 87.5%
  44. Muse Spark
    Score 89.5
    GPQA 89.5%
  45. DeepSeek-V3.2-Exp
    Score 89.3
    AIME 89.3%
  46. Kimi K2 0905
    Score 89.1 · 256K context
    MATH 89.1%
  47. Seed 2.0 Lite
    Score 89.1
    AIME 93.0%
    GPQA 85.1%
  48. Gemma 3 27B
    Score 89.0 · 128K context
    MATH 89.0%
  49. Grok-3
    Score 88.9
    AIME 93.3%
    GPQA 84.6%
  50. MiMo-V2-Flash
    Score 88.9 · 256K context
    AIME 94.1%
    GPQA 83.7%

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of the normalized benchmark results available for that model in this category (0–100 scale), and models are ranked by that average. For example, GPT-5.2 Pro's Score of 96.6 is the mean of its AIME (100.0%) and GPQA (93.2%) results, so a model with a single strong result can outrank one evaluated on several benchmarks. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
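As a rough illustration of the Score column described above, the following Python sketch averages whatever benchmark results a model has and sorts by that average. It assumes each benchmark percentage is already on the 0–100 scale, so normalization is a no-op here; the function name `category_score` and the `models` dictionary are hypothetical and are not part of the site's code. The three sample rows are taken from the list above.

```python
from statistics import mean


def category_score(results: dict[str, float]) -> float:
    """Average the benchmark percentages a model has in this category.

    Assumption: values are already normalized to a 0-100 scale.
    """
    if not results:
        raise ValueError("model has no benchmark results in this category")
    return mean(results.values())


# Sample rows copied from the ranking above (benchmark name -> percent).
models = {
    "GPT-5.2 Pro": {"AIME": 100.0, "GPQA": 93.2},
    "Grok-4": {"AIME": 91.7, "GPQA": 87.5},
    "LongCat-Flash-Thinking-2601": {"AIME": 99.6},
}

# Rank by descending Score; a model with one result is averaged over one value.
ranked = sorted(models, key=lambda name: category_score(models[name]), reverse=True)

for rank, name in enumerate(ranked, start=1):
    print(f"{rank}. {name}  Score {category_score(models[name]):.1f}")
```

Under these assumptions a model with one benchmark is averaged over a single value, which would explain why single-benchmark entries can sit above models that report both AIME and GPQA.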

Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.
