Best AI for Reasoning

Logic, planning, and hard problems

50 models · 4 benchmarks · Ranked by normalized public benchmark scores (GPQA, ARC, ARC-2, and HLE). Arena live-vote rankings require llm-stats live data, which is not in our static export yet.

  1. Seed 2.0 Pro · Score 88.9
    GPQA 88.9%
  2. GPT-5 Medium · Score 88.1
    GPQA 88.1%
  3. GPT-5.1 · Score 88.1
    GPQA 88.1%
  4. GPT-5.1 High · Score 88.1
    GPQA 88.1%
  5. GPT-5.1 Instant · Score 88.1
    GPQA 88.1%
  6. GPT-5.1 Thinking · Score 88.1
    GPQA 88.1%
  7. GPT-5 High · Score 87.3
    GPQA 87.3%
  8. Gemini 3.1 Flash-Lite · Score 86.9 · 32K context
    GPQA 86.9%
  9. GPT-5.5 Instant · Score 85.6
    GPQA 85.6%
  10. Seed 2.0 Lite · Score 85.1
    GPQA 85.1%
  11. Claude 3.7 Sonnet · Score 84.8
    GPQA 84.8%
  12. Grok-3 · Score 84.6
    GPQA 84.6%
  13. ChatGPT-4o Latest · Score 84.0
    GPQA 84.0%
  14. Grok-3 Mini · Score 84.0
    GPQA 84.0%
  15. GPT-5.5 · Score 81.5 · 128K context
    GPQA 93.6% · ARC-2 85.0% · ARC 95.0% · HLE 52.2%
  16. Claude Mythos Preview · Score 79.7 · 128K context
    GPQA 94.6% · HLE 64.7%
  17. GPT-5.4 · Score 74.9 · 1M context
    GPQA 92.8% · ARC-2 73.3% · ARC 93.7% · HLE 39.8%
  18. Claude Opus 4.7 · Score 74.4
    GPQA 94.2% · HLE 54.7%
  19. Gemini 3.1 Pro · Score 74.3
    GPQA 94.3% · ARC-2 77.1% · HLE 51.4%
  20. Claude Opus 4.6 · Score 71.1 · 1M context
    GPQA 91.3% · ARC-2 68.8% · HLE 53.1%
  21. Grok-4 Heavy · Score 69.5
    GPQA 88.4% · HLE 50.7%
  22. GLM-5.1 · Score 69.3 · 200K context
    GPQA 86.2% · HLE 52.3%
  23. DeepSeek-V4-Pro-Max · Score 69.2
    GPQA 90.1% · HLE 48.2%
  24. Kimi K2.5 · Score 68.9
    GPQA 87.6% · HLE 50.2%
  25. GPT-5.2 Pro · Score 68.6 · $1.75/1M input
    GPQA 93.2% · ARC-2 54.2% · ARC 90.5% · HLE 36.6%
  26. Kimi K2-Thinking-0905 · Score 67.8 · 256K context
    GPQA 84.5% · HLE 51.0%
  27. Qwen3.5-122B-A10B · Score 67.0 · 262K context
    GPQA 86.6% · HLE 47.5%
  28. Qwen3.5-27B · Score 67.0 · 262K context
    GPQA 85.5% · HLE 48.5%
  29. DeepSeek-V4-Flash-Max · Score 66.6
    GPQA 88.1% · HLE 45.1%
  30. GPT-5.2 · Score 66.5 · $1.75/1M input · 256K context
    GPQA 92.4% · ARC-2 52.9% · ARC 86.2% · HLE 34.5%
  31. Qwen3.5-35B-A3B · Score 65.8 · 262K context
    GPQA 84.2% · HLE 47.4%
  32. Claude Sonnet 4.6 · Score 65.7 · 1M context
    GPQA 89.9% · ARC-2 58.3% · HLE 49.0%
  33. GLM-4.7 · Score 64.3
    GPQA 85.7% · HLE 42.8%
  34. Muse Spark · Score 63.5
    GPQA 89.5% · ARC-2 42.5% · HLE 58.4%
  35. Kimi K2.6 · Score 63.5 · 256K context
    GPQA 90.5% · HLE 36.4%
  36. Claude Opus 4.5 · Score 62.3 · 200K context
    GPQA 87.0% · ARC-2 37.6%
  37. ERNIE 5.0 · Score 62.0
    GPQA 85.0% · HLE 39.0%
  38. Qwen3.6 Plus · Score 59.6
    GPQA 90.4% · HLE 28.8%
  39. Qwen3.5-397B-A17B · Score 58.6
    GPQA 88.4% · HLE 28.7%
  40. GPT-5.4 mini · Score 58.1 · 128K context
    GPQA 88.0% · HLE 28.2%
  41. GPT-5.5 Pro · Score 57.2 · $30/1M input · 1M context
    HLE 57.2%
  42. Gemini 3 Pro · Score 56.3 · 32K context
    GPQA 91.9% · ARC-2 31.1% · HLE 45.8%
  43. Gemini 3.5 Flash · Score 56.1
    ARC-2 72.1% · HLE 40.2%
  44. Qwen3.6-27B · Score 55.9
    GPQA 87.8% · HLE 24.0%
  45. Gemini 3 Flash · Score 55.8 · 32K context
    GPQA 90.4% · ARC-2 33.6% · HLE 43.5%
  46. Gemma 4 31B · Score 55.4 · 256K context
    GPQA 84.3% · HLE 26.5%
  47. GPT-5 · Score 55.3 · 128K context
    GPQA 85.7% · HLE 24.8%
  48. Gemini 2.5 Pro Preview 06-05 · Score 54.0 · 1M context
    GPQA 86.4% · HLE 21.6%
  49. Qwen3.6-35B-A3B · Score 53.7 · 262K context
    GPQA 86.0% · HLE 21.4%
  50. MiMo-V2-Flash · Score 52.9 · 256K context
    GPQA 83.7% · HLE 22.1%

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of a model's normalized results on the benchmarks in this category (0–100 scale), taken over the benchmarks the model has results for. Models that rank higher have stronger scores on these reasoning evaluations. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
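
As a rough illustration of the Score column described above, the sketch below averages whatever benchmark percentages a model has listed and ranks by the result. It assumes the listed percentages are already the normalized 0–100 values and that benchmarks without a result are simply skipped. The names `RESULTS`, `category_score`, and `rank` are hypothetical and not AICompare's actual code.

```python
from statistics import mean

# Benchmark percentages copied from three rows of the table above.
# A benchmark with no published result for a model is simply absent.
RESULTS = {
    "GPT-5.4": {"GPQA": 92.8, "ARC-2": 73.3, "ARC": 93.7, "HLE": 39.8},
    "Gemini 3.1 Pro": {"GPQA": 94.3, "ARC-2": 77.1, "HLE": 51.4},
    "Seed 2.0 Pro": {"GPQA": 88.9},
}


def category_score(benchmarks: dict[str, float]) -> float:
    """Hypothetical helper: mean of the available results, to one decimal."""
    if not benchmarks:
        raise ValueError("model has no benchmark results in this category")
    return round(mean(benchmarks.values()), 1)


def rank(results: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sort models by category score, highest first."""
    scored = [(name, category_score(b)) for name, b in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for position, (name, score) in enumerate(rank(RESULTS), start=1):
    print(f"{position}. {name}: {score}")
# 1. Seed 2.0 Pro: 88.9
# 2. GPT-5.4: 74.9
# 3. Gemini 3.1 Pro: 74.3
```

Under this assumption, a model with a single strong GPQA result can outrank a model that also reports harder benchmarks such as HLE, which is worth keeping in mind when comparing rows with different benchmark coverage.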

Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect Supabase or an llm-stats API refresh.
