Best AI Overall

Top models across every benchmark

50 models · 5 benchmarks · Ranked by normalized public benchmark scores (SWE-Bench, HumanEval, GPQA, MMLU-Pro, LCB). Arena live-vote rankings require live llm-stats data, which our static export does not yet include.

  1. MiniCPM-SALA
     Score 95.1 · 256K context
     HumanEval 95.1%
  2. Claude Mythos Preview
     Score 94.2 · 128K context
     SWE 93.9%
     GPQA 94.6%
  3. Claude 3.5 Sonnet
     Score 93.7
     HumanEval 93.7%
  4. GPT-5.5
     Score 93.6 · 128K context
     GPQA 93.6%
  5. GPT-5.2 Pro
     Score 93.2 · $1.75/1M input
     GPQA 93.2%
  6. GPT-5.4
     Score 92.8 · 1M context
     GPQA 92.8%
  7. Qwen2.5-Coder 32B Instruct
     Score 92.7 · 128K context
     HumanEval 92.7%
  8. o1-mini
     Score 92.4
     HumanEval 92.4%
  9. Claude 3.5 Sonnet
     Score 92.0
     HumanEval 92.0%
  10. Mistral Large 2
      Score 92.0 · 128K context
      HumanEval 92.0%
  11. Qwen2.5 VL 32B Instruct
      Score 91.5
      HumanEval 91.5%
  12. Claude Opus 4.7
      Score 90.9
      SWE 87.6%
      GPQA 94.2%
  13. GPT-4o
      Score 90.2
      HumanEval 90.2%
  14. Granite 3.3 8B Base
      Score 89.7 · 128K context
      HumanEval 89.7%
  15. Granite 3.3 8B Instruct
      Score 89.7 · 128K context
      HumanEval 89.7%
  16. Gemini Diffusion
      Score 89.6
      HumanEval 89.6%
  17. DeepSeek-V2.5
      Score 89.0
      HumanEval 89.0%
  18. Llama 3.1 405B Instruct
      Score 89.0 · 128K context
      HumanEval 89.0%
  19. Nova Pro
      Score 89.0
      HumanEval 89.0%
  20. Kimi K2 0905
      Score 88.5 · 256K context
      MMLU-Pro 82.5%
      HumanEval 94.5%
  21. Mistral Small 3.1 24B Instruct
      Score 88.4 · 128K context
      HumanEval 88.4%
  22. Grok-2
      Score 88.4
      HumanEval 88.4%
  23. Llama 3.3 70B Instruct
      Score 88.4 · 128K context
      HumanEval 88.4%
  24. Qwen2.5 32B Instruct
      Score 88.4 · 8K context
      HumanEval 88.4%
  25. Qwen2.5-Coder 7B Instruct
      Score 88.4 · 128K context
      HumanEval 88.4%
  26. Claude 3.5 Haiku
      Score 88.1
      HumanEval 88.1%
  27. GPT-5 Medium
      Score 88.1
      GPQA 88.1%
  28. GPT-5.1 High
      Score 88.1
      GPQA 88.1%
  29. o1
      Score 88.1
      HumanEval 88.1%
  30. GPT-4.5
      Score 88.0 · 128K context
      HumanEval 88.0%
  31. GPT-5.4 mini
      Score 88.0 · 128K context
      GPQA 88.0%
  32. DeepSeek-V4-Pro-Max
      Score 87.9
      SWE 80.6%
      GPQA 90.1%
      MMLU-Pro 87.5%
      LCB 93.5%
  33. Gemma 3 27B
      Score 87.8 · 128K context
      HumanEval 87.8%
  34. Gemini 3.1 Pro
      Score 87.5
      SWE 80.6%
      GPQA 94.3%
  35. GPT-5 High
      Score 87.3
      GPQA 87.3%
  36. Kimi K2 Instruct
      Score 87.2
      MMLU-Pro 81.1%
      HumanEval 93.3%
  37. GPT-4o mini
      Score 87.2 · 128K context
      HumanEval 87.2%
  38. GPT-4 Turbo
      Score 87.1
      HumanEval 87.1%
  39. Gemini 3.1 Flash-Lite
      Score 86.9 · 32K context
      GPQA 86.9%
  40. DeepSeek-V4-Flash-Max
      Score 86.2
      SWE 79.0%
      GPQA 88.1%
      MMLU-Pro 86.2%
      LCB 91.6%
  41. GPT-5.2
      Score 86.2 · $1.75/1M input · 256K context
      SWE 80.0%
      GPQA 92.4%
  42. GLM-5.1
      Score 86.2 · 200K context
      GPQA 86.2%
  43. Claude Opus 4.6
      Score 86.1 · 1M context
      SWE 80.8%
      GPQA 91.3%
  44. Sarvam-30B
      Score 86.1
      MMLU-Pro 80.0%
      HumanEval 92.1%
  45. ERNIE 5.0
      Score 86.0
      GPQA 85.0%
      MMLU-Pro 87.0%
  46. Qwen2 72B Instruct
      Score 86.0 · 131K context
      HumanEval 86.0%
  47. Qwen3.6 Plus
      Score 85.9
      SWE 78.8%
      GPQA 90.4%
      MMLU-Pro 88.5%
  48. Grok-2 mini
      Score 85.7
      HumanEval 85.7%
  49. GPT-5.5 Instant
      Score 85.6
      GPQA 85.6%
  50. Gemma 3 12B
      Score 85.4 · 128K context
      HumanEval 85.4%

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of a model's normalized results on the benchmarks it reports in this category (0–100 scale), so a model with a single reported benchmark is scored on that result alone. Models rank higher when that average is higher. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
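As a rough illustration of how the Score column could be derived, the sketch below averages whatever benchmark percentages a model reports and sorts by the result. It assumes the percentages are already on the 0–100 scale (the normalization step itself is not described here), and the names overall_score and rank are hypothetical, not part of any AICompare or llm-stats API. Displayed scores that land exactly on a rounding boundary may differ by 0.1 from what this sketch produces.

```python
from statistics import mean


def overall_score(benchmarks: dict[str, float]) -> float:
    """Average the benchmark percentages a model reports, to one decimal.

    Assumption: values are already normalized to a 0-100 scale.
    """
    if not benchmarks:
        raise ValueError("model reports no benchmark results")
    return round(mean(list(benchmarks.values())), 1)


def rank(models: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    scored = [(name, overall_score(results)) for name, results in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Three entries copied from the list above.
models = {
    "Claude Opus 4.7": {"SWE": 87.6, "GPQA": 94.2},
    "MiniCPM-SALA": {"HumanEval": 95.1},
    "DeepSeek-V4-Pro-Max": {
        "SWE": 80.6, "GPQA": 90.1, "MMLU-Pro": 87.5, "LCB": 93.5,
    },
}

ranking = rank(models)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {score}")
```

Because the average is taken only over the benchmarks a model reports, a model with one strong result can outrank a model with several slightly weaker ones, which is what the list above shows for MiniCPM-SALA and Claude Opus 4.7.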

Coding arenas on AICompare currently list arena types only; full Elo tables will ship once we connect a Supabase or llm-stats API refresh.
