Best AI for Coding

Build, debug, and ship software

50 models · 6 benchmarks · Ranked by normalized public benchmark scores (SWE-Bench, HumanEval, and related evaluations). Arena live-vote rankings require llm-stats live data, which our static export does not yet include.

  1. MiniCPM-SALA
     Score 95.1 · 256K context
     HumanEval 95.1%
  2. Kimi K2 0905
     Score 94.5 · 256K context
     HumanEval 94.5%
  3. Claude 3.5 Sonnet
     Score 93.7
     HumanEval 93.7%
  4. Qwen2.5-Coder 32B Instruct
     Score 92.7 · 128K context
     HumanEval 92.7%
  5. o1-mini
     Score 92.4
     HumanEval 92.4%
  6. Sarvam-30B
     Score 92.1
     HumanEval 92.1%
  7. Claude 3.5 Sonnet
     Score 92.0
     HumanEval 92.0%
  8. Mistral Large 2
     Score 92.0 · 128K context
     HumanEval 92.0%
  9. Qwen2.5 VL 32B Instruct
     Score 91.5
     HumanEval 91.5%
  10. GPT-4o
     Score 90.2
     HumanEval 90.2%
  11. Granite 3.3 8B Base
     Score 89.7 · 128K context
     HumanEval 89.7%
  12. Granite 3.3 8B Instruct
     Score 89.7 · 128K context
     HumanEval 89.7%
  13. Gemini Diffusion
     Score 89.6
     HumanEval 89.6%
  14. DeepSeek-V2.5
     Score 89.0
     HumanEval 89.0%
  15. Llama 3.1 405B Instruct
     Score 89.0 · 128K context
     HumanEval 89.0%
  16. Nova Pro
     Score 89.0
     HumanEval 89.0%
  17. Mistral Small 3.1 24B Instruct
     Score 88.4 · 128K context
     HumanEval 88.4%
  18. Grok-2
     Score 88.4
     HumanEval 88.4%
  19. Llama 3.3 70B Instruct
     Score 88.4 · 128K context
     HumanEval 88.4%
  20. Qwen2.5 32B Instruct
     Score 88.4 · 8K context
     HumanEval 88.4%
  21. Qwen2.5-Coder 7B Instruct
     Score 88.4 · 128K context
     HumanEval 88.4%
  22. Claude 3.5 Haiku
     Score 88.1
     HumanEval 88.1%
  23. o1
     Score 88.1
     HumanEval 88.1%
  24. GPT-4.5
     Score 88.0 · 128K context
     HumanEval 88.0%
  25. Gemma 3 27B
     Score 87.8 · 128K context
     HumanEval 87.8%
  26. GPT-4o mini
     Score 87.2 · 128K context
     HumanEval 87.2%
  27. GPT-4 Turbo
     Score 87.1
     HumanEval 87.1%
  28. Qwen2 72B Instruct
     Score 86.0 · 131K context
     HumanEval 86.0%
  29. Grok-2 mini
     Score 85.7
     HumanEval 85.7%
  30. Gemma 3 12B
     Score 85.4 · 128K context
     HumanEval 85.4%
  31. Nova Lite
     Score 85.4
     HumanEval 85.4%
  32. Claude Mythos Preview
     Score 85.3 · 128K context
     SWE 93.9%
     SWE Pro 77.8%
     Terminal 82.0%
     SWE-i18n 87.3%
  33. Claude 3 Opus
     Score 84.9
     HumanEval 84.9%
  34. Mistral Small 3 24B Instruct
     Score 84.8
     HumanEval 84.8%
  35. Qwen2.5 7B Instruct
     Score 84.8 · 8K context
     HumanEval 84.8%
  36. GPT-5
     Score 84.2 · 128K context
     SWE 74.9%
     HumanEval 93.4%
  37. Gemini 1.5 Pro
     Score 84.1
     HumanEval 84.1%
  38. Qwen2.5 14B Instruct
     Score 83.5 · 128K context
     HumanEval 83.5%
  39. Phi 4
     Score 82.6
     HumanEval 82.6%
  40. IBM Granite 4.0 Tiny Preview
     Score 82.4 · 128K context
     HumanEval 82.4%
  41. Codestral-22B
     Score 81.1
     HumanEval 81.1%
  42. Nova Micro
     Score 81.1
     HumanEval 81.1%
  43. Llama 3.1 70B Instruct
     Score 80.5
     HumanEval 80.5%
  44. Grok-3 Mini
     Score 80.4
     LCB 80.4%
  45. GPT-5.2
     Score 80.0 · $1.75/1M input · 256K context
     SWE 80.0%
  46. Grok 4 Fast
     Score 80.0
     LCB 80.0%
  47. Qwen2 7B Instruct
     Score 79.9 · 131K context
     HumanEval 79.9%
  48. Grok-3
     Score 79.4
     LCB 79.4%
  49. Grok-4 Heavy
     Score 79.4
     LCB 79.4%
  50. LongCat-Flash-Thinking
     Score 79.4
     LCB 79.4%

How this table works

Each column links to a public benchmark leaderboard. The Score column is the average of a model's normalized benchmark results in this category (0–100 scale), and models are ranked by that average. Because the Score is an average over only the benchmarks a model has results for, a model with a single strong result can rank above one evaluated on several benchmarks. The approach is similar in spirit to llm-stats, but we do not yet include the live coding-arena TrueSkill or API latency columns from their live product.
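The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual pipeline: the function names `category_score` and `rank` are hypothetical, the inputs are assumed to be already normalized to the 0–100 scale, and rounding half up to one decimal place is an assumption chosen because it matches the displayed values.

```python
from decimal import Decimal, ROUND_HALF_UP


def category_score(results: dict[str, str]) -> Decimal:
    """Average a model's benchmark percentages, rounded to one decimal.

    `results` maps a benchmark label to its percentage as a string,
    e.g. {"SWE": "74.9", "HumanEval": "93.4"}. Values are assumed to be
    normalized to 0-100 already (an assumption of this sketch).
    """
    if not results:
        raise ValueError("at least one benchmark result is required")
    values = [Decimal(v) for v in results.values()]
    mean = sum(values) / len(values)
    # Half-up rounding is assumed; Decimal avoids binary float surprises.
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def rank(models: dict[str, dict[str, str]]) -> list[tuple[str, Decimal]]:
    """Return (model, score) pairs sorted by score, highest first."""
    scored = [(name, category_score(res)) for name, res in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Benchmark values copied from three entries in the list above.
    sample = {
        "GPT-5": {"SWE": "74.9", "HumanEval": "93.4"},
        "Claude Mythos Preview": {
            "SWE": "93.9",
            "SWE Pro": "77.8",
            "Terminal": "82.0",
            "SWE-i18n": "87.3",
        },
        "GPT-5.2": {"SWE": "80.0"},
    }
    for name, score in rank(sample):
        print(f"{name}: {score}")
```

Under these assumptions the sketch reproduces the listed scores for the multi-benchmark entries: 84.2 for GPT-5 and 85.3 for Claude Mythos Preview. For models with a single benchmark result, the Score equals that result.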

The coding arena pages on AICompare currently list arena types only; full Elo tables will ship once we connect Supabase or an llm-stats API refresh.
