HumanEval Benchmark Leaderboard

HumanEval is a benchmark that measures functional correctness when synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
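To make "functional correctness" concrete, the sketch below scores generated code by running it against unit tests, rather than comparing its text to a reference solution. It is a toy illustration and not the official evaluation harness: the helper names (`passes_tests`, `score`), the two example problems, and the single-attempt scoring are all assumptions made for this example.

```python
from typing import Callable, Dict, List


def passes_tests(candidate_source: str, entry_point: str,
                 test: Callable[[Callable], None]) -> bool:
    """Return True if the candidate defines `entry_point` and the test raises nothing."""
    namespace: Dict[str, object] = {}
    try:
        # Assumption: no sandboxing or timeout. That is acceptable for a toy
        # example but unsafe for untrusted model output.
        exec(candidate_source, namespace)
        test(namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, missing functions, and failed assertions all count as failures.
        return False


# Hypothetical unit tests for two toy problems.
def check_add(fn: Callable) -> None:
    assert fn(2, 3) == 5
    assert fn(-1, 1) == 0


def check_is_even(fn: Callable) -> None:
    assert fn(4) is True
    assert fn(7) is False


# Hypothetical model completions; the second one is deliberately wrong.
problems: List[dict] = [
    {"entry_point": "add", "test": check_add,
     "completion": "def add(a, b):\n    return a + b\n"},
    {"entry_point": "is_even", "test": check_is_even,
     "completion": "def is_even(n):\n    return n % 2 == 1\n"},
]


def score(items: List[dict]) -> float:
    """Percentage of problems whose completion passes all of its tests."""
    solved = sum(
        passes_tests(p["completion"], p["entry_point"], p["test"]) for p in items
    )
    return 100.0 * solved / len(items)


print(f"{score(problems):.1f}%")  # prints 50.0%
```

Under this kind of single-attempt scoring, each problem in a 164-problem set is worth roughly 0.61 percentage points. The scores listed on this page come from public evaluations, whose exact prompting and sampling settings may differ from this sketch.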

Leaderboard

Top 50 models on HumanEval Benchmark Leaderboard (scores from public evaluations).

  1. MiniCPM-SALA: 95.1%
  2. Kimi K2 0905: 94.5%
  3. Claude 3.5 Sonnet: 93.7%
  4. GPT-5: 93.4%
  5. Kimi K2 Instruct: 93.3%
  6. Qwen2.5-Coder 32B Instruct: 92.7%
  7. o1-mini: 92.4%
  8. Sarvam-30B: 92.1%
  9. Claude 3.5 Sonnet: 92.0% (tied at rank 9)
  10. Mistral Large 2: 92.0% (tied at rank 9)
  11. Qwen2.5 VL 32B Instruct: 91.5%
  12. GPT-4o: 90.2%
  13. Granite 3.3 8B Instruct: 89.7% (tied at rank 13)
  14. Granite 3.3 8B Base: 89.7% (tied at rank 13)
  15. Gemini Diffusion: 89.6%
  16. Nova Pro: 89.0% (tied at rank 16)
  17. Llama 3.1 405B Instruct: 89.0% (tied at rank 16)
  18. DeepSeek-V2.5: 89.0% (tied at rank 16)
  19. LongCat-Flash-Chat: 88.4% (tied at rank 19)
  20. Mistral Small 3.1 24B Instruct: 88.4% (tied at rank 19)
  21. Grok-2: 88.4% (tied at rank 21)
  22. Qwen2.5 32B Instruct: 88.4% (tied at rank 21)
  23. Qwen2.5-Coder 7B Instruct: 88.4% (tied at rank 21)
  24. Llama 3.3 70B Instruct: 88.4% (tied at rank 21)
  25. o1: 88.1% (tied at rank 25)
  26. Claude 3.5 Haiku: 88.1% (tied at rank 25)
  27. GPT-4.5: 88.0%
  28. Gemma 3 27B: 87.8%
  29. GPT-4o mini: 87.2%
  30. GPT-4 Turbo: 87.1%
  31. Qwen2.5 72B Instruct: 86.6%
  32. Qwen2 72B Instruct: 86.0%
  33. Grok-2 mini: 85.7%
  34. Gemma 3 12B: 85.4% (tied at rank 34)
  35. Nova Lite: 85.4% (tied at rank 34)
  36. Claude 3 Opus: 84.9%
  37. Qwen2.5 7B Instruct: 84.8% (tied at rank 37)
  38. Mistral Small 3 24B Instruct: 84.8% (tied at rank 37)
  39. Gemini 1.5 Pro: 84.1%
  40. Qwen2.5 14B Instruct: 83.5%
  41. Phi 4: 82.6%
  42. IBM Granite 4.0 Tiny Preview: 82.4%
  43. Nova Micro: 81.1% (tied at rank 43)
  44. Codestral-22B: 81.1% (tied at rank 43)
  45. Llama 3.1 70B Instruct: 80.5%
  46. Qwen2 7B Instruct: 79.9%
  47. Qwen2.5-Omni-7B: 78.7%
  48. Claude 3 Haiku: 75.9%
  49. Gemma 3n E4B Instructed LiteRT Preview: 75.0% (tied at rank 49)
  50. Gemma 3n E4B Instructed: 75.0% (tied at rank 49)
