HumanEval Benchmark Leaderboard
A benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
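To make "functional correctness" concrete, the following is a minimal sketch of how a generated completion might be scored: the candidate code is executed and counts as correct only if it passes the problem's unit tests. The toy problem (`add`), the helper names `passes_tests` and `pass_rate`, and the `check(candidate)` test convention are assumptions for illustration and are not taken from this page. A real harness would also need sandboxing and timeouts, which this sketch omits.

```python
# Hypothetical toy problem: a docstring prompt plus hidden unit tests.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

# Two hypothetical model completions appended to the prompt.
GOOD = PROMPT + "    return a + b\n"
BAD = PROMPT + "    return a - b\n"


def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate defines entry_point and passes every test."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # define check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors and failed assertions all count as failures.
        return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Fraction of problems solved, the kind of percentage a leaderboard reports."""
    return sum(results) / len(results) if results else 0.0


results = [passes_tests(GOOD, TESTS, "add"), passes_tests(BAD, TESTS, "add")]
print(f"{pass_rate(results):.1%}")
```

Calling `exec` on untrusted model output is unsafe outside an isolated environment, so this pattern is only suitable as an illustration of the scoring idea.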
Leaderboard
Top 50 models on HumanEval Benchmark Leaderboard (scores from public evaluations).
- 1. MiniCPM-SALA: 95.1%
- 2. Kimi K2 0905: 94.5%
- 3. Claude 3.5 Sonnet: 93.7%
- 4. GPT-5: 93.4%
- 5. Kimi K2 Instruct: 93.3%
- 6. Qwen2.5-Coder 32B Instruct: 92.7%
- 7. o1-mini: 92.4%
- 8. Sarvam-30B: 92.1%
- 9. Claude 3.5 Sonnet: 92.0%
- 9. Mistral Large 2: 92.0%
- 11. Qwen2.5 VL 32B Instruct: 91.5%
- 12. GPT-4o: 90.2%
- 13. Granite 3.3 8B Instruct: 89.7%
- 13. Granite 3.3 8B Base: 89.7%
- 15. Gemini Diffusion: 89.6%
- 16. Nova Pro: 89.0%
- 16. Llama 3.1 405B Instruct: 89.0%
- 16. DeepSeek-V2.5: 89.0%
- 19. LongCat-Flash-Chat: 88.4%
- 19. Mistral Small 3.1 24B Instruct: 88.4%
- 21. Grok-2: 88.4%
- 21. Qwen2.5 32B Instruct: 88.4%
- 21. Qwen2.5-Coder 7B Instruct: 88.4%
- 21. Llama 3.3 70B Instruct: 88.4%
- 25. o1: 88.1%
- 25. Claude 3.5 Haiku: 88.1%
- 27. GPT-4.5: 88.0%
- 28. Gemma 3 27B: 87.8%
- 29. GPT-4o mini: 87.2%
- 30. GPT-4 Turbo: 87.1%
- 31. Qwen2.5 72B Instruct: 86.6%
- 32. Qwen2 72B Instruct: 86.0%
- 33. Grok-2 mini: 85.7%
- 34. Gemma 3 12B: 85.4%
- 34. Nova Lite: 85.4%
- 36. Claude 3 Opus: 84.9%
- 37. Qwen2.5 7B Instruct: 84.8%
- 37. Mistral Small 3 24B Instruct: 84.8%
- 39. Gemini 1.5 Pro: 84.1%
- 40. Qwen2.5 14B Instruct: 83.5%
- 41. Phi 4: 82.6%
- 42. IBM Granite 4.0 Tiny Preview: 82.4%
- 43. Nova Micro: 81.1%
- 43. Codestral-22B: 81.1%
- 45. Llama 3.1 70B Instruct: 80.5%
- 46. Qwen2 7B Instruct: 79.9%
- 47. Qwen2.5-Omni-7B: 78.7%
- 48. Claude 3 Haiku: 75.9%
- 49. Gemma 3n E4B Instructed LiteRT Preview: 75.0%
- 49. Gemma 3n E4B Instructed: 75.0%
Models tracked
Models with HumanEval in their evaluation profile.
- ChatGPT-4o Latest
- Claude 3.5 Haiku (Anthropic)
- Claude 3.5 Sonnet
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude 3 Haiku
- Claude 3 Opus (Anthropic)
- Claude 3 Sonnet
- Claude Haiku 4.5 (Anthropic)
- Claude Mythos Preview (Anthropic)
- Claude Opus 4.1
- Claude Opus 4 (Anthropic)
- Claude Opus 4.5
- Claude Opus 4.6 (Anthropic)
- Claude Opus 4.7 (Anthropic)
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Sonnet 4.6
- Codestral-22B
- Command R+
- DeepSeek-V3.2 (Non-thinking) (DeepSeek)
- DeepSeek-R1-0528 (DeepSeek)
- DeepSeek R1 Distill Llama 70B (DeepSeek)
- DeepSeek R1 Distill Llama 8B (DeepSeek)
- DeepSeek R1 Distill Qwen 14B (DeepSeek)
- DeepSeek R1 Distill Qwen 32B (DeepSeek)
- DeepSeek R1 Distill Qwen 7B (DeepSeek)
- DeepSeek R1 Zero (DeepSeek)
- DeepSeek-V3.2 (Thinking) (DeepSeek)
- DeepSeek-V2.5 (DeepSeek)
- DeepSeek-V3 0324
- DeepSeek-V3.1 (DeepSeek)
- DeepSeek-V3.2-Exp (DeepSeek)
- DeepSeek-V3.2-Speciale (DeepSeek)
- DeepSeek-V3.2 (DeepSeek)
- DeepSeek-V3
- DeepSeek-V4-Flash-Max (DeepSeek)
- DeepSeek-V4-Pro-Max (DeepSeek)
- DeepSeek VL2 Small (DeepSeek)
- DeepSeek VL2 Tiny (DeepSeek)
- DeepSeek VL2 (DeepSeek)
- ERNIE 4.5
- ERNIE 5.0
- Gemini 1.0 Pro
- Gemini 1.5 Flash 8B