MMLU-Pro Benchmark Leaderboard
MMLU-Pro is a more robust and challenging multi-task language understanding benchmark. It extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains, and models score 16-33% lower on it than on the original MMLU.
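One consequence of moving from 4 to 10 options is a lower random-guess baseline. The sketch below works through that arithmetic. It assumes every question has exactly the stated number of options and that guessing is uniform; the function name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: 4 options per question for MMLU, 10 for MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance: {mmlu_chance:.0%}, MMLU-Pro chance: {mmlu_pro_chance:.0%}")
```

Under these assumptions the guessing floor falls from 25% to 10%, so a given score on MMLU-Pro sits further above chance than the same score on MMLU.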
Leaderboard
Top 50 models on MMLU-Pro Benchmark Leaderboard (scores from public evaluations).
- 1. Qwen3.6 Plus: 88.5%
- 2. MiniMax M2.1: 88.0%
- 3. Qwen3.5-397B-A17B: 87.8%
- 4. DeepSeek-V4-Pro-Max: 87.5%
- 5. Kimi K2.5: 87.1%
- 6. ERNIE 5.0: 87.0%
- 7. Qwen3.5-122B-A10B: 86.7%
- 8. DeepSeek-V4-Flash-Max: 86.2%
- 8. Qwen3.6-27B: 86.2%
- 10. Qwen3.5-27B: 86.1%
- 11. Qwen3.5-35B-A3B: 85.3%
- 12. Qwen3.6-35B-A3B: 85.2%
- 12. Gemma 4 31B: 85.2%
- 14. DeepSeek-V3.2-Exp: 85.0%
- 14. DeepSeek-R1-0528: 85.0%
- 14. DeepSeek-V3.2 (Thinking): 85.0%
- 14. DeepSeek-V3.2: 85.0%
- 18. MiMo-V2-Flash: 84.9%
- 19. GLM-4.5: 84.6%
- 19. Kimi K2-Thinking-0905: 84.6%
- 21. Qwen3-235B-A22B-Thinking-2507: 84.4%
- 22. GLM-4.7: 84.3%
- 23. K-EXAONE-236B-A23B: 83.8%
- 23. Qwen3 VL 235B A22B Thinking: 83.8%
- 25. Nemotron 3 Super (120B A12B): 83.7%
- 26. DeepSeek-V3.1: 83.7%
- 27. Qwen3-235B-A22B-Instruct-2507: 83.0%
- 28. Qwen3-Next-80B-A3B-Thinking: 82.7%
- 29. LongCat-Flash-Chat: 82.7%
- 30. Gemma 4 26B-A4B: 82.6%
- 30. LongCat-Flash-Thinking: 82.6%
- 32. Kimi K2 0905: 82.5%
- 32. Qwen3.5-9B: 82.5%
- 34. Qwen3 VL 32B Thinking: 82.1%
- 35. MiniMax M2: 82.0%
- 36. Qwen3 VL 235B A22B Instruct: 81.8%
- 37. Sarvam-105B: 81.7%
- 38. GLM-4.5-Air: 81.4%
- 39. DeepSeek-V3 0324: 81.2%
- 40. MiniMax M1 80K: 81.1%
- 40. Kimi K2-Instruct-0905: 81.1%
- 40. Kimi K2 Instruct: 81.1%
- 43. GPT OSS 120B High: 80.7%
- 44. MiniMax M1 40K: 80.6%
- 44. Qwen3-Next-80B-A3B-Instruct: 80.6%
- 46. Qwen3 VL 30B A3B Thinking: 80.5%
- 46. Llama 4 Maverick: 80.5%
- 48. Sarvam-30B: 80.0%
- 49. Qwen3.5-4B: 79.1%
- 50. Qwen3 VL 32B Instruct: 78.6%
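Most tied scores in the list above share a rank and the following rank is skipped (for example 8, 8, 10), which is consistent with standard competition ranking. The sketch below reproduces that pattern for four entries taken from the list. It is an illustration, not the site's actual ranking code: the function name `competition_rank` and the `first_rank` parameter are hypothetical, and it works only on the displayed one-decimal scores, so it cannot explain the few places where equal displayed scores carry different ranks (for example ranks 25 and 26).

```python
def competition_rank(scores, first_rank=1):
    """Rank (name, score) pairs high to low; equal scores share the lowest rank."""
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=first_rank):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous entry's rank
        else:
            rank = position       # no tie: rank equals list position
        ranked.append((rank, name, score))
    return ranked


# Four consecutive entries from the leaderboard, starting at rank 7.
sample = [
    ("Qwen3.5-122B-A10B", 86.7),
    ("DeepSeek-V4-Flash-Max", 86.2),
    ("Qwen3.6-27B", 86.2),
    ("Qwen3.5-27B", 86.1),
]

for rank, name, score in competition_rank(sample, first_rank=7):
    print(f"{rank}. {name}: {score}%")
```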
Models tracked
Models with mmlu-pro in their evaluation profile.
- ChatGPT-4o Latest
- Claude 3.5 Haiku, Anthropic
- Claude 3.5 Sonnet
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude 3 Haiku
- Claude 3 Opus, Anthropic
- Claude 3 Sonnet
- Claude Haiku 4.5, Anthropic
- Claude Mythos Preview, Anthropic
- Claude Opus 4.1
- Claude Opus 4, Anthropic
- Claude Opus 4.5
- Claude Opus 4.6, Anthropic
- Claude Opus 4.7, Anthropic
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Sonnet 4.6
- Codestral-22B
- Command R+
- DeepSeek-V3.2 (Non-thinking), DeepSeek
- DeepSeek-R1-0528, DeepSeek
- DeepSeek R1 Distill Llama 70B, DeepSeek
- DeepSeek R1 Distill Llama 8B, DeepSeek
- DeepSeek R1 Distill Qwen 14B, DeepSeek
- DeepSeek R1 Distill Qwen 32B, DeepSeek
- DeepSeek R1 Distill Qwen 7B, DeepSeek
- DeepSeek R1 Zero, DeepSeek
- DeepSeek-V3.2 (Thinking), DeepSeek
- DeepSeek-V2.5, DeepSeek
- DeepSeek-V3 0324
- DeepSeek-V3.1, DeepSeek
- DeepSeek-V3.2-Exp, DeepSeek
- DeepSeek-V3.2-Speciale, DeepSeek
- DeepSeek-V3.2, DeepSeek
- DeepSeek-V3
- DeepSeek-V4-Flash-Max, DeepSeek
- DeepSeek-V4-Pro-Max, DeepSeek
- DeepSeek VL2 Small, DeepSeek
- DeepSeek VL2 Tiny, DeepSeek
- DeepSeek VL2, DeepSeek
- ERNIE 4.5
- ERNIE 5.0
- Gemini 1.0 Pro
- Gemini 1.5 Flash 8B