MMLU-Pro Benchmark Leaderboard

MMLU-Pro is a more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes an accuracy drop of 16-33% compared to the original MMLU.
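
One consequence of moving from 4 to 10 options is a lower chance-level score. The sketch below works through that arithmetic. It assumes uniform random guessing, exactly one correct answer per question, and that every question carries the full number of options, which may not hold for each item in the dataset. The function name `random_guess_accuracy` is illustrative, not part of any benchmark tooling.

```python
def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing with one correct option."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumed option counts, taken from the description above.
mmlu_baseline = random_guess_accuracy(4)       # original MMLU: 4 options
mmlu_pro_baseline = random_guess_accuracy(10)  # MMLU-Pro: 10 options

print(f"MMLU chance level:     {mmlu_baseline:.0%}")      # 25%
print(f"MMLU-Pro chance level: {mmlu_pro_baseline:.0%}")  # 10%
```

Under these assumptions, a score near 10% on MMLU-Pro indicates no better than guessing, whereas the same floor on the original MMLU sits at 25%.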

Leaderboard

The top 50 models on the MMLU-Pro benchmark (scores from public evaluations).

Rank	Model	Score	Lab
1	Qwen3.6 Plus	88.5%	—
2	MiniMax M2.1	88.0%	—
3	Qwen3.5-397B-A17B	87.8%	—
4	DeepSeek-V4-Pro-Max	87.5%	—
5	Kimi K2.5	87.1%	—
6	ERNIE 5.0	87.0%	—
7	Qwen3.5-122B-A10B	86.7%	—
8	DeepSeek-V4-Flash-Max	86.2%	—
8	Qwen3.6-27B	86.2%	—
10	Qwen3.5-27B	86.1%	—
11	Qwen3.5-35B-A3B	85.3%	—
12	Qwen3.6-35B-A3B	85.2%	—
12	Gemma 4 31B	85.2%	—
14	DeepSeek-V3.2-Exp	85.0%	—
14	DeepSeek-R1-0528	85.0%	—
14	DeepSeek-V3.2 (Thinking)	85.0%	—
14	DeepSeek-V3.2	85.0%	—
18	MiMo-V2-Flash	84.9%	—
19	GLM-4.5	84.6%	—
19	Kimi K2-Thinking-0905	84.6%	—
21	Qwen3-235B-A22B-Thinking-2507	84.4%	—
22	GLM-4.7	84.3%	—
23	K-EXAONE-236B-A23B	83.8%	—
23	Qwen3 VL 235B A22B Thinking	83.8%	—
25	Nemotron 3 Super (120B A12B)	83.7%	—
26	DeepSeek-V3.1	83.7%	—
27	Qwen3-235B-A22B-Instruct-2507	83.0%	—
28	Qwen3-Next-80B-A3B-Thinking	82.7%	—
29	LongCat-Flash-Chat	82.7%	—
30	Gemma 4 26B-A4B	82.6%	—
30	LongCat-Flash-Thinking	82.6%	—
32	Kimi K2 0905	82.5%	—
32	Qwen3.5-9B	82.5%	—
34	Qwen3 VL 32B Thinking	82.1%	—
35	MiniMax M2	82.0%	—
36	Qwen3 VL 235B A22B Instruct	81.8%	—
37	Sarvam-105B	81.7%	—
38	GLM-4.5-Air	81.4%	—
39	DeepSeek-V3 0324	81.2%	—
40	MiniMax M1 80K	81.1%	—
40	Kimi K2-Instruct-0905	81.1%	—
40	Kimi K2 Instruct	81.1%	—
43	GPT OSS 120B High	80.7%	—
44	MiniMax M1 40K	80.6%	—
44	Qwen3-Next-80B-A3B-Instruct	80.6%	—
46	Qwen3 VL 30B A3B Thinking	80.5%	—
46	Llama 4 Maverick	80.5%	—
48	Sarvam-30B	80.0%	—
49	Qwen3.5-4B	79.1%	—
50	Qwen3 VL 32B Instruct	78.6%	—
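
Most tied rows in the table share a rank and the next rank is skipped (for example 8, 8, 10), which matches standard competition ranking. The following is a minimal sketch of that scheme, not the leaderboard's actual implementation. The function name `competition_ranks` and the `start` parameter are hypothetical, and the input is assumed to be sorted from highest to lowest score. The site may rank on unrounded scores, so rows with equal displayed scores can still receive different ranks.

```python
from typing import List


def competition_ranks(scores: List[float], start: int = 1) -> List[int]:
    """Assign ranks to scores sorted in descending order.

    Equal adjacent scores share a rank; the rank after a tie
    skips ahead by the number of tied entries.
    """
    ranks: List[int] = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(start + i)   # position-based rank
    return ranks


# Displayed scores for the rows ranked 7 through 10 in the table above.
rows = [
    ("Qwen3.5-122B-A10B", 86.7),
    ("DeepSeek-V4-Flash-Max", 86.2),
    ("Qwen3.6-27B", 86.2),
    ("Qwen3.5-27B", 86.1),
]
ranks = competition_ranks([score for _, score in rows], start=7)
for rank, (model, score) in zip(ranks, rows):
    print(rank, model, f"{score}%")
```

For these four rows the sketch yields ranks 7, 8, 8, 10, the same as the table.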
