Humanity's Last Exam Benchmark Leaderboard

Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions across mathematics, the humanities, and the natural sciences. It is designed to test LLM capabilities at the frontier of human knowledge, using questions that have unambiguous, verifiable solutions.

Leaderboard

Top 50 models on the Humanity's Last Exam benchmark leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	Claude Mythos Preview	64.7%	—
2	Muse Spark	58.4%	—
3	GPT-5.5 Pro	57.2%	—
4	Claude Opus 4.7	54.7%	—
5	Claude Opus 4.6	53.1%	—
6	GLM-5.1	52.3%	—
7	GPT-5.5	52.2%	—
8	Gemini 3.1 Pro	51.4%	—
9	Kimi K2-Thinking-0905	51.0%	—
10	Grok-4 Heavy	50.7%	—
11	Kimi K2.5	50.2%	—
12	Claude Sonnet 4.6	49.0%	—
13	Qwen3.5-27B	48.5%	—
14	DeepSeek-V4-Pro-Max	48.2%	—
15	Qwen3.5-122B-A10B	47.5%	—
16	Qwen3.5-35B-A3B	47.4%	—
17	Gemini 3 Pro	45.8%	—
18	DeepSeek-V4-Flash-Max	45.1%	—
19	Gemini 3 Flash	43.5%	—
20	GLM-4.7	42.8%	—
21	DeepSeek-V3.2	40.8%	—
22	Gemini 3.5 Flash	40.2%	—
23	Grok-4	40.0%	—
24	GPT-5.4	39.8%	—
25	ERNIE 5.0	39.0%	—
26	GPT-5.2 Pro	36.6%	—
27	Kimi K2.6	36.4%	—
28	GPT-5.2	34.5%	—
29	DeepSeek-V3.2-Speciale	30.6%	—
30	Qwen3.6 Plus	28.8%	—
31	Qwen3.5-397B-A17B	28.7%	—
32	GPT-5.4 mini	28.2%	—
33	Gemma 4 31B	26.5%	—
34	LongCat-Flash-Thinking-2601	25.2%	—
35	DeepSeek-V3.2 (Thinking)	25.1%	—
36	GPT-5	24.8%	—
37	GPT-5.4 nano	24.3%	—
38	Qwen3.6-27B	24.0%	—
39	Nemotron 3 Super (120B A12B)	22.8%	—
40	MiMo-V2-Flash	22.1%	—
41	MiniMax M2.1	22.0%	—
42	Gemini 2.5 Pro Preview 06-05	21.6%	—
43	Qwen3.6-35B-A3B	21.4%	—
44	Grok 4 Fast	20.0%	—
45	DeepSeek-V3.2-Exp	19.8%	—
46	Qwen3-235B-A22B-Thinking-2507	18.2%	—
47	Gemini 2.5 Pro	17.8%	—
48	DeepSeek-R1-0528	17.7%	—
49	GLM-4.6	17.2%	—
49	Gemma 4 26B-A4B	17.2%	—
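
The last two rows share rank 49 because their scores are equal. The page does not say how ties are resolved, so the following is only a minimal sketch, assuming standard competition ranking (tied rows take the rank of the first row in the tie). The function name `competition_rank` and its `start` parameter are hypothetical and used here for illustration only.

```python
from typing import List, Tuple


def competition_rank(
    rows: List[Tuple[str, float]], start: int = 1
) -> List[Tuple[int, str, float]]:
    """Rank (model, score) rows by descending score.

    Assumption: rows with equal scores share the rank of the first
    row in the tie (standard competition ranking). `start` is the
    rank given to the highest-scoring row in `rows`.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=start):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position
        ranked.append((rank, model, score))
    return ranked


# The final three rows of the table, which follow 47 higher-ranked rows.
tail = [
    ("DeepSeek-R1-0528", 17.7),
    ("GLM-4.6", 17.2),
    ("Gemma 4 26B-A4B", 17.2),
]

for rank, model, score in competition_rank(tail, start=48):
    print(f"{rank}\t{model}\t{score}%")
```

Under this assumption, a model placed after the tie would receive rank 51 rather than 50, since two rows occupy positions 49 and 50.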

