GPQA Benchmark Leaderboard

GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof" and extremely difficult: experts who hold or are pursuing a PhD in the relevant field reach 65% accuracy.

Leaderboard

Top 50 models on the GPQA benchmark, ranked by score (scores from public evaluations). Models with equal scores share a rank.

Rank	Model	Score	Lab
1	Claude Mythos Preview	94.6%	—
2	Gemini 3.1 Pro	94.3%	—
3	Claude Opus 4.7	94.2%	—
4	GPT-5.5	93.6%	—
5	GPT-5.2 Pro	93.2%	—
6	GPT-5.4	92.8%	—
7	GPT-5.2	92.4%	—
8	Gemini 3 Pro	91.9%	—
9	Claude Opus 4.6	91.3%	—
10	Kimi K2.6	90.5%	—
11	Gemini 3 Flash	90.4%	—
11	Qwen3.6 Plus	90.4%	—
13	DeepSeek-V4-Pro-Max	90.1%	—
14	Claude Sonnet 4.6	89.9%	—
15	Muse Spark	89.5%	—
16	Seed 2.0 Pro	88.9%	—
17	Grok-4 Heavy	88.4%	—
17	Qwen3.5-397B-A17B	88.4%	—
19	GPT-5.1	88.1%	—
19	GPT-5.1 Thinking	88.1%	—
19	GPT-5.1 High	88.1%	—
19	GPT-5 Medium	88.1%	—
19	DeepSeek-V4-Flash-Max	88.1%	—
19	GPT-5.1 Instant	88.1%	—
25	GPT-5.4 mini	88.0%	—
26	Qwen3.6-27B	87.8%	—
27	Kimi K2.5	87.6%	—
28	Grok-4	87.5%	—
29	GPT-5 High	87.3%	—
30	Claude Opus 4.5	87.0%	—
31	Gemini 3.1 Flash-Lite	86.9%	—
32	Qwen3.5-122B-A10B	86.6%	—
33	Gemini 2.5 Pro Preview 06-05	86.4%	—
34	GLM-5.1	86.2%	—
35	Qwen3.6-35B-A3B	86.0%	—
36	GPT-5	85.7%	—
36	GLM-4.7	85.7%	—
36	Grok 4 Fast	85.7%	—
39	GPT-5.5 Instant	85.6%	—
40	Qwen3.5-27B	85.5%	—
41	Seed 2.0 Lite	85.1%	—
42	ERNIE 5.0	85.0%	—
43	Claude 3.7 Sonnet	84.8%	—
44	Grok-3	84.6%	—
45	Kimi K2-Thinking-0905	84.5%	—
46	Gemma 4 31B	84.3%	—
47	Qwen3.5-35B-A3B	84.2%	—
48	ChatGPT-4o Latest	84.0%	—
48	Grok-3 Mini	84.0%	—
50	MiMo-V2-Flash	83.7%	—
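
The Rank column appears to follow standard competition ranking: tied scores share a rank, and the next distinct score skips ahead by the number of tied entries (for example 11, 11, 13). The sketch below reproduces that numbering for four rows copied from the table. The function name `competition_rank` and its `start` parameter are hypothetical, chosen for illustration; this is not necessarily how the leaderboard itself computes ranks.

```python
from typing import List, Tuple


def competition_rank(
    scores: List[Tuple[str, float]], start: int = 1
) -> List[Tuple[int, str, float]]:
    """Assign competition ranks: ties share a rank, the next rank skips ahead.

    `start` is the rank given to the first row (an illustrative parameter,
    useful when ranking a slice of a longer table).
    """
    # Sort by score, highest first; Python's sort is stable, so tied
    # models keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)

    ranked: List[Tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=start):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position  # new score: rank equals the row's position
        ranked.append((rank, model, score))
    return ranked


# Rows 10 to 13 of the table above.
rows = [
    ("Kimi K2.6", 90.5),
    ("Gemini 3 Flash", 90.4),
    ("Qwen3.6 Plus", 90.4),
    ("DeepSeek-V4-Pro-Max", 90.1),
]

for rank, model, score in competition_rank(rows, start=10):
    print(f"{rank}\t{model}\t{score:.1f}%")
```

Calling it with `start=10` yields ranks 10, 11, 11, 13 for these rows, matching the table.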

