GPQA Benchmark Leaderboard
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof" and are extremely difficult: PhD-level domain experts reach only 65% accuracy.
Leaderboard
Top 50 models on the GPQA benchmark (scores from public evaluations).
- 1. Claude Mythos Preview: 94.6%
- 2. Gemini 3.1 Pro: 94.3%
- 3. Claude Opus 4.7: 94.2%
- 4. GPT-5.5: 93.6%
- 5. GPT-5.2 Pro: 93.2%
- 6. GPT-5.4: 92.8%
- 7. GPT-5.2: 92.4%
- 8. Gemini 3 Pro: 91.9%
- 9. Claude Opus 4.6: 91.3%
- 10. Kimi K2.6: 90.5%
- 11. Gemini 3 Flash: 90.4%
- 11. Qwen3.6 Plus: 90.4%
- 13. DeepSeek-V4-Pro-Max: 90.1%
- 14. Claude Sonnet 4.6: 89.9%
- 15. Muse Spark: 89.5%
- 16. Seed 2.0 Pro: 88.9%
- 17. Grok-4 Heavy: 88.4%
- 17. Qwen3.5-397B-A17B: 88.4%
- 19. GPT-5.1: 88.1%
- 19. GPT-5.1 Thinking: 88.1%
- 19. GPT-5.1 High: 88.1%
- 19. GPT-5 Medium: 88.1%
- 19. DeepSeek-V4-Flash-Max: 88.1%
- 19. GPT-5.1 Instant: 88.1%
- 25. GPT-5.4 mini: 88.0%
- 26. Qwen3.6-27B: 87.8%
- 27. Kimi K2.5: 87.6%
- 28. Grok-4: 87.5%
- 29. GPT-5 High: 87.3%
- 30. Claude Opus 4.5: 87.0%
- 31. Gemini 3.1 Flash-Lite: 86.9%
- 32. Qwen3.5-122B-A10B: 86.6%
- 33. Gemini 2.5 Pro Preview 06-05: 86.4%
- 34. GLM-5.1: 86.2%
- 35. Qwen3.6-35B-A3B: 86.0%
- 36. GPT-5: 85.7%
- 36. GLM-4.7: 85.7%
- 36. Grok 4 Fast: 85.7%
- 39. GPT-5.5 Instant: 85.6%
- 40. Qwen3.5-27B: 85.5%
- 41. Seed 2.0 Lite: 85.1%
- 42. ERNIE 5.0: 85.0%
- 43. Claude 3.7 Sonnet: 84.8%
- 44. Grok-3: 84.6%
- 45. Kimi K2-Thinking-0905: 84.5%
- 46. Gemma 4 31B: 84.3%
- 47. Qwen3.5-35B-A3B: 84.2%
- 48. ChatGPT-4o Latest: 84.0%
- 48. Grok-3 Mini: 84.0%
- 50. MiMo-V2-Flash: 83.7%
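Tied scores in the list above share a rank, and the next rank skips ahead (for example 11, 11, 13). The page does not state how ranks are computed; the pattern matches standard competition ranking, which the following minimal Python sketch reproduces. The function name `competition_ranks` and its `start` parameter are illustrative assumptions, not part of the leaderboard's tooling.

```python
def competition_ranks(scores, start=1):
    """Assign standard competition ranks to scores sorted from high to low.

    Tied scores share the rank of the first tied entry; the rank after a
    tie is the entry's position in the list, so numbers are skipped.
    `start` is the rank of the first score (hypothetical helper parameter).
    """
    ranks = []
    for position, score in enumerate(scores):
        if position > 0 and score == scores[position - 1]:
            ranks.append(ranks[-1])          # tie: reuse the previous rank
        else:
            ranks.append(start + position)   # no tie: rank follows position
    return ranks


# Excerpt from the leaderboard above, starting at rank 10.
excerpt = [
    ("Kimi K2.6", 90.5),
    ("Gemini 3 Flash", 90.4),
    ("Qwen3.6 Plus", 90.4),
    ("DeepSeek-V4-Pro-Max", 90.1),
]
ranks = competition_ranks([score for _, score in excerpt], start=10)
for rank, (name, score) in zip(ranks, excerpt):
    print(f"{rank}. {name}: {score}%")
```

With this scheme a tie never changes how many models sit above a given entry, which is why rank 12 does not appear in the list.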
Models tracked
Models with gpqa in their evaluation profile.
- ChatGPT-4o Latest
- Claude 3.5 Haiku (Anthropic)
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude 3 Haiku
- Claude 3 Opus (Anthropic)
- Claude 3 Sonnet
- Claude Haiku 4.5 (Anthropic)
- Claude Mythos Preview (Anthropic)
- Claude Opus 4.1
- Claude Opus 4 (Anthropic)
- Claude Opus 4.5
- Claude Opus 4.6 (Anthropic)
- Claude Opus 4.7 (Anthropic)
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Sonnet 4.6
- Codestral-22B
- Command R+
- DeepSeek-V3.2 (Non-thinking) (DeepSeek)
- DeepSeek-R1-0528 (DeepSeek)
- DeepSeek R1 Distill Llama 70B (DeepSeek)
- DeepSeek R1 Distill Llama 8B (DeepSeek)
- DeepSeek R1 Distill Qwen 14B (DeepSeek)
- DeepSeek R1 Distill Qwen 32B (DeepSeek)
- DeepSeek R1 Distill Qwen 7B (DeepSeek)
- DeepSeek R1 Zero (DeepSeek)
- DeepSeek-V3.2 (Thinking) (DeepSeek)
- DeepSeek-V2.5 (DeepSeek)
- DeepSeek-V3 0324
- DeepSeek-V3.1 (DeepSeek)
- DeepSeek-V3.2-Exp (DeepSeek)
- DeepSeek-V3.2-Speciale (DeepSeek)
- DeepSeek-V3.2 (DeepSeek)
- DeepSeek-V3
- DeepSeek-V4-Flash-Max (DeepSeek)
- DeepSeek-V4-Pro-Max (DeepSeek)
- DeepSeek VL2 Small (DeepSeek)
- DeepSeek VL2 Tiny (DeepSeek)
- DeepSeek VL2 (DeepSeek)
- ERNIE 4.5
- ERNIE 5.0
- Gemini 1.0 Pro
- Gemini 1.5 Flash 8B