GPQA Benchmark Leaderboard

GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof" and are extremely difficult: even PhD-level domain experts reach only 65% accuracy.

Leaderboard

The top 50 models on the GPQA benchmark, with scores taken from public evaluations.

  1. Claude Mythos Preview: 94.6%
  2. Gemini 3.1 Pro: 94.3%
  3. Claude Opus 4.7: 94.2%
  4. GPT-5.5: 93.6%
  5. GPT-5.2 Pro: 93.2%
  6. GPT-5.4: 92.8%
  7. GPT-5.2: 92.4%
  8. Gemini 3 Pro: 91.9%
  9. Claude Opus 4.6: 91.3%
  10. Kimi K2.6: 90.5%
  11. Gemini 3 Flash: 90.4% (tied for rank 11)
  12. Qwen3.6 Plus: 90.4% (tied for rank 11)
  13. DeepSeek-V4-Pro-Max: 90.1%
  14. Claude Sonnet 4.6: 89.9%
  15. Muse Spark: 89.5%
  16. Seed 2.0 Pro: 88.9%
  17. Grok-4 Heavy: 88.4% (tied for rank 17)
  18. Qwen3.5-397B-A17B: 88.4% (tied for rank 17)
  19. GPT-5.1: 88.1% (tied for rank 19)
  20. GPT-5.1 Thinking: 88.1% (tied for rank 19)
  21. GPT-5.1 High: 88.1% (tied for rank 19)
  22. GPT-5 Medium: 88.1% (tied for rank 19)
  23. DeepSeek-V4-Flash-Max: 88.1% (tied for rank 19)
  24. GPT-5.1 Instant: 88.1% (tied for rank 19)
  25. GPT-5.4 mini: 88.0%
  26. Qwen3.6-27B: 87.8%
  27. Kimi K2.5: 87.6%
  28. Grok-4: 87.5%
  29. GPT-5 High: 87.3%
  30. Claude Opus 4.5: 87.0%
  31. Gemini 3.1 Flash-Lite: 86.9%
  32. Qwen3.5-122B-A10B: 86.6%
  33. Gemini 2.5 Pro Preview 06-05: 86.4%
  34. GLM-5.1: 86.2%
  35. Qwen3.6-35B-A3B: 86.0%
  36. GPT-5: 85.7% (tied for rank 36)
  37. GLM-4.7: 85.7% (tied for rank 36)
  38. Grok 4 Fast: 85.7% (tied for rank 36)
  39. GPT-5.5 Instant: 85.6%
  40. Qwen3.5-27B: 85.5%
  41. Seed 2.0 Lite: 85.1%
  42. ERNIE 5.0: 85.0%
  43. Claude 3.7 Sonnet: 84.8%
  44. Grok-3: 84.6%
  45. Kimi K2-Thinking-0905: 84.5%
  46. Gemma 4 31B: 84.3%
  47. Qwen3.5-35B-A3B: 84.2%
  48. ChatGPT-4o Latest: 84.0% (tied for rank 48)
  49. Grok-3 Mini: 84.0% (tied for rank 48)
  50. MiMo-V2-Flash: 83.7%
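
In the list above, models with identical scores share a rank, and the next distinct score resumes at its list position (for example 11, 11, 13). This pattern looks like standard competition ranking. The sketch below shows one way such ranks could be computed. It is an illustration only: the function name `competition_ranks` and its `start` parameter are hypothetical and are not taken from the leaderboard's own implementation.

```python
def competition_ranks(scores, start=1):
    """Assign competition-style ranks to scores already sorted in descending order.

    Tied scores share the rank of the first entry in the tie; the next
    distinct score takes its positional rank, so ranks can skip numbers.
    `start` is the rank given to the first score (hypothetical helper parameter).
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank equals the entry's position in the list.
            ranks.append(start + i)
    return ranks


# Entries 10 through 13 from the list above.
sample = [
    ("Kimi K2.6", 90.5),
    ("Gemini 3 Flash", 90.4),
    ("Qwen3.6 Plus", 90.4),
    ("DeepSeek-V4-Pro-Max", 90.1),
]
ranks = competition_ranks([score for _, score in sample], start=10)
for rank, (model, score) in zip(ranks, sample):
    print(f"{rank:>2}  {model}: {score:.1f}%")
```

Comparing the reported one-decimal scores for exact equality is a simplifying assumption; a leaderboard that ranks on unrounded scores could order tied-looking entries differently.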

Models tracked

Models with gpqa in their evaluation profile.
