MMLU-Pro Benchmark Leaderboard

MMLU-Pro is a more robust and challenging multi-task language understanding benchmark. It extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It contains over 12,000 curated questions across 14 domains, and models lose 16-33% accuracy on it compared to the original MMLU.
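One consequence of moving from 4 to 10 options is a lower random-guessing baseline. The sketch below works through that arithmetic. It assumes every question has exactly the stated number of options and that guesses are uniform; the function name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: 4 options per question for MMLU, 10 for MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance level:     {mmlu_chance:.0%}")
print(f"MMLU-Pro chance level: {mmlu_pro_chance:.0%}")
```

Under these assumptions the guessing floor falls from 25% to 10%, so a given score on MMLU-Pro is less likely to be inflated by lucky guesses.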

Leaderboard

Top 50 models on the MMLU-Pro benchmark (scores from public evaluations).

  1. Qwen3.6 Plus: 88.5%
  2. MiniMax M2.1: 88.0%
  3. Qwen3.5-397B-A17B: 87.8%
  4. DeepSeek-V4-Pro-Max: 87.5%
  5. Kimi K2.5: 87.1%
  6. ERNIE 5.0: 87.0%
  7. Qwen3.5-122B-A10B: 86.7%
  8. DeepSeek-V4-Flash-Max: 86.2%
  8. Qwen3.6-27B: 86.2%
  10. Qwen3.5-27B: 86.1%
  11. Qwen3.5-35B-A3B: 85.3%
  12. Qwen3.6-35B-A3B: 85.2%
  12. Gemma 4 31B: 85.2%
  14. DeepSeek-V3.2-Exp: 85.0%
  14. DeepSeek-R1-0528: 85.0%
  14. DeepSeek-V3.2 (Thinking): 85.0%
  14. DeepSeek-V3.2: 85.0%
  18. MiMo-V2-Flash: 84.9%
  19. GLM-4.5: 84.6%
  19. Kimi K2-Thinking-0905: 84.6%
  21. Qwen3-235B-A22B-Thinking-2507: 84.4%
  22. GLM-4.7: 84.3%
  23. K-EXAONE-236B-A23B: 83.8%
  23. Qwen3 VL 235B A22B Thinking: 83.8%
  25. Nemotron 3 Super (120B A12B): 83.7%
  26. DeepSeek-V3.1: 83.7%
  27. Qwen3-235B-A22B-Instruct-2507: 83.0%
  28. Qwen3-Next-80B-A3B-Thinking: 82.7%
  29. LongCat-Flash-Chat: 82.7%
  30. Gemma 4 26B-A4B: 82.6%
  30. LongCat-Flash-Thinking: 82.6%
  32. Kimi K2 0905: 82.5%
  32. Qwen3.5-9B: 82.5%
  34. Qwen3 VL 32B Thinking: 82.1%
  35. MiniMax M2: 82.0%
  36. Qwen3 VL 235B A22B Instruct: 81.8%
  37. Sarvam-105B: 81.7%
  38. GLM-4.5-Air: 81.4%
  39. DeepSeek-V3 0324: 81.2%
  40. MiniMax M1 80K: 81.1%
  40. Kimi K2-Instruct-0905: 81.1%
  40. Kimi K2 Instruct: 81.1%
  43. GPT OSS 120B High: 80.7%
  44. MiniMax M1 40K: 80.6%
  44. Qwen3-Next-80B-A3B-Instruct: 80.6%
  46. Qwen3 VL 30B A3B Thinking: 80.5%
  46. Llama 4 Maverick: 80.5%
  48. Sarvam-30B: 80.0%
  49. Qwen3.5-4B: 79.1%
  50. Qwen3 VL 32B Instruct: 78.6%
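
The ranks in the list above repeat when scores tie (two models share rank 8 and the next model is rank 10), which matches standard competition ranking. The sketch below shows one way such ranks could be computed from scores sorted in descending order. It is an illustration only, not the leaderboard's actual code; the function name `competition_ranks` is hypothetical. A few adjacent entries show the same rounded score with different ranks (for example 83.7% at ranks 25 and 26), presumably because ranking uses unrounded scores; that is an assumption, and the sketch ranks on the displayed values only.

```python
def competition_ranks(scores):
    """Assign standard competition ranks (1, 2, 2, 4, ...) to scores sorted descending."""
    ranks = []
    for position, score in enumerate(scores, start=1):
        # A score equal to the previous one inherits the previous rank;
        # otherwise the rank is the 1-based position in the sorted list.
        if ranks and score == scores[position - 2]:
            ranks.append(ranks[-1])
        else:
            ranks.append(position)
    return ranks


# Displayed scores of the first ten entries in the list above.
top_ten = [88.5, 88.0, 87.8, 87.5, 87.1, 87.0, 86.7, 86.2, 86.2, 86.1]
print(competition_ranks(top_ten))  # [1, 2, 3, 4, 5, 6, 7, 8, 8, 10]
```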

Models tracked

Models with mmlu-pro in their evaluation profile.
