BrowseComp Benchmark Leaderboard

BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's persistence in information gathering, its creativity in web navigation, and its ability to find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easy to verify against reference answers.
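Because answers are short, a score can be read as the fraction of questions answered correctly. The sketch below is illustrative only: it assumes a simple normalized exact-match comparison, and the names `normalize` and `accuracy` are hypothetical. The page does not describe the grading procedure used for the scores it lists, which may differ (for example, by using a model-based grader).

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of reference questions whose prediction matches after normalization.

    Assumption: exact match after normalization counts as correct.
    Questions with no prediction are counted as incorrect.
    """
    if not references:
        return 0.0
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(ref)
        for qid, ref in references.items()
    )
    return correct / len(references)
```

With two questions, one answered correctly, `accuracy({"q1": "Paris.", "q2": "42"}, {"q1": "paris", "q2": "43"})` evaluates to 0.5, which would be reported as 50.0%.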

Leaderboard

Top 46 models on the BrowseComp benchmark (scores from public evaluations). Models with equal scores share a rank.

  1. GPT-5.5 Pro: 90.1%
  2. Claude Mythos Preview: 86.9%
  3. Kimi K2.6: 86.3%
  4. Gemini 3.1 Pro: 85.9%
  5. GPT-5.5: 84.4%
  6. Claude Opus 4.6: 84.0%
  7. DeepSeek-V4-Pro-Max: 83.4%
  8. GPT-5.4: 82.7%
  9. Claude Opus 4.7: 79.3%
  9. GLM-5.1: 79.3%
  11. GPT-5.2 Pro: 77.9%
  12. Seed 2.0 Pro: 77.3%
  13. MiniMax M2.5: 76.3%
  14. GLM-5: 75.9%
  15. Kimi K2.5: 74.9%
  16. Claude Sonnet 4.6: 74.7%
  17. DeepSeek-V4-Flash-Max: 73.2%
  18. Qwen3.5-397B-A17B: 69.0%
  18. Step-3.5-Flash: 69.0%
  20. GPT-5.2: 65.8%
  21. Qwen3.5-122B-A10B: 63.8%
  22. MiniMax M2.1: 62.0%
  23. Qwen3.5-35B-A3B: 61.0%
  23. Qwen3.5-27B: 61.0%
  25. Kimi K2-Thinking-0905: 60.2%
  26. MiMo-V2-Flash: 58.3%
  27. LongCat-Flash-Thinking-2601: 56.6%
  28. GPT-5: 54.9%
  29. GLM-4.7: 52.0%
  30. o4-mini: 51.5%
  31. DeepSeek-V3.2: 51.4%
  31. DeepSeek-V3.2 (Thinking): 51.4%
  33. o3: 49.7%
  34. Sarvam-105B: 49.5%
  35. Mistral Medium 3.5: 48.6%
  36. GLM-4.6: 45.1%
  37. Grok 4 Fast: 44.9%
  38. MiniMax M2: 44.0%
  39. GLM-4.7-Flash: 42.8%
  40. DeepSeek-V3.2-Exp: 40.1%
  41. Sarvam-30B: 35.5%
  42. Nemotron 3 Super (120B A12B): 31.3%
  43. DeepSeek-V3.1: 30.0%
  44. GLM-4.5: 26.4%
  45. GLM-4.5-Air: 21.3%
  46. DeepSeek-R1-0528: 8.9%
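
Tied scores in the list above share a rank, and the following rank is skipped: two models at 79.3% both hold rank 9, and the next entry is rank 11. This is commonly called standard competition ranking. A minimal sketch of how such ranks can be derived from (model, score) pairs follows; the function name `competition_ranks` is hypothetical and is not taken from the leaderboard's own code.

```python
def competition_ranks(
    scores: list[tuple[str, float]],
) -> list[tuple[int, str, float]]:
    """Assign ranks in descending score order; equal scores share a rank.

    After a tie the rank jumps to the entry's 1-based position,
    giving sequences such as 1, 2, 2, 4.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: list[tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as previous entry: reuse its rank
        else:
            rank = position
        ranked.append((rank, name, score))
    return ranked


# A four-entry subset of the scores listed above.
subset = [
    ("GPT-5.4", 82.7),
    ("Claude Opus 4.7", 79.3),
    ("GLM-5.1", 79.3),
    ("GPT-5.2 Pro", 77.9),
]
ranks = [rank for rank, _, _ in competition_ranks(subset)]
```

Within this subset the ranks come out as 1, 2, 2, 4, mirroring the 8, 9, 9, 11 pattern in the full list.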

Models tracked

Models with browsecomp in their evaluation profile.

  • No models linked yet.
