BrowseComp Long Context 128k Benchmark Leaderboard
A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. It comprises 1,266 questions that require strategic reasoning, creative search, and interpretation of retrieved content, and that have short, easily verifiable answers.
Leaderboard
Top 5 models on BrowseComp Long Context 128k Benchmark Leaderboard (scores from public evaluations).
- 1. GPT-5.2: 92.0%
- 2. GPT-5.1: 90.0%
- 2. GPT-5.1 Instant: 90.0%
- 2. GPT-5.1 Thinking: 90.0%
- 2. GPT-5: 90.0%
Models tracked
Models with browsecomp-long-128k in their evaluation profile.
- No models linked yet.