BrowseComp Long Context 256k Benchmark Leaderboard

BrowseComp is a benchmark for measuring the ability of agents to browse the web. It comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions whose answers are obscure, time-invariant, and well supported by evidence scattered across the open web.
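Because answers are short, scoring can be pictured as a comparison of each prediction against its reference. The following is a minimal sketch that assumes a normalized exact-match rule; it is not the benchmark's official grader, and the names `normalize` and `accuracy` are hypothetical helpers introduced here for illustration only.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization.

    Assumption: exact match after normalization stands in for whatever
    grading procedure an actual evaluation uses.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

A leaderboard percentage would then correspond to `100 * accuracy(predictions, references)` under this simplified rule. Exact match is strict, so an evaluation that accepts paraphrased answers would need a more lenient comparison.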

Leaderboard

Top 2 models on the BrowseComp Long Context 256k benchmark (scores from public evaluations).

Rank	Model	Score	Lab
1	GPT-5.2	89.8%	—
2	GPT-5	88.8%	—

Models tracked

Models with browsecomp-long-256k in their evaluation profile.

No models linked yet.