BrowseComp Benchmark Leaderboard

BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's persistence in gathering information, its creativity in web navigation, and its ability to find concise, verifiable answers. Although the questions are difficult, the benchmark itself is simple to use: predicted answers are short and easy to check against reference answers.
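Because answers are short, a score can be thought of as the fraction of questions whose predicted answer matches the reference. The sketch below illustrates that idea with a normalized string comparison. This is an assumption made for illustration only: the function names `normalize` and `accuracy` are hypothetical, the sample answers are made up, and the benchmark's actual grading procedure may differ.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy data, not taken from the benchmark.
predictions = ["Paris.", "  1989 ", "blue"]
references = ["paris", "1989", "red"]
print(f"{accuracy(predictions, references):.1%}")
```

A plain string match is the simplest possible check; it would miss paraphrased answers, which is one reason a real grader might use a more lenient comparison.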

Leaderboard

Top 46 models on the BrowseComp leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	GPT-5.5 Pro	90.1%	—
2	Claude Mythos Preview	86.9%	—
3	Kimi K2.6	86.3%	—
4	Gemini 3.1 Pro	85.9%	—
5	GPT-5.5	84.4%	—
6	Claude Opus 4.6	84.0%	—
7	DeepSeek-V4-Pro-Max	83.4%	—
8	GPT-5.4	82.7%	—
9	Claude Opus 4.7	79.3%	—
9	GLM-5.1	79.3%	—
11	GPT-5.2 Pro	77.9%	—
12	Seed 2.0 Pro	77.3%	—
13	MiniMax M2.5	76.3%	—
14	GLM-5	75.9%	—
15	Kimi K2.5	74.9%	—
16	Claude Sonnet 4.6	74.7%	—
17	DeepSeek-V4-Flash-Max	73.2%	—
18	Qwen3.5-397B-A17B	69.0%	—
18	Step-3.5-Flash	69.0%	—
20	GPT-5.2	65.8%	—
21	Qwen3.5-122B-A10B	63.8%	—
22	MiniMax M2.1	62.0%	—
23	Qwen3.5-35B-A3B	61.0%	—
23	Qwen3.5-27B	61.0%	—
25	Kimi K2-Thinking-0905	60.2%	—
26	MiMo-V2-Flash	58.3%	—
27	LongCat-Flash-Thinking-2601	56.6%	—
28	GPT-5	54.9%	—
29	GLM-4.7	52.0%	—
30	o4-mini	51.5%	—
31	DeepSeek-V3.2	51.4%	—
31	DeepSeek-V3.2 (Thinking)	51.4%	—
33	o3	49.7%	—
34	Sarvam-105B	49.5%	—
35	Mistral Medium 3.5	48.6%	—
36	GLM-4.6	45.1%	—
37	Grok 4 Fast	44.9%	—
38	MiniMax M2	44.0%	—
39	GLM-4.7-Flash	42.8%	—
40	DeepSeek-V3.2-Exp	40.1%	—
41	Sarvam-30B	35.5%	—
42	Nemotron 3 Super (120B A12B)	31.3%	—
43	DeepSeek-V3.1	30.0%	—
44	GLM-4.5	26.4%	—
45	GLM-4.5-Air	21.3%	—
46	DeepSeek-R1-0528	8.9%	—
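The Rank column above appears to follow standard competition ranking: models with equal scores share a rank, and the next rank skips ahead (for example 9, 9, 11). A minimal sketch of that scheme follows; the function name `competition_ranks` is hypothetical, and the scheme is inferred from the table rather than stated by the leaderboard.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Return 1-based ranks; tied scores share a rank and the next rank skips."""
    # Indices of the scores, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[idx] == scores[previous]:
            # Same score as the entry just above: reuse its rank.
            ranks[idx] = ranks[previous]
        else:
            ranks[idx] = position
    return ranks


# Four consecutive rows from the table: GPT-5.4, Claude Opus 4.7, GLM-5.1, GPT-5.2 Pro.
scores = [82.7, 79.3, 79.3, 77.9]
print(competition_ranks(scores))  # [1, 2, 2, 4]
```

Within this four-row slice the ranks are 1, 2, 2, 4, which mirrors the 8, 9, 9, 11 pattern those rows have in the full table.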
