BrowseComp-zh Benchmark Leaderboard

BrowseComp-zh is a high-difficulty benchmark for evaluating LLM agents that browse the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Each question is reverse-engineered from a short, objective, easily verifiable answer, so answering it requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

Leaderboard

Top 13 models on BrowseComp-zh Benchmark Leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	Qwen3.5-397B-A17B	70.3%	—
2	Qwen3.5-122B-A10B	69.9%	—
3	Qwen3.5-35B-A3B	69.5%	—
4	LongCat-Flash-Thinking-2601	69.0%	—
5	GLM-4.7	66.6%	—
6	DeepSeek-V3.2	65.0%	—
6	DeepSeek-V3.2 (Thinking)	65.0%	—
8	Kimi K2-Thinking-0905	62.3%	—
9	Qwen3.5-27B	62.1%	—
10	DeepSeek-V3.1	49.2%	—
11	MiniMax M2	48.5%	—
12	DeepSeek-V3.2-Exp	47.9%	—
13	DeepSeek-R1-0528	35.7%	—
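
The ranks in the table appear to follow standard competition ranking: the two DeepSeek-V3.2 entries tie at 65.0% and share rank 6, and the next entry takes rank 8. A minimal sketch of that tie handling, using the scores from the table above; the function name `competition_rank` is hypothetical and is not part of any leaderboard tooling:

```python
def competition_rank(scores):
    """Rank scores that are already sorted in descending order.

    Tied scores share a rank and the following rank is skipped
    (for example 1, 2, 2, 4). Hypothetical helper for illustration only.
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # otherwise use the 1-based position
    return ranks


# Scores (percent) copied from the leaderboard table, in table order.
scores = [70.3, 69.9, 69.5, 69.0, 66.6, 65.0, 65.0,
          62.3, 62.1, 49.2, 48.5, 47.9, 35.7]
ranks = competition_rank(scores)
print(ranks)
```

The sketch assumes the input is already sorted from highest to lowest score, as it is in the table; unsorted input would need a sort step first.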
