BrowseComp-zh Benchmark Leaderboard
BrowseComp-zh is a high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, easily verifiable answers, so solving them requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.
Leaderboard
Top 13 models on BrowseComp-zh Benchmark Leaderboard (scores from public evaluations).
- 1. Qwen3.5-397B-A17B: 70.3%
- 2. Qwen3.5-122B-A10B: 69.9%
- 3. Qwen3.5-35B-A3B: 69.5%
- 4. LongCat-Flash-Thinking-2601: 69.0%
- 5. GLM-4.7: 66.6%
- 6. DeepSeek-V3.2: 65.0%
- 6. DeepSeek-V3.2 (Thinking): 65.0%
- 8. Kimi K2-Thinking-0905: 62.3%
- 9. Qwen3.5-27B: 62.1%
- 10. DeepSeek-V3.1: 49.2%
- 11. MiniMax M2: 48.5%
- 12. DeepSeek-V3.2-Exp: 47.9%
- 13. DeepSeek-R1-0528: 35.7%
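The two entries tied at rank 6, followed directly by rank 8, are consistent with standard competition ranking, where tied scores share a rank and the next rank is skipped. The sketch below reproduces the listed ranks from the listed scores under that assumption; the source does not state how ranks are assigned, and `competition_rank` is a hypothetical helper written for illustration.

```python
# Scores copied from the leaderboard above (percent accuracy).
scores = [
    ("Qwen3.5-397B-A17B", 70.3),
    ("Qwen3.5-122B-A10B", 69.9),
    ("Qwen3.5-35B-A3B", 69.5),
    ("LongCat-Flash-Thinking-2601", 69.0),
    ("GLM-4.7", 66.6),
    ("DeepSeek-V3.2", 65.0),
    ("DeepSeek-V3.2 (Thinking)", 65.0),
    ("Kimi K2-Thinking-0905", 62.3),
    ("Qwen3.5-27B", 62.1),
    ("DeepSeek-V3.1", 49.2),
    ("MiniMax M2", 48.5),
    ("DeepSeek-V3.2-Exp", 47.9),
    ("DeepSeek-R1-0528", 35.7),
]


def competition_rank(entries):
    """Assign ranks so that tied scores share a rank and the next rank is skipped.

    Hypothetical helper: an assumption about how the leaderboard ranks ties.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    ranked = []
    for i, (name, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]  # same score as previous entry: share its rank
        else:
            rank = i + 1  # position in the sorted list, counting from 1
        ranked.append((rank, name, score))
    return ranked


for rank, name, score in competition_rank(scores):
    print(f"{rank:>2}. {name}: {score:.1f}%")
```

Under this scheme no entry receives rank 7, because two models occupy positions 6 and 7 with the same score.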