BrowseComp-zh Benchmark Leaderboard

BrowseComp-zh is a high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, easily verifiable answers, so answering them requires multi-step reasoning and information reconciliation rather than basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

Leaderboard

Top 13 models on BrowseComp-zh Benchmark Leaderboard (scores from public evaluations).

Models tracked

Models with browsecomp-zh in their evaluation profile.

  • No models linked yet.
