BrowseComp-ZH: Chinese Web Agent Benchmark
- BrowseComp-ZH is a high-difficulty benchmarking suite that rigorously evaluates LLM agents navigating the Chinese web with tasks requiring multi-hop reasoning and information reconciliation.
- It employs 289 reverse-engineered, multi-step questions across 11 domains to test search planning, evidence aggregation, and contextual inference.
- Evaluation protocols like exact-match accuracy and calibration error highlight performance gaps between closed LLMs and agentic search systems in complex, real-world conditions.
BrowseComp-ZH is a high-difficulty benchmarking suite designed to rigorously evaluate the web-browsing, multi-hop reasoning, and information reconciliation abilities of LLM agents operating within the Chinese-language web ecosystem. Motivated by the pervasive limitations of English-centric evaluation and the unique linguistic, infrastructural, and policy challenges of the Chinese web, BrowseComp-ZH introduces multi-step question answering tasks that require comprehensive search, data integration, and contextual inference over diverse Chinese online media.
1. Motivations and Distinctive Challenges
BrowseComp-ZH arises from key gaps in prior LLM-agent research and benchmarking. Existing benchmarks such as BrowseComp focus on English-language web resources (e.g., Wikipedia, IMDb) and do not model the complexities inherent in the Chinese web. Directly translating such benchmarks is fundamentally insufficient due to:
- Fragmented content landscape: The Chinese web is distributed across platforms such as Baidu Baike, Zhihu, academic portals, and government sites, many of which have unique indexing, search APIs, or proprietary data structures.
- Linguistic barriers: Written Chinese lacks explicit word boundaries and makes frequent use of idiomatic expressions, ellipsis, and highly context-dependent syntax, which impedes straightforward keyword matching and naive prompt translation.
- Censorship and content volatility: Dynamic filtering, frequent URL relocation, and access restrictions can obscure or remove critical information, exacerbating data incompleteness and increasing reasoning demands.
BrowseComp-ZH is specifically constructed to target these challenges, isolating the agentic abilities of LLMs in the context of a native Chinese, multi-source, open-web environment (Zhou et al., 27 Apr 2025).
2. Dataset Construction Methodology
2.1 Question Design
BrowseComp-ZH comprises 289 validated questions spanning 11 domains—Film & TV, Art, Geography, Music, History, Medicine, Technology, Sports, Policy & Law, Video Games, and Academic Research. Each question is engineered to require at least two orthogonal reasoning constraints (multi-hop) so that no answer is directly recoverable via a single keyword search.
2.2 Reverse-Engineering Protocol
Questions are generated by reverse-engineering from concise, objectively verifiable answers (dates, numbers, named entities). Annotators construct questions that can only be fulfilled by satisfying composite constraints. For example, a question may require first identifying an institution meeting a categorical filter, then verifying temporal information only available from another independent source.
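The reverse-engineering protocol can be sketched as a small data model. This is an illustrative reconstruction, not the benchmark's actual tooling; the class and field names (`Constraint`, `ReverseEngineeredQuestion`, `source`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    """One orthogonal condition the final answer must satisfy."""
    description: str   # e.g. "institution matching a categorical filter"
    source: str        # the independent source that verifies this condition

@dataclass
class ReverseEngineeredQuestion:
    """A question built backwards from a concise, verifiable answer."""
    answer: str                              # date, number, or named entity
    constraints: list[Constraint] = field(default_factory=list)

    def is_multi_hop(self) -> bool:
        # BrowseComp-ZH requires at least two orthogonal constraints,
        # each verifiable only through a distinct independent source.
        sources = {c.source for c in self.constraints}
        return len(self.constraints) >= 2 and len(sources) >= 2

# Example mirroring the text: first identify an institution via a
# categorical filter, then verify a date from a separate source.
q = ReverseEngineeredQuestion(
    answer="1987",
    constraints=[
        Constraint("university matching a categorical filter", "Baidu Baike"),
        Constraint("founding year stated in an archived gazette", "government portal"),
    ],
)
assert q.is_multi_hop()
```

A question whose constraints all resolve through a single source would fail `is_multi_hop()`, reflecting the requirement that no answer be recoverable via one keyword search.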
2.3 Quality Control
Quality assurance is two-staged:
- Stage 1: Difficulty Check—Questions whose answers appear on the first page of any mainstream search engine (Baidu, Bing, Google) are removed.
- Stage 2: Uniqueness Check—State-of-the-art AI agents and human annotators confirm answer singularity by exhaustive search. Any item with multiple possible correct answers is excluded.
The resulting benchmark emphasizes both high difficulty and answer uniqueness, setting it apart from typical web QA datasets (Zhou et al., 27 Apr 2025).
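The two-stage filter can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: `first_page_results` stands in for pre-collected page-one snippets from each engine, and `candidate_answers` for the answer set produced by exhaustive agent-plus-human search; neither name comes from the benchmark's released code:

```python
def passes_quality_control(question, reference_answer,
                           first_page_results, candidate_answers):
    """Two-stage filter mirroring the benchmark's protocol.

    first_page_results: dict mapping engine name (Baidu, Bing, Google)
        to the list of first-page result snippets for the question.
    candidate_answers: all answers judged consistent with the
        question's constraints after exhaustive search.
    """
    # Stage 1 (difficulty check): drop questions whose answer surfaces
    # on the first results page of any mainstream engine.
    for engine, snippets in first_page_results.items():
        if any(reference_answer in s for s in snippets):
            return False
    # Stage 2 (uniqueness check): drop questions admitting more than
    # one defensible answer.
    return set(candidate_answers) == {reference_answer}
```

Only questions that are both hard to retrieve directly and uniquely answerable survive, which is what distinguishes the benchmark from typical web QA datasets.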
3. Evaluation Protocols and Metrics
Central evaluation measures are:
- Exact-match accuracy (ACC):

  $$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right]$$

  the proportion of the $N$ queries for which an agent's answer $\hat{y}_i$ matches the reference $y_i$ exactly (no partial credit).
- Expected Calibration Error (ECE):

  $$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$

  where $|B_m|$ is the sample count in confidence bin $B_m$, and $\mathrm{acc}(B_m)$, $\mathrm{conf}(B_m)$ are the empirical accuracy and mean confidence of each bin.
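Both metrics are straightforward to compute; a standard implementation (using equal-width confidence bins, a common convention that the paper may or may not follow exactly) looks like:

```python
import numpy as np

def exact_match_accuracy(preds, refs):
    """ACC: fraction of predictions matching the reference exactly."""
    return float(np.mean([p == r for p, r in zip(preds, refs)]))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)|
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0   # include zero-confidence samples
        if in_bin.any():
            acc_b = correct[in_bin].mean()     # empirical accuracy in bin
            conf_b = confidences[in_bin].mean()  # mean confidence in bin
            ece += (in_bin.sum() / n) * abs(acc_b - conf_b)
    return ece
```

A model that is always 90% confident but right only half the time incurs an ECE of 0.4, which is the kind of overconfidence gap the benchmark surfaces.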
Agents fall into two paradigms:
- Closed LLMs (using only parametric memory inference)
- Agentic search systems (or “AI-search products”) with explicit browsing and tool-use capacity, typically operated through multi-turn search and parsing of result interfaces.
Retrieval strategies are further distinguished by single-shot versus multi-turn iterative search, with planning and evidence reconciliation (“chain-of-search”) critically evaluated (Zhou et al., 27 Apr 2025).
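The multi-turn “chain-of-search” loop can be sketched abstractly. The control flow below is an illustrative reconstruction, not the evaluated systems' actual implementation; `search` and `llm` are placeholder callables, and the `ANSWER:`/`NEXT:` prompt convention is an assumption made for the sketch:

```python
def chain_of_search(question, search, llm, max_turns=5):
    """Multi-turn retrieval sketch: plan a query, search, reconcile
    the accumulated evidence, then either stop or refine the query.

    search(query) -> list[str]   # a web-search tool stub
    llm(prompt) -> str           # a language-model stub
    """
    evidence = []
    query = llm(f"Plan the first search query for: {question}")
    for _ in range(max_turns):
        evidence.extend(search(query))
        verdict = llm(
            "Question: {q}\nEvidence so far: {e}\n"
            "Reply 'ANSWER: <answer>' if every constraint is satisfied, "
            "else 'NEXT: <refined query>'."
            .format(q=question, e=evidence)
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
        query = verdict.removeprefix("NEXT:").strip()
    return None  # search budget exhausted without a reconciled answer
```

The contrast with single-shot retrieval is the feedback edge: each round's evidence conditions the next query, which is exactly the planning behavior the benchmark rewards.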
4. Benchmark Results and Comparative Performance
BrowseComp-ZH exposes severe performance gaps in current LLM and agentic systems. Over 20 representative models were benchmarked:
| Model | Category | Reasoning? | Browsing? | ACC (%) |
|---|---|---|---|---|
| DeepSeek-V3 | Open | N | N | 8.7 |
| DeepSeek-R1 | Open | Y | N | 23.2 |
| Qwen2.5-72B | Open | N | N | 6.6 |
| O1 | Closed | Y | N | 29.1 |
| Gemini-2.5-Pro | Closed | Y | N | 27.3 |
| GPT-4o | Closed | N | N | 6.2 |
| DeepResearch | AI Search | — | Y | 42.9 |
| Doubao (Deep) | AI Search | — | Y | 26.0 |
| Perplexity | AI Search | — | Y | 22.6 |
Key findings:
- Most standalone LLMs score <10% ACC.
- Only a few closed-source models without browsing break 20% ACC.
- Multi-turn search-enabled systems achieve a maximum of 42.9% ACC (DeepResearch).
- Simple attachment of tool-use pipelines can reduce agent effectiveness when search planning or evidence reconciliation are suboptimal (e.g., DeepSeek-R1 drops from 23.2% to 7.6% with naive browsing integration).
Single-shot search is consistently inadequate for the benchmark's multi-hop structure; multi-step, planned retrieval is necessary to approach competent performance (Zhou et al., 27 Apr 2025).
5. Insights and Technical Implications
BrowseComp-ZH reveals that effective performance on the Chinese web necessitates sophisticated capabilities beyond simple web retrieval:
- Iterative retrieval strategy: Multi-turn, adaptive search trajectories dramatically increase accuracy compared to single-shot or static search.
- Explicit multi-step reasoning: Pipeline designs integrating stepwise logic, planning, and context tracking can increase accuracy by 10–15 percentage points.
- Evidence integration: The ability to aggregate, cross-reference, and reconcile noisy or partially conflicting data (“information reconciliation”) is critical; failure in this area results in substantial performance degradation.
- Localization: Chinese web agents must be adapted for local platforms and handle fragmentation and censorship; naively ported English-centric pipelines perform poorly.
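The information-reconciliation step named above can be illustrated with a toy aggregator. This is a deliberately simplified sketch, not the mechanism any evaluated system uses; the per-source reliability weights and the 2x margin threshold are hypothetical choices:

```python
from collections import defaultdict

def reconcile(evidence):
    """Aggregate candidate answers from multiple sources and keep a
    claim only if its weighted support clearly dominates conflicts.

    evidence: list of (candidate_answer, source_weight) pairs, where
    the weight encodes an assumed per-source reliability prior.
    """
    support = defaultdict(float)
    for answer, weight in evidence:
        support[answer] += weight
    ranked = sorted(support.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    best, runner_up = ranked[0], ranked[1]
    # Require a clear margin over the strongest conflicting claim;
    # otherwise signal that further search is needed.
    return best[0] if best[1] >= 2 * runner_up[1] else None
```

Returning `None` on a narrow margin models the failure case the benchmark exposes: agents that commit to weakly supported claims instead of searching further degrade sharply on conflicting Chinese-web evidence.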
A plausible implication is that advances in query planning, dynamic search adjustment, and robust aggregation modules are essential for further gains in high-difficulty, non-English web QA. Additionally, censorship-aware infrastructure and up-to-date content crawling will be necessary for long-term reliability (Zhou et al., 27 Apr 2025).
6. Resources, Reproducibility, and Ongoing Extensions
BrowseComp-ZH provides full open access to its data, annotation guidelines, evaluation code, and prompt/response templates:
- Repository: https://github.com/PALIN2018/BrowseComp-ZH
- Inclusions: 289 Chinese questions with English glosses, URLs and reasoning traces, grading scripts, and benchmarking protocols.
This resource enables the research community to standardize evaluation, reproduce current results, and extend the benchmark to related domains or more challenging settings. All instructions for instantiating test runs, confidence calibration computation, and annotation validation are explicitly specified (Zhou et al., 27 Apr 2025).
7. Contextualization and Future Research Trajectories
BrowseComp-ZH situates itself within a broader movement toward agentic, tool-using LLM evaluation in linguistically diverse, open-web settings. Its construction directly addresses the most prominent failure modes observed in prior English-dominant studies and marks a shift toward robust localization.
Immediate research avenues include:
- Development of advanced agent architectures for adaptive multi-hop search and post-retrieval inference.
- Expansion to include open-web video (see localization steps in “Video-BrowseComp” (Liang et al., 28 Dec 2025)).
- Incorporation of dialectal and regional Chinese data to further stress-test multi-source, low-resource information integration.
BrowseComp-ZH sets a frontier benchmark for Chinese-language web research agents and establishes a template for rigorous, real-world, multi-hop evaluation in complex information environments.