BrowseComp-ZH: Chinese Web Agent Benchmark
- BrowseComp-ZH is a high-difficulty benchmarking suite that rigorously evaluates LLM agents navigating the Chinese web with tasks requiring multi-hop reasoning and information reconciliation.
- It employs 289 reverse-engineered, multi-step questions across 11 domains to test search planning, evidence aggregation, and contextual inference.
- Evaluation protocols like exact-match accuracy and calibration error highlight performance gaps between closed LLMs and agentic search systems in complex, real-world conditions.
BrowseComp-ZH is a high-difficulty benchmarking suite designed to rigorously evaluate the web-browsing, multi-hop reasoning, and information reconciliation abilities of LLM agents operating within the Chinese-language web ecosystem. Motivated by the pervasive limitations of English-centric evaluation and the unique linguistic, infrastructural, and policy challenges of the Chinese web, BrowseComp-ZH introduces multi-step question answering tasks that require comprehensive search, data integration, and contextual inference over diverse Chinese online media.
1. Motivations and Distinctive Challenges
BrowseComp-ZH arises from key gaps in prior LLM-agent research and benchmarking. Existing benchmarks such as BrowseComp focus on English-language web resources (e.g., Wikipedia, IMDb) and do not model the complexities inherent in the Chinese web. Directly translating such benchmarks is fundamentally insufficient due to:
- Fragmented content landscape: The Chinese web is distributed across platforms such as Baidu Baike, Zhihu, academic portals, and government sites, many of which have unique indexing, search APIs, or proprietary data structures.
- Linguistic barriers: Written Chinese lacks explicit word boundaries and makes frequent use of idiomatic expressions, ellipsis, and highly context-dependent syntax, which impedes straightforward keyword matching and naive prompt translation.
- Censorship and content volatility: Dynamic filtering, frequent URL relocation, and access restrictions can obscure or remove critical information, exacerbating data incompleteness and increasing reasoning demands.
BrowseComp-ZH is specifically constructed to target these challenges, isolating the agentic abilities of LLMs in the context of a native Chinese, multi-source, open-web environment (Zhou et al., 27 Apr 2025).
2. Dataset Construction Methodology
2.1 Question Design
BrowseComp-ZH comprises 289 validated questions spanning 11 domains—Film & TV, Art, Geography, Music, History, Medicine, Technology, Sports, Policy & Law, Video Games, and Academic Research. Each question is engineered to require at least two orthogonal reasoning constraints (multi-hop) so that no answer is directly recoverable via a single keyword search.
2.2 Reverse-Engineering Protocol
Questions are generated by reverse-engineering from concise, objectively verifiable answers (dates, numbers, named entities). Annotators construct questions that can only be fulfilled by satisfying composite constraints. For example, a question may require first identifying an institution meeting a categorical filter, then verifying temporal information only available from another independent source.
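The reverse-engineering protocol can be sketched as a small data model. This is an illustrative reconstruction, not the benchmark's actual tooling; the class and field names (`Constraint`, `ReverseEngineeredQuestion`, `source`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    """One orthogonal condition the final answer must satisfy."""
    description: str   # e.g. "institution matching a categorical filter"
    source: str        # the independent source that verifies this condition

@dataclass
class ReverseEngineeredQuestion:
    """A question built backwards from a concise, verifiable answer."""
    answer: str                              # date, number, or named entity
    constraints: list[Constraint] = field(default_factory=list)

    def is_multi_hop(self) -> bool:
        # BrowseComp-ZH requires at least two orthogonal constraints,
        # each verifiable only through a distinct independent source.
        sources = {c.source for c in self.constraints}
        return len(self.constraints) >= 2 and len(sources) >= 2

# Example mirroring the text: first identify an institution via a
# categorical filter, then verify a date from a separate source.
q = ReverseEngineeredQuestion(
    answer="1987",
    constraints=[
        Constraint("university matching a categorical filter", "Baidu Baike"),
        Constraint("founding year stated in an archived gazette", "government portal"),
    ],
)
assert q.is_multi_hop()
```

A question whose constraints all resolve through a single source would fail `is_multi_hop()`, reflecting the requirement that no answer be recoverable via one keyword search.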
2.3 Quality Control
Quality assurance is two-staged:
- Stage 1: Difficulty Check—Questions whose answers appear on the first page of any mainstream search engine (Baidu, Bing, Google) are removed.
- Stage 2: Uniqueness Check—State-of-the-art AI agents and human annotators confirm answer singularity by exhaustive search. Any item with multiple possible correct answers is excluded.
The resulting benchmark emphasizes both high difficulty and answer uniqueness, setting it apart from typical web QA datasets (Zhou et al., 27 Apr 2025).
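The two-stage filter can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: `first_page_results` stands in for pre-collected page-one snippets from each engine, and `candidate_answers` for the answer set produced by exhaustive agent-plus-human search; neither name comes from the benchmark's released code:

```python
def passes_quality_control(question, reference_answer,
                           first_page_results, candidate_answers):
    """Two-stage filter mirroring the benchmark's protocol.

    first_page_results: dict mapping engine name (Baidu, Bing, Google)
        to the list of first-page result snippets for the question.
    candidate_answers: all answers judged consistent with the
        question's constraints after exhaustive search.
    """
    # Stage 1 (difficulty check): drop questions whose answer surfaces
    # on the first results page of any mainstream engine.
    for engine, snippets in first_page_results.items():
        if any(reference_answer in s for s in snippets):
            return False
    # Stage 2 (uniqueness check): drop questions admitting more than
    # one defensible answer.
    return set(candidate_answers) == {reference_answer}
```

Only questions that are both hard to retrieve directly and uniquely answerable survive, which is what distinguishes the benchmark from typical web QA datasets.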
3. Evaluation Protocols and Metrics
Central evaluation measures are:
- Exact-match accuracy (ACC):

  $$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right]$$

  the proportion of the $N$ queries for which an agent's answer $\hat{y}_i$ matches the reference $y_i$ exactly (no partial credit).
- Expected Calibration Error (ECE):

  $$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$

  where $|B_m|$ is the sample count in confidence bin $B_m$, and $\mathrm{acc}(B_m)$, $\mathrm{conf}(B_m)$ are the empirical accuracy and mean confidence of each bin.
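Both metrics are straightforward to compute; a standard implementation (using equal-width confidence bins, a common convention that the paper may or may not follow exactly) looks like:

```python
import numpy as np

def exact_match_accuracy(preds, refs):
    """ACC: fraction of predictions matching the reference exactly."""
    return float(np.mean([p == r for p, r in zip(preds, refs)]))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)|
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0   # include zero-confidence samples
        if in_bin.any():
            acc_b = correct[in_bin].mean()     # empirical accuracy in bin
            conf_b = confidences[in_bin].mean()  # mean confidence in bin
            ece += (in_bin.sum() / n) * abs(acc_b - conf_b)
    return ece
```

A model that is always 90% confident but right only half the time incurs an ECE of 0.4, which is the kind of overconfidence gap the benchmark surfaces.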
Agents fall into two paradigms:
- Closed LLMs (using only parametric memory inference)
- Agentic search systems (or “AI-search products”) with explicit browsing and tool-use capacity, typically operated through multi-turn search and parsing of result interfaces.
Retrieval strategies are further distinguished by single-shot versus multi-turn iterative search, with planning and evidence reconciliation (“chain-of-search”) critically evaluated (Zhou et al., 27 Apr 2025).
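The multi-turn “chain-of-search” loop can be sketched abstractly. The control flow below is an illustrative reconstruction, not the evaluated systems' actual implementation; `search` and `llm` are placeholder callables, and the `ANSWER:`/`NEXT:` prompt convention is an assumption made for the sketch:

```python
def chain_of_search(question, search, llm, max_turns=5):
    """Multi-turn retrieval sketch: plan a query, search, reconcile
    the accumulated evidence, then either stop or refine the query.

    search(query) -> list[str]   # a web-search tool stub
    llm(prompt) -> str           # a language-model stub
    """
    evidence = []
    query = llm(f"Plan the first search query for: {question}")
    for _ in range(max_turns):
        evidence.extend(search(query))
        verdict = llm(
            "Question: {q}\nEvidence so far: {e}\n"
            "Reply 'ANSWER: <answer>' if every constraint is satisfied, "
            "else 'NEXT: <refined query>'."
            .format(q=question, e=evidence)
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
        query = verdict.removeprefix("NEXT:").strip()
    return None  # search budget exhausted without a reconciled answer
```

The contrast with single-shot retrieval is the feedback edge: each round's evidence conditions the next query, which is exactly the planning behavior the benchmark rewards.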
4. Benchmark Results and Comparative Performance
BrowseComp-ZH exposes severe performance gaps in current LLM and agentic systems. Over 20 representative models were benchmarked:
| Model | Category | Reasoning? | Browsing? | ACC (%) |
|---|---|---|---|---|
| DeepSeek-V3 | Open | N | N | 8.7 |
| DeepSeek-R1 | Open | Y | N | 23.2 |
| Qwen2.5-72B | Open | N | N | 6.6 |
| O1 | Closed | Y | N | 29.1 |
| Gemini-2.5-Pro | Closed | Y | N | 27.3 |
| GPT-4o | Closed | N | N | 6.2 |
| DeepResearch | AI Search | — | Y | 42.9 |
| Doubao (Deep) | AI Search | — | Y | 26.0 |
| Perplexity | AI Search | — | Y | 22.6 |
Key findings:
- Most standalone LLMs score <10% ACC.
- Only a few closed-source models without browsing break 20% ACC.
- Multi-turn search-enabled systems achieve a maximum of 42.9% ACC (DeepResearch).
- Simple attachment of tool-use pipelines can reduce agent effectiveness when search planning or evidence reconciliation are suboptimal (e.g., DeepSeek-R1 drops from 23.2% to 7.6% with naive browsing integration).
Single-shot search is consistently inadequate for the benchmark's multi-hop structure; multi-step, planned retrieval is necessary to approach competent performance (Zhou et al., 27 Apr 2025).
5. Insights and Technical Implications
BrowseComp-ZH reveals that effective performance on the Chinese web necessitates sophisticated capabilities beyond simple web retrieval:
- Iterative retrieval strategy: Multi-turn, adaptive search trajectories dramatically increase accuracy compared to single-shot or static search.
- Explicit multi-step reasoning: Pipeline designs integrating stepwise logic, planning, and context tracking can increase accuracy by 10–15 percentage points.
- Evidence integration: The ability to aggregate, cross-reference, and reconcile noisy or partially conflicting data (“information reconciliation”) is critical; failure in this area results in substantial performance degradation.
- Localization: Chinese web agents must be adapted for local platforms and handle fragmentation and censorship; naively ported English-centric pipelines perform poorly.
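The information-reconciliation step named above can be illustrated with a toy aggregator. This is a deliberately simplified sketch, not the mechanism any evaluated system uses; the per-source reliability weights and the 2x margin threshold are hypothetical choices:

```python
from collections import defaultdict

def reconcile(evidence):
    """Aggregate candidate answers from multiple sources and keep a
    claim only if its weighted support clearly dominates conflicts.

    evidence: list of (candidate_answer, source_weight) pairs, where
    the weight encodes an assumed per-source reliability prior.
    """
    support = defaultdict(float)
    for answer, weight in evidence:
        support[answer] += weight
    ranked = sorted(support.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    best, runner_up = ranked[0], ranked[1]
    # Require a clear margin over the strongest conflicting claim;
    # otherwise signal that further search is needed.
    return best[0] if best[1] >= 2 * runner_up[1] else None
```

Returning `None` on a narrow margin models the failure case the benchmark exposes: agents that commit to weakly supported claims instead of searching further degrade sharply on conflicting Chinese-web evidence.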
A plausible implication is that advances in query planning, dynamic search adjustment, and robust aggregation modules are essential for further gains in high-difficulty, non-English web QA. Additionally, censorship-aware infrastructure and up-to-date content crawling will be necessary for long-term reliability (Zhou et al., 27 Apr 2025).
6. Resources, Reproducibility, and Ongoing Extensions
BrowseComp-ZH provides full open access to its data, annotation guidelines, evaluation code, and prompt/response templates:
- Repository: https://github.com/PALIN2018/BrowseComp-ZH
- Inclusions: 289 Chinese questions with English glosses, URLs and reasoning traces, grading scripts, and benchmarking protocols.
This resource enables the research community to standardize evaluation, reproduce current results, and extend the benchmark to related domains or more challenging settings. All instructions for instantiating test runs, confidence calibration computation, and annotation validation are explicitly specified (Zhou et al., 27 Apr 2025).
7. Contextualization and Future Research Trajectories
BrowseComp-ZH situates itself within a broader movement toward agentic, tool-using LLM evaluation in linguistically diverse, open-web settings. Its construction directly addresses the most prominent failure modes observed in prior English-dominant studies and marks a shift toward robust localization.
Immediate research avenues include:
- Development of advanced agent architectures for adaptive multi-hop search and post-retrieval inference.
- Expansion to include open-web video (see localization steps in “Video-BrowseComp” (Liang et al., 28 Dec 2025)).
- Incorporation of dialectal and regional Chinese data to further stress-test multi-source, low-resource information integration.
BrowseComp-ZH sets a frontier benchmark for Chinese-language web research agents and establishes a template for rigorous, real-world, multi-hop evaluation in complex information environments.