Benchmarking LLMs for Chinese Web Browsing: Insights from the BrowseComp-ZH Paper
The paper "BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese" introduces a benchmark designed to assess the web browsing capabilities of LLMs in the Chinese linguistic context. As LLMs evolve from static knowledge repositories into agents that use external tools, evaluating their ability to navigate, retrieve, and synthesize information on non-English web platforms becomes imperative.
Challenges in Chinese Web Evaluation
The paper begins by highlighting the complexities inherent in evaluating LLMs on the Chinese web. Whereas English-centric benchmarks can rely on well-indexed platforms such as Wikipedia and IMDb, the Chinese web is fragmented and heterogeneous, and carries its own linguistic nuances: inconsistent naming conventions, implicit referents, idiomatic expressions, and information scattered across a vast array of platforms such as Baidu, Zhihu, and government portals. Hence, merely translating English benchmarks is insufficient for assessing LLMs in the Chinese context.
The Development of BrowseComp-ZH
To address these challenges, the authors developed BrowseComp-ZH, a benchmark of 289 meticulously crafted multi-hop questions spanning 11 domains, including technology, law, and medicine. Each question is built in reverse: the authors start from a definitive, verifiable answer and construct a query whose resolution requires multi-step reasoning and retrieval. A rigorous two-stage quality-control process ensures high difficulty and answer uniqueness, establishing a stringent test for LLMs operating in the Chinese information landscape.
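The reverse-engineered construction described above can be sketched in miniature. The sketch below is an illustration, not the paper's actual pipeline: the item fields, the toy knowledge base, and the attribute-matching check are all assumptions standing in for the authors' human and search-based verification. The key idea it demonstrates is the uniqueness requirement: the clues combined into a question must pick out exactly one answer.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One reverse-engineered item: a fixed answer plus the clues
    the final question will combine (field names are hypothetical)."""
    answer: str
    clues: dict[str, str]  # attribute -> required value
    domain: str

def matching_entities(clues: dict[str, str],
                      knowledge: dict[str, dict[str, str]]) -> list[str]:
    """Return every entity in the toy knowledge base consistent with all clues."""
    return [name for name, attrs in knowledge.items()
            if all(attrs.get(k) == v for k, v in clues.items())]

def is_unique(item: BenchmarkItem,
              knowledge: dict[str, dict[str, str]]) -> bool:
    """Quality-control check: the clues must identify exactly the intended answer."""
    return matching_entities(item.clues, knowledge) == [item.answer]

# Toy knowledge base standing in for the open web.
kb = {
    "EntityA": {"founded": "1998", "field": "technology"},
    "EntityB": {"founded": "1998", "field": "law"},
}

item = BenchmarkItem(answer="EntityB",
                     clues={"founded": "1998", "field": "law"},
                     domain="law")
print(is_unique(item, kb))  # True: the clue set rules out EntityA
```

A single clue ("founded in 1998") would match both entities, so the item would fail the check; combining clues until only the intended answer survives mirrors why the benchmark's questions demand multi-hop retrieval.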
Benchmark Results and Analysis
The paper benchmarks over 20 state-of-the-art LLMs and search systems on BrowseComp-ZH. Despite the advanced conversational abilities of these models, they largely struggle with the benchmark: OpenAI's DeepResearch posts the highest accuracy at 42.9%, while many others fall below 10%. These results underscore the demanding nature of the benchmark, which tests not only retrieval strategies but also the models' reasoning and information-reconciliation capacities.
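For orientation, the headline numbers boil down to a per-item accuracy over the 289 questions. The snippet below is a minimal stand-in, assuming normalized exact-match scoring; benchmarks of this kind often use a model-based judge instead, so this is an illustration of the metric, not the paper's grading protocol.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of benchmark items answered correctly, using
    normalized exact match (an assumed stand-in for the real grader)."""
    def norm(s: str) -> str:
        # Lowercase and collapse whitespace before comparing.
        return " ".join(s.lower().split())
    correct = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

print(accuracy(["Beijing", " beijing "], ["Beijing", "Shanghai"]))  # 0.5
```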
Implications and Future Directions
The findings from BrowseComp-ZH have significant practical and theoretical implications. Practically, they highlight the necessity for native benchmarks in non-English contexts, providing a foundation for evaluating LLMs globally. Theoretically, the challenges faced by current LLMs in effectively integrating retrieved information with internal knowledge suggest a pivotal area for research and development.
Further advancements could focus on enhancing LLM reasoning capabilities and their ability to align retrieved data with existing representations. Future research may explore iterative retrieval strategies and post-retrieval reasoning mechanisms to improve browsing accuracy. Additionally, expanding the dataset and diversifying question types could lead to more comprehensive evaluations.
Conclusion
BrowseComp-ZH sets a high bar for LLM browsing capability within the Chinese web environment, offering critical insights into current limitations and areas for improvement. The benchmark serves as an essential tool for driving the future development of LLMs, ensuring they are equipped to handle the intricate demands of diverse information ecosystems. As AI continues to advance, robust, localized benchmarks will remain integral to understanding and enhancing LLM capabilities across global contexts.