
BrowseComp-Plus Benchmark Overview

Updated 21 January 2026
  • The BrowseComp-Plus benchmark is a static, curated dataset designed to evaluate deep-research AI agents that integrate retrieval and multi-step reasoning over a fixed corpus.
  • It employs a rigorous two-stage process featuring automated evidence gathering followed by human verification and hard-negative mining to ensure high-quality test cases.
  • The benchmark extends to multimodal scenarios with MM-BrowseComp, highlighting challenges in combining textual and visual evidence for comprehensive evaluation.

BrowseComp-Plus is a fixed-corpus evaluation benchmark developed to advance the fair, transparent, and reproducible assessment of "deep-research" AI agents that combine LLMs with retrieval and reasoning over multiple iteratively issued search queries. Unlike its predecessor BrowseComp, which relies on live, black-box web search APIs, BrowseComp-Plus provides a curated, human-verified set of web documents paired with positive and hard-negative relevance judgments, facilitating controlled experimentation and fine-grained analysis of agent capabilities in open-domain, multi-step research tasks (Chen et al., 8 Aug 2025, Xia et al., 30 Aug 2025). BrowseComp-Plus is further complemented by MM-BrowseComp, which targets multimodal web browsing scenarios and augments the evaluation of deep-research systems to encompass both textual and visual (image/video) reasoning (Li et al., 14 Aug 2025).

1. Origins and Motivation

The BrowseComp-Plus benchmark is motivated by two overarching limitations of prior evaluation frameworks for research agents: a lack of fairness and a lack of transparency. The original BrowseComp benchmark, widely adopted for measuring end-to-end search-augmented LLM pipelines, depends on dynamic web APIs such as Google and Bing. This reliance produces time-dependent, non-reproducible outcomes across teams and makes it impossible to disentangle retrieval failures from reasoning weaknesses, because the evidence source is opaque and ever-changing. These deficiencies block apples-to-apples comparison and impede componentwise optimization.

BrowseComp-Plus addresses these concerns by introducing a static, human-verified 100,195-document corpus, from which all queries and their supporting or confounding documents are drawn. For each of the 830 final benchmark questions (filtered and verified from 1,266 original BrowseComp cases), every positive evidence document and a comprehensive set of difficult negatives are rigorously annotated, enabling controlled IR-style and end-to-end analysis driven by the Cranfield paradigm (Chen et al., 8 Aug 2025).

2. Dataset Construction and Evidence Curation

BrowseComp-Plus employs a two-staged pipeline to ensure high-quality, verifiable supervision and retrieval challenge:

  • Stage A: Evidence Gathering and Verification. For each question–answer pair, automated mining collects candidate (Clue, URL, Evidence) triples using OpenAI o3 with web search enabled. Scraped web documents are parsed and subjected to comprehensive human annotation: annotators highlight justifying spans and confirm that the collected evidence is sufficient to support the final answer. Documents are also labeled as "gold" if they contain the answer directly or implicitly. Only queries with complete, cross-verified evidence pass to the final set.
  • Stage B: Hard-Negative Mining. Each question is decomposed into approximately seven subqueries via GPT-4o. Subqueries are issued to a commercial web API, and the top 100 results are deduplicated, yielding on average 6.1 human-annotated positives, 76.3 hard negatives, and 2.9 gold documents per query (Chen et al., 8 Aug 2025).

The result is a 100,195-document pool representing diverse domains (news, blogs, technical sites) with a median document length of 5,179 words, supporting robust document-level IR benchmarking.
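As an illustration, the Stage B mining loop can be sketched as follows. This is a minimal sketch, not the benchmark's released tooling; the `decompose_into_subqueries` and `web_search` helpers are hypothetical stand-ins for the GPT-4o decomposition step and the commercial search API, and the human annotation step that labels the pool is omitted.

```python
from typing import Callable, Dict, List, Set


def mine_candidate_pool(
    question: str,
    decompose_into_subqueries: Callable[[str, int], List[str]],  # hypothetical LLM call
    web_search: Callable[[str, int], List[Dict]],                # hypothetical search API
    n_subqueries: int = 7,
    top_k: int = 100,
) -> List[Dict]:
    """Collect a deduplicated candidate document pool for one benchmark question.

    High-level mirror of Stage B: decompose the question into subqueries,
    issue each to a search backend, and deduplicate the union of results by URL.
    Human annotators then label the pooled documents as positives, hard negatives,
    or golds.
    """
    seen_urls: Set[str] = set()
    pool: List[Dict] = []
    for subquery in decompose_into_subqueries(question, n_subqueries):
        for hit in web_search(subquery, top_k):
            url = hit["url"]
            if url not in seen_urls:  # deduplicate across subqueries
                seen_urls.add(url)
                pool.append(hit)
    return pool
```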

3. Task Definition and Evaluation Protocols

Task Structure

Questions in BrowseComp-Plus are open-domain, multi-step research prompts that cannot be answered by single retrieval or short multi-hop chains. Each requires iterative decomposition into subproblems, breadth via multiple targeted search calls over the corpus, and synthesis of evidence from several sources. Implicitly, they instantiate hierarchical constraint satisfaction problems (HCSPs), where each answer is underpinned by a research tree of at least 3–6 evidence nodes (Xia et al., 30 Aug 2025).
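A hierarchical research tree of this kind can be pictured with the simplified data structure below. The node fields are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceNode:
    """One constraint/evidence step in a hierarchical research tree."""
    claim: str                     # sub-question or constraint to satisfy
    supporting_doc_ids: List[str]  # corpus documents that resolve this node
    children: List["EvidenceNode"] = field(default_factory=list)

    def size(self) -> int:
        """Total number of evidence nodes in the subtree (typically 3-6+ per query)."""
        return 1 + sum(child.size() for child in self.children)
```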

Evaluation Metrics

BrowseComp-Plus supports both end-to-end and IR-oriented assessment:

  • Accuracy (primary):

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i) \times 100\%

where N = 830 and \hat{y}_i is the predicted answer to query i.

  • Search Calls: the mean number of distinct retrieval invocations per query.
  • Recall (Evidence): the proportion of positive evidence documents successfully retrieved.
  • Citation Metrics: coverage, precision, and recall of answer-attributed document IDs with respect to positive/supporting documents.
  • Calibration Error: the bucketed absolute difference between agent confidence and actual accuracy.
  • Retriever-Only Metrics: standard IR measures (Recall@k, nDCG@k, MRR), computed using evidence and gold document qrels for each query.
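For concreteness, a minimal sketch of how the headline metrics could be computed from per-query predictions, qrels, and ranked retrieval runs. The data layout (dicts keyed by query ID, graded qrels, ranked lists of document IDs) is an assumption for illustration, not the benchmark's official evaluation script.

```python
import math
from typing import Dict, List


def accuracy(preds: Dict[str, str], gold: Dict[str, str]) -> float:
    """Percentage of queries whose predicted answer matches the gold answer."""
    correct = sum(1 for qid, ans in gold.items() if preds.get(qid) == ans)
    return 100.0 * correct / len(gold)


def recall_at_k(run: List[str], qrels: Dict[str, int], k: int) -> float:
    """Share of positively judged documents found in the top-k of one ranked run."""
    positives = {doc for doc, grade in qrels.items() if grade > 0}
    if not positives:
        return 0.0
    return len(positives & set(run[:k])) / len(positives)


def ndcg_at_k(run: List[str], qrels: Dict[str, int], k: int) -> float:
    """Binary-gain nDCG@k against the qrels for one query."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(run[:k])
        if qrels.get(doc, 0) > 0
    )
    n_pos = sum(1 for grade in qrels.values() if grade > 0)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(n_pos, k)))
    return dcg / idcg if idcg > 0 else 0.0
```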

Experimental Setup

Agents interact with the corpus via standardized search tools (k=5 per call, 512-token previews), with prompting protocols enforcing citation and evidence attribution (Chen et al., 8 Aug 2025).
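The agent-facing search tool can be pictured roughly as follows. This is a sketch under stated assumptions: the `retriever` object, the whitespace "tokenization" of previews, and the return format are illustrative, not the benchmark's exact tool protocol.

```python
from typing import Dict, List


def search_tool(query: str, retriever, corpus: Dict[str, str],
                k: int = 5, preview_tokens: int = 512) -> List[Dict]:
    """Return the top-k documents for a query, each truncated to a short preview.

    `retriever.search` is assumed to return a ranked list of document IDs;
    the agent is expected to cite the returned doc_ids in its final answer.
    """
    results = []
    for doc_id in retriever.search(query, top_k=k):
        tokens = corpus[doc_id].split()  # crude whitespace "tokenization"
        results.append({
            "doc_id": doc_id,
            "preview": " ".join(tokens[:preview_tokens]),
        })
    return results
```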

4. Empirical Findings and Benchmarks

BrowseComp-Plus reveals significant performance stratification across agent/retriever pairs:

LLM/Agent       Retriever   Accuracy (%)   Recall (%)   Search Calls
GPT-5           Qwen3-8B    70.12          78.98        21.74
GPT-5           BM25        55.90          61.70        23.23
o3              Qwen3-8B    63.49          73.24        23.97
o3              BM25        49.28          56.64        25.93
GPT-4.1         Qwen3-8B    35.42          36.89        8.67
GPT-4.1         BM25        14.58          16.42        10.35
Qwen3-32B       Qwen3-8B    10.36          7.80         0.94
Search-R1-32B   BM25        3.86           2.61         1.78

Dense, reasoning-oriented retrievers (e.g., Qwen3-8B) dramatically improve both retrieval recall and end-to-end accuracy compared to sparse BM25, especially in conjunction with frontier LLMs (GPT-5, o3). Open-source models (Qwen3-32B, Search-R1-32B) remain far from the state of the art, in many settings issuing too few search calls to leverage the evidence pool (Chen et al., 8 Aug 2025, Xia et al., 30 Aug 2025).

An oracle ablation, in which all positives are supplied up front, demonstrates 93.5% and 83.25% accuracy for GPT-4.1 and Qwen3-32B, respectively. This suggests that the corpus is ultimately sufficient for near-perfect performance, with practical bottlenecks being retrieval precision and integration of evidence.

5. Source of Challenge and Analysis

The benchmark design enforces both breadth and depth of reasoning: each query demands subproblem decomposition and evidence synthesis spanning multiple, sometimes highly diverse, sources. Accuracy–efficiency trade-offs are tight; high-performing models such as GPT-5 achieve their greater accuracy only by issuing roughly 3× as many search calls as more constrained models.

Failure analyses identify retrieval as the dominant failure mode. Even with enhanced retrievers, agent underperformance is often traceable to hierarchical research trees with more than six vertices, where models have difficulty maintaining cross-document consistency or integrating weak signals over long reasoning chains (Xia et al., 30 Aug 2025).

Citation-accuracy metrics show that models paired with strong dense retrievers not only answer more accurately but also cite supporting evidence with higher coverage, precision, and recall, an important property for verifiable research agents.

6. Extensions: MM-BrowseComp and Multimodal Reasoning

MM-BrowseComp extends the textual deep-research paradigm to multimodal web browsing. It comprises 224 hand-crafted research questions spanning 22 subtasks, with 57% of prompts embedding images and all requiring multimodal (text, image, video) evidence for answer derivation (Li et al., 14 Aug 2025).

A key innovation is the inclusion of a human-authored, irreducible reasoning checklist for each question, formalizing the minimal sequence of retrieval and interpretation operations (categorically: text, image, video) required to derive the answer. This enables the following nuanced metrics:

  • Overall Accuracy (OA): answer matches ground truth.
  • Strict Accuracy (SA): both answer and all checklist items correct.
  • Average Checklist Score (AVG CS): mean fraction of checklist items correctly completed.
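A sketch of how these three scores could be computed from per-question judgments. The record fields ("answer_correct", "checklist") are assumed names for illustration, and checklists are assumed to be non-empty.

```python
from typing import Dict, List


def checklist_metrics(results: List[Dict]) -> Dict[str, float]:
    """Compute OA, SA, and AVG CS (all in percent) from per-question records.

    Each record is assumed to hold:
      "answer_correct": bool   - final answer matches ground truth
      "checklist": List[bool]  - per-item correctness of the reasoning checklist
    """
    n = len(results)
    oa = sum(r["answer_correct"] for r in results) / n
    sa = sum(r["answer_correct"] and all(r["checklist"]) for r in results) / n
    avg_cs = sum(sum(r["checklist"]) / len(r["checklist"]) for r in results) / n
    return {"OA": 100 * oa, "SA": 100 * sa, "AVG_CS": 100 * avg_cs}
```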

Evaluation reveals a sharp drop in performance for state-of-the-art text-only and multimodal agents (o3: 29.02% OA; others <20%), with failures frequently linked to visual hallucination, tool execution failure, and logical errors in multimodal evidence chaining. The findings underscore a notable lack of native multimodal reasoning and robust tool orchestration in current models.

7. Impact and Future Research Directions

BrowseComp-Plus, along with MM-BrowseComp, establishes a reproducible, evidence-grounded benchmark suite for the systematic development and assessment of deep-research agents. The benchmarks enable:

  • Disentangled analysis of retrieval and reasoning components, supporting both IR community needs and LLM-based agent research.
  • Controlled experiments in tool integration, citation granularity, calibration, and efficiency.
  • Direct comparisons between closed-source and open-source systems under identical information and tool constraints.

Open research challenges highlighted include co-optimization of retriever/agent training, transferability of tool use protocols, multi-tool orchestration, context-engineering for multi-step retrieval, and improved retrieval/agent synergy for reasoning over hard evidence queries. The large gap between oracle and contemporary performance highlights substantial headroom, particularly in retrieval augmentation and hierarchical reasoning capacity.

Open access to the BrowseComp-Plus corpus, detailed annotation, and baseline results provides a common platform for further research into verifiable, agentic multi-step research and multimodal information-seeking systems (Chen et al., 8 Aug 2025, Xia et al., 30 Aug 2025, Li et al., 14 Aug 2025).
