BrowseComp-VL: Deep-Research Agent Benchmark

Updated 13 August 2025
  • BrowseComp-VL is a benchmark that evaluates deep-research agents (systems that integrate large language models with retrieval tools) on a controlled corpus.
  • It employs a transparent evaluation method with explicit gold evidence and hard negatives to isolate multi-hop reasoning and retrieval precision.
  • Metrics like accuracy, recall, and calibration error enable a detailed assessment of retrieval effectiveness and evidence citation in agent responses.

BrowseComp-VL denotes the evaluation of deep-research agents—systems that integrate LLMs with retrieval tools—on the BrowseComp(-Plus) benchmark, a rigorously designed suite for measuring persistent, multi-hop reasoning and web information synthesis. The evolution from BrowseComp (Wei et al., 16 Apr 2025) to BrowseComp-Plus (Chen et al., 8 Aug 2025) reflects a transition from live, black-box web environments to fully controlled, transparent, and reproducible evaluation protocols. These benchmarks have become central for disentangling and quantifying the capabilities of agents in deep-research scenarios, where iterative search planning, retrieval, and multi-step reasoning are required for challenging queries.

1. Benchmark Construction and Objectives

BrowseComp-Plus was specifically engineered to address critical limitations inherent to its predecessor, BrowseComp. Previously, evaluation depended on volatile live web APIs, introducing fairness and reproducibility issues. BrowseComp-Plus substitutes the dynamic web with a fixed, human-verified corpus exceeding 100,000 documents. Each question in the benchmark is paired with explicit “gold” evidence documents (containing or supporting the answer) and curated “hard negative” distractors mined from challenging query variations. This structure allows controlled experimentation that isolates the contribution of the retrieval module, supporting a granular assessment of both retrieval precision and agent reasoning.

The benchmark questions themselves derive from a data-driven “fact inversion” process: trainers begin from a known fact and craft queries that require multi-hop internet navigation and synthesis. This design focuses explicitly on persistent browsing and creative search, rather than ambiguous natural language understanding or conversational breadth. The answer format remains short and easily verifiable, but the path to uncover that answer demands tool use and strategic planning comparable to programming competitions for coding agents.
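The pairing of each question with gold evidence and hard negatives lends itself to a simple record structure. The sketch below is a minimal illustration in Python with hypothetical field names and invented contents; the official BrowseComp-Plus release may use a different schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BrowseCompPlusItem:
    """One benchmark question with its labeled evidence (hypothetical schema)."""
    question: str                 # multi-hop query produced by fact inversion
    answer: str                   # short, easily verifiable gold answer
    gold_doc_ids: List[str]       # documents that contain or support the answer
    hard_negative_ids: List[str] = field(default_factory=list)  # curated distractors

# Example item (all contents invented purely for illustration)
item = BrowseCompPlusItem(
    question="Which researcher who interned at lab X later founded company Y?",
    answer="Jane Doe",
    gold_doc_ids=["doc_012345", "doc_067890"],
    hard_negative_ids=["doc_111111", "doc_222222"],
)
```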

2. Evaluation Methodology and Metrics

Evaluation in BrowseComp-Plus involves a multi-layered metric framework:

  • Accuracy: Proportion of questions answered correctly, computed simply as Accuracy = (Correct / Total) × 100%.
  • Recall: Percentage of human-verified evidence documents retrieved during the agent’s search process.
  • Search Calls: Average number of retrieval invocations per query, which reflects search efficiency and strategic effectiveness.
  • Calibration Error: Quantifies the discrepancy between predicted confidence scores and observed accuracy, probing an agent’s self-assessment reliability.
  • Retrieval Effectiveness: Metrics such as Recall@k and nDCG@10 compare returned document lists against ground-truth evidence using Cranfield-style evaluation (see the sketch after this list).

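The metrics above reduce to straightforward computations over per-question records. The following is a minimal sketch, assuming hypothetical record fields (is_correct, confidence, gold_doc_ids, retrieved_doc_ids); the official harness may bin or aggregate differently.

```python
import math
from typing import Dict, List

def accuracy(records: List[Dict]) -> float:
    """Accuracy = (correct / total) x 100%."""
    correct = sum(r["is_correct"] for r in records)
    return 100.0 * correct / len(records)

def evidence_recall(records: List[Dict]) -> float:
    """Fraction of gold evidence documents surfaced during search, averaged over questions."""
    per_question = []
    for r in records:
        gold = set(r["gold_doc_ids"])
        per_question.append(len(gold & set(r["retrieved_doc_ids"])) / len(gold))
    return 100.0 * sum(per_question) / len(per_question)

def expected_calibration_error(records: List[Dict], n_bins: int = 10) -> float:
    """Gap between stated confidence and observed accuracy, averaged over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece, total = 0.0, len(records)
    for b in bins:
        if not b:
            continue
        mean_conf = sum(r["confidence"] for r in b) / len(b)
        mean_acc = sum(r["is_correct"] for r in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - mean_acc)
    return ece

def ndcg_at_k(ranked_doc_ids: List[str], gold_doc_ids: List[str], k: int = 10) -> float:
    """nDCG@k with binary relevance against the gold evidence set."""
    gold = set(gold_doc_ids)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked_doc_ids[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal > 0 else 0.0
```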
The fixed corpus and explicit evidence labels allow for end-to-end assessment and for diagnostic runs where retrieval quality is decoupled from reasoning. Aggregation strategies such as best-of-N sampling and voting schemes are used in high-compute settings, with accuracy shown to scale smoothly as parallelization increases.
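One aggregation scheme consistent with this description is majority voting over N independently sampled runs. A minimal sketch, assuming answers arrive as strings that can be normalized by simple lowercasing:

```python
from collections import Counter
from typing import List

def best_of_n_vote(sampled_answers: List[str]) -> str:
    """Return the most frequent answer across N parallel agent runs."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# e.g., five parallel runs of the same query
print(best_of_n_vote(["Jane Doe", "jane doe", "John Smith", "Jane Doe", "Jane Doe"]))
```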

3. Fairness and Transparency Enhancements

BrowseComp-Plus is distinguished by its emphasis on fairness and transparency. By fixing the evaluation corpus, researchers eliminate confounding factors due to changing web content and opaque external APIs. All evidence and negatives per query are public, and every agent’s retrieval chain is reproducibly traceable. This setup enables methodologically rigorous comparisons—between open-source models like Search-R1 and Qwen3-32B, dense retrievers (Qwen3-Embedding-8B), and closed-source flagship agents such as GPT-5—where differences in accuracy, citation precision, and retrieval costs are attributable to system design rather than environment variance.

The benchmark’s transparency supports granular error analysis: an agent’s failures can be traced directly to retrieval misses, reasoning mistakes, or tool mismanagement, and oracle ablations provide upper bounds where all gold evidence is furnished directly.
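The oracle ablation mentioned above can be sketched as a decoupled run in which the retriever is bypassed and the gold documents are handed to the agent directly. Here `run_agent` and the record fields are hypothetical placeholders, not part of the benchmark's actual harness.

```python
from typing import Callable, Dict, List

def oracle_ablation(
    items: List[Dict],
    corpus: Dict[str, str],
    run_agent: Callable[[str, List[str]], str],  # hypothetical: (question, context docs) -> answer
) -> float:
    """Upper-bound accuracy when every gold evidence document is supplied directly."""
    correct = 0
    for item in items:
        gold_texts = [corpus[doc_id] for doc_id in item["gold_doc_ids"]]
        predicted = run_agent(item["question"], gold_texts)
        correct += int(predicted.strip().lower() == item["answer"].strip().lower())
    return 100.0 * correct / len(items)
```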

4. Retrievers, Deep-Research Agents, and Performance Scaling

Results documented across BrowseComp-Plus experiments consistently demonstrate that retrieval quality is a limiting factor for deep-research agents. BM25 retrievers yield lower accuracy (e.g., 3.86% for Search-R1 and 55.9% for GPT-5); advanced dense retrievers such as Qwen3-Embedding-8B raise GPT-5’s accuracy to 70.1% and also reduce the number of search calls required. Similar scaling is observed for recall and nDCG metrics—dense retrievers outperform BM25 variants in surfacing necessary gold evidence.
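A rough way to reproduce this kind of comparison on a fixed corpus is to score each retriever by how much gold evidence it surfaces at a fixed cutoff. The sketch below uses the rank_bm25 package and a small sentence-transformers encoder as stand-ins; the reported results come from stronger retrievers such as Qwen3-Embedding-8B, and the helper names here are assumptions.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def bm25_top_k(corpus_ids, corpus_texts, query, k=10):
    """Lexical retrieval baseline over the fixed corpus."""
    bm25 = BM25Okapi([t.lower().split() for t in corpus_texts])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus_ids)), key=lambda i: scores[i], reverse=True)
    return [corpus_ids[i] for i in ranked[:k]]

def dense_top_k(corpus_ids, corpus_texts, query, k=10,
                model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Dense retrieval; a small encoder is used here purely for illustration."""
    model = SentenceTransformer(model_name)
    doc_emb = model.encode(corpus_texts, convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    return [corpus_ids[h["corpus_id"]] for h in hits]

def recall_at_k(retrieved_ids, gold_ids):
    """Share of gold evidence documents appearing in the top-k results."""
    gold = set(gold_ids)
    return len(gold & set(retrieved_ids)) / len(gold)
```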

Citation accuracy—measured as the fraction of evidence documents surfaced and correctly linked in agent responses—increases markedly with better retrieval. Context engineering improvements, such as full-document retrieval tools (as opposed to preview-only access), further enhance answer quality and factual correctness, as demonstrated in controlled agent ablations.
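The context-engineering difference described here can be thought of as exposing two different tools to the agent: a preview-only search tool versus an additional tool that returns the full document body. A minimal sketch with hypothetical tool names and fields, not the benchmark's actual API:

```python
from typing import Dict, List

SNIPPET_CHARS = 300  # preview length for the snippet-only tool

def search_tool(query: str, index, k: int = 5) -> List[Dict]:
    """Preview-only access: the agent sees titles and short snippets (index is a hypothetical object)."""
    hits = index.search(query, k)
    return [{"doc_id": h.doc_id, "title": h.title, "snippet": h.text[:SNIPPET_CHARS]} for h in hits]

def open_document_tool(doc_id: str, corpus: Dict[str, str]) -> str:
    """Full-document access: the agent can read the entire page before answering and citing."""
    return corpus[doc_id]
```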

Scaling experiments show that performance increases smoothly when agents are allowed more search calls and parallel output aggregation. The upper bound (oracle) experiments indicate that, given perfect retrieval, answer accuracy is dramatically higher, confirming that substantial “headroom” for future improvement exists in retrieval and evidence integration.

5. Practical Implications and Applications

BrowseComp-Plus, through its disentangled evaluation approach, yields actionable insights for the development of deep-research agents:

  • Retrieval-Reasoning Co-Optimization: Results suggest that optimal deep-research systems require joint design of retrieval mechanisms and reasoning modules. Improvements in one component directly impact overall system performance and efficiency.
  • Citation and Evidence Integrity: The benchmark’s explicit tracking of evidence documents incentivizes agent architectures that not only produce correct answers but also surface trustworthy and complete supporting documentation.
  • Context Engineering: Agents that leverage sophisticated context ingestion (e.g., reading full documents rather than snippets) achieve higher accuracy but require careful trade-off analysis regarding compute and cost.
  • Research Workflows: The reproducibility and interpretability of BrowseComp-Plus position it as an ideal platform for comparative studies, federated retrieval experiments, tool-use generalization, and adaptive retrieval strategies in multi-turn research agent scenarios.

6. Limitations, Future Directions, and Potential Expansions

A plausible implication is that current deep-research agents are still bottlenecked by retriever capability and context handling. The gap between baseline BM25 and advanced dense retrievers, coupled with results from oracle retrieval, emphasizes that substantial gains are possible via improved retrieval quality and evidence selection.

Suggested future directions include:

  • Joint Retriever-Agent Optimization: Developing retrieval algorithms that are explicitly tuned to maximize downstream reasoning effectiveness, rather than just isolated ranking accuracy.
  • Generalization Across Retrieval Engines: Conducting cross-domain generalization tests, such as training on one retriever but deploying with another (e.g., BM25 vs. neural retrievers), to evaluate robustness.
  • Cost-Efficiency Analysis: Incorporating evaluation protocols that account for the computational and financial cost of agent queries, balancing accuracy against resource constraints.
  • Expanding Corpus Modalities: The benchmark may expand beyond text documents to cover multimodal (image, video, interactive) web content, deepening the test for general agent capabilities.
  • Federated and Multi-Tool Retrieval: Exploring multi-layered retrieval strategies that mimic commercial search engine architectures, with agents orchestrating queries across heterogeneous sources.

7. Comparative Analysis and Impact

BrowseComp-Plus has set a new standard for quantitatively and qualitatively fair evaluation of deep-research agents undertaking complex web queries. The move to a static, human-verified corpus and explicit gold-standard evidence allows for meaningful comparisons and error diagnosis. The benchmark’s granular metrics, reproducibility, and controlled environment foster insight into retrieval effectiveness, citation precision, and reasoning capacity, underlining its significance for both applied research and development in information-seeking AI.

This methodology and its ecosystem collectively facilitate comprehensive understanding and advancement in the design of agents capable of persistent, strategic, and efficient web browsing. The empirical results—such as the jump from 55.9% to 70.12% accuracy for GPT-5 in end-to-end settings when switching from BM25 to Qwen3-Embedding-8B retrievers—direct research attention to the crucial effects of retrieval choice, toolchain management, and context engineering. BrowseComp-Plus will continue to serve as an authoritative foundation for future improvements in deep-research agent architectures and evaluation protocols (Wei et al., 16 Apr 2025, Chen et al., 8 Aug 2025).
