BrowseComp Task Benchmark

Updated 17 August 2025
  • BrowseComp is a benchmark that rigorously evaluates web browsing agents on persistence, creativity, and strategic fact synthesis.
  • It employs multi-step, reverse-designed questions that resist one-page search results, requiring sustained navigation and information integration.
  • The benchmark features automated semantic grading and compute scaling metrics, offering precise insights into agentic performance.

BrowseComp is a rigorous evaluation benchmark designed to measure the persistence, creativity, and strategic reasoning abilities of web browsing agents. Unlike traditional web search or QA datasets, BrowseComp compels agents to navigate the open web and synthesize hard-to-find, entangled facts through deep, multi-step browsing episodes. Its design establishes a high bar for agentic research, serving as a practical proxy for internet-scale information seeking and agentic tool use.

1. Benchmark Construction and Design Principles

BrowseComp consists of 1,266 questions, each paired with a short, uniquely verifiable reference answer. Unlike surface-level lookup tasks, every question is reverse-designed: starting from a known fact, annotators pose a query that ensures the answer is not accessible via a single Google search (first-page result checks are performed) and cannot be solved by an average human in under ten minutes. Three quality control steps are strictly applied:

  • Verification that state-of-the-art models (such as GPT-4o, GPT-4.5, and early agentic systems) cannot answer via direct query.
  • Explicit search engine validation to confirm absence from the first page of results.
  • Human trial to ensure question intractability under ten minutes for an experienced annotator.

Each question thus requires not only information retrieval but also persistent navigation and integration of hard-to-access web content, making it a test of genuine agentic browsing capability.

The answer format is deliberately simple (short free text), which enables unambiguous, automatable grading via a grader prompt that checks the predicted and reference strings for semantic equivalence. This distinguishes BrowseComp from benchmarks with laborious or subjective answer validation pipelines.

2. Targeted Agent Capabilities and Competency Dimensions

BrowseComp was engineered to evaluate several critical dimensions for web-based agents:

  • Persistence: Agents must execute long, multi-stage search plans, sustaining memory and focus across prolonged information foraging.
  • Creativity: Standard heuristics or brute-force query permutations are ineffective; agents must reformulate queries, combine partial evidence, and adaptively branch their navigation strategies.
  • Factual Reasoning: Many questions embed entity coupling or require recognition and reconciliation of conflicting cross-page data. Correct answers depend on the agent’s capacity to check, synthesize, and verify composite information from distributed web fragments.
  • Strategic Compute Use: The benchmark is sensitive to test-time compute scaling. The architecture allows measurement of how increasing the number of browsing attempts, or using more effective aggregation strategies (e.g., Best-of-N, Majority Voting), translates into improved performance.

Together, these capabilities set BrowseComp apart from benchmarks that reward single-shot factual recall, marking it as a challenge for agents designed to exhibit sustained, tool-mediated, internet-scale search intelligence.

3. Evaluation Protocols and Metrics

Answer evaluation in BrowseComp is streamlined. For a given question, an agent returns its predicted answer $A_\text{pred}$, which is compared to the gold reference answer $A_\text{ref}$:

$$\text{Accuracy} = \frac{\text{Number of correct matches (semantic equivalence)}}{\text{Total number of questions}}$$

A grader prompt, similar to that employed in Humanity’s Last Exam, is used to compare the predicted and gold strings, encoding non-trivial equivalence (e.g., handling formatting differences, synonyms).
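
To make the grading protocol concrete, the following is a minimal Python sketch of semantic-equivalence grading and accuracy computation. The prompt wording, the `judge` callable, and the record format are illustrative assumptions, not the exact grader used in the paper or in openai/simple-evals.

```python
from typing import Callable

# Minimal sketch of BrowseComp-style semantic grading and accuracy computation.
# The grader prompt wording and helper names are illustrative, not the exact
# implementation from the paper or from openai/simple-evals.

GRADER_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Predicted answer: {prediction}\n"
    "Are the two answers semantically equivalent (ignoring formatting, "
    "synonyms, and trivial wording differences)? Answer 'yes' or 'no'."
)

def grade(question: str, prediction: str, reference: str,
          judge: Callable[[str], str]) -> bool:
    """Return True if an LLM judge deems the prediction equivalent to the reference."""
    verdict = judge(GRADER_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return verdict.strip().lower().startswith("yes")

def accuracy(examples: list[dict], predictions: dict[str, str],
             judge: Callable[[str], str]) -> float:
    """Accuracy = (# semantically correct predictions) / (total questions)."""
    correct = sum(
        grade(ex["question"], predictions[ex["id"]], ex["answer"], judge)
        for ex in examples
    )
    return correct / len(examples)
```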

The benchmark supports parallel-trial evaluation, in which multiple browsing attempts are run and a final answer is selected via sampling or aggregation techniques. Plots in the paper illustrate accuracy as a function of test-time compute, providing clear empirical scaling curves.
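
As a sketch of how such aggregation can work, the snippet below implements Best-of-N, majority voting, and confidence-weighted voting over parallel attempts. The attempt format (an answer string paired with a self-reported confidence) is an assumption for illustration and may differ from the paper's exact protocol.

```python
from collections import Counter, defaultdict

# Sketch of answer aggregation over N parallel browsing attempts.
# Each attempt is assumed (for illustration) to return an answer string and a
# self-reported confidence in [0, 1].

def majority_vote(attempts: list[tuple[str, float]]) -> str:
    """Pick the most frequent answer across attempts (ties broken arbitrarily)."""
    counts = Counter(answer for answer, _ in attempts)
    return counts.most_common(1)[0][0]

def weighted_vote(attempts: list[tuple[str, float]]) -> str:
    """Pick the answer with the highest total self-reported confidence."""
    scores: dict[str, float] = defaultdict(float)
    for answer, confidence in attempts:
        scores[answer] += confidence
    return max(scores, key=scores.get)

def best_of_n(attempts: list[tuple[str, float]]) -> str:
    """Pick the single attempt with the highest self-reported confidence."""
    return max(attempts, key=lambda a: a[1])[0]
```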

Additional assessment metrics include calibration error, e.g., Expected Calibration Error (ECE), alongside breakdown tables comparing agent classes.
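
For reference, a minimal Expected Calibration Error computation is sketched below, assuming each prediction carries a confidence in [0, 1] and a binary correctness label from the grader; the equal-width ten-bin scheme is a common convention, not necessarily the one used in the paper.

```python
# Minimal Expected Calibration Error (ECE) sketch over equal-width confidence bins.
# Inputs: per-question confidence in [0, 1] and binary correctness from the grader.

def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin fraction) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin (lo, hi]; the first bin also includes confidence exactly 0.
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or b == 0) and c <= hi]
        if not idx:
            continue
        bin_conf = sum(confidences[i] for i in idx) / len(idx)
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(bin_acc - bin_conf)
    return ece
```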

4. Comparison with Related Benchmarks

BrowseComp is distinguished from prior benchmarks in several respects:

| Benchmark | Core Focus | Difficulty / Novelty | Answer Format |
|---|---|---|---|
| TriviaQA, HotpotQA | Factoid retrieval / multi-hop | Many questions saturated by LMs | Long / freeform |
| BrowseComp (this) | Entangled, creative web browsing | Hard even for trained humans; agentic search required | Short / verifiable |
| WebArena, WebChoreArena | Simulated website interaction | Focused on navigation, memory, and calculation tasks | program_html or long |
| ScholarSearch | Academic deep retrieval | Requires scholarly search, multi-hop | Concise, sourced |
| BrowseComp-ZH | Non-English (Chinese) browsing | Culturally and infrastructurally complex | Short / verifiable |

BrowseComp’s requirement of hard-to-find, multi-page synthesis places it beyond the reach of trivial retrieval pipelines and demands abilities not thoroughly assessed elsewhere.

5. Technical Aspects and Implementation Details

Each question in BrowseComp is constructed with the following workflow:

  1. Reverse Design: Annotators start from a unique fact and craft a composite question that cannot be answered via direct retrieval algorithms.
  2. Human-in-the-Loop Filtering: Three-stage validation excludes trivial or ambiguous items.
  3. Grading Pipeline: Answers are graded with a textual grader prompt invoking semantic comparison logic, supporting automated, scalable benchmarking.
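
A schematic of this filtering workflow is sketched below; the predicate functions are hypothetical stand-ins for the manual and model-based checks described above, not part of any released tooling.

```python
from dataclasses import dataclass

# Illustrative sketch of the three-stage filtering workflow.
# The predicate callables are hypothetical placeholders for the annotator-run
# and model-based checks; they are not part of the benchmark's released code.

@dataclass
class CandidateQuestion:
    question: str
    answer: str  # short, uniquely verifiable reference answer

def passes_quality_control(
    item: CandidateQuestion,
    unanswerable_by_baseline_models,   # direct queries to GPT-4o / GPT-4.5 fail
    absent_from_first_page_results,    # search-engine first-page check does not surface it
    intractable_for_human_in_10_min,   # an experienced annotator cannot solve it quickly
) -> bool:
    """Keep a candidate only if all three validation stages reject easy solutions."""
    return (unanswerable_by_baseline_models(item)
            and absent_from_first_page_results(item)
            and intractable_for_human_in_10_min(item))
```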

The paper’s figures (rendered via tikz/pgfplots) track accuracy against the number of parallel attempts and the choice of aggregation strategy (Best-of-N, Majority Vote, Weighted Vote, plotted on a log-scale axis), demonstrating monotonic improvement as test-time compute scales. Detailed tables summarize performance across baseline and browsing-enabled models; calibration error is included for diagnostic rigor.

The benchmark package, documentation, and grading scripts are open sourced at https://github.com/openai/simple-evals.

6. Applications, Impact, and Future Directions

BrowseComp’s principal utility is as a discriminative benchmark for browsing agents that blend tool-use, memory, and strategic, iterative reasoning. Its applications include:

  • Comparative Agent Evaluation: Enables systematic comparison of architectures reliant on internal knowledge, retrieval augmentation, or agentic planning; supports analysis of compute-resource scaling and its effect on solution rate.
  • Agent Training Objective: Given its sensitivity to persistence and creative navigation, BrowseComp can be used both as a final evaluation suite and as a source of curriculum-style supervision for reinforcement learning or chain-of-thought prompting pipelines.
  • Extension to New Languages and Domains: The design has inspired language-specific analogs (BrowseComp-ZH) and is referenced in the context of academic search challenges (ScholarSearch), multimodal extensions (BrowseComp-VL), and fairer, reproducible corpus-controlled benchmarks (BrowseComp-Plus).

The benchmark’s persistently high unsolved rate, even among experienced humans, together with its resistance to brute-force search, positions it as a key testbed for next-generation dynamic, reasoning-capable web agents.

7. Limitations and Open Challenges

BrowseComp intentionally sidesteps certain complexities:

  • It does not aim to reflect the true distribution of end-user web search queries.
  • It focuses on short answer outputs, avoiding challenges in long answer generation or ambiguity resolution.
  • While highly discriminative, it is “incomplete” in the sense that it does not exhaustively measure all possible facets of browsing agent ability (e.g., subjective experience, user-adaptivity).

A key ongoing challenge is developing agents that can demonstrate high-fidelity, generalizable performance on BrowseComp without reliance on excessive test-time compute, and extending such approaches to new modalities, languages, and less controlled domains.


BrowseComp thus stands as a pivotal, rigorously controlled evaluation suite for analyzing and advancing the state of persistent, creative, real-world browsing agents. Its design, metrics, and open-source resources support both benchmarking and methodological innovation in the field of web-based autonomous intelligence (Wei et al., 16 Apr 2025).

References

  1. Wei, J., et al. (2025). BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. OpenAI, 16 April 2025.