WideSearch Benchmark Evaluation
- WideSearch is a benchmark designed to assess agentic systems' ability to exhaustively collect atomic data from real-world, high-volume search queries.
- It employs a rigorous five-stage pipeline, including expert gold standard annotation and iterative validation, to ensure data fidelity and completeness.
- Experimental results reveal that even advanced agentic architectures struggle with complete table-level success, highlighting critical gaps in current methods.
WideSearch is a benchmark engineered to rigorously evaluate the reliability and completeness of “agentic” information-seeking systems tasked with collecting large-scale atomic data. Distinct from benchmarks focused on deep reasoning or synthesis, WideSearch targets operational broad search—the repetitive, exhaustive acquisition of structured facts from diverse web sources—mirroring high-volume, real user workflows. Its construction and methodology address major gaps in automated agent evaluation for wide-context search, and its results reveal significant limitations in current agentic architectures.
1. Definition and Motivation
WideSearch is designed to specifically benchmark agentic search systems operating at scale across diverse topics and modalities. The central challenge is to exhaustively gather and verify thousands of distinct atomic entities, attributes, or facts in answer to well-defined, high-volume information-seeking queries. These tasks are drawn directly from real-world, temporally robust user demands (e.g., compiling comprehensive tables for graduate programs, listing all governmental organizations, or enumerating firm details across industry sectors). The motivating factor is that the majority of such operational search is bottlenecked by the reliability and scalability of automation—not by conceptual reasoning—making fidelity and completeness in extraction central to evaluation.
2. Dataset Composition and Design Principles
The WideSearch dataset comprises 200 hand-curated tasks (100 English, 100 Chinese) balanced across at least 15 domains and 18 topical areas. Every task pairs a natural language user query with a strict table schema that prescribes output columns. Data annotation follows precise rules:
- Sourced from authentic, high-traffic user queries.
- Temporal invariance (i.e., questions not tied to ephemeral states).
- Objective verifiability: every element in the ground-truth table can be checked against public sources.
- Tasks are rejected if strong LLMs answer from parametric knowledge (ensuring tool dependence).
- Only tasks whose required search time and number of distinct web sources exceed minimum thresholds are retained.
The gold standard answer for each task is curated by experts, who fully populate the table and record effort metrics such as time spent, queries issued, and documents consulted. Conceptual diagrams in the paper compare manual and agent-based workflows and illustrate the distinction between WideSearch and related benchmarks focused on “DeepSearch” or deep reasoning.
3. Quality Control and Evaluation Pipeline
WideSearch employs a rigorously enforced five-stage pipeline:
- Initial Sourcing and Question Restructuring: Human annotators select and refine real user questions.
- Expert Gold Standard Annotation: Experts search, collect, and verify all possible relevant facts, populating target tables and recording performance metrics.
- Knowledge Filtering: Candidate queries are filtered by strong LLMs; those answerable by parametric memory are excluded.
- Difficulty Pruning: Tasks too simple, or solvable in minimal time or with few sources, are removed.
- Iterative Validation: Automated scoring (LLM-as-judge via GPT-4.1 with strict semantic rules) is compared against human expert ratings; any task with judge-human agreement below 95% is revised or excluded (a minimal version of this agreement check is sketched below).
The pipeline ensures that only challenging, objectively verifiable, and tool-dependent search tasks remain, with each fact amenable to atomic verification.
4. Agentic System Benchmarks and Experimental Results
WideSearch was used to benchmark over 10 agentic search architectures, including single-agent LLM tool-users, multi-agent frameworks, and commercial end-to-end platforms. The critical metric is table-level accuracy: outputs are evaluated as complete tables, where any omission or error constitutes a task failure. Finer-grained metrics, such as item-level F1 scores for individual table cells, are also reported.
Observed results:
- Most systems achieved near-zero table-level success; even the strongest multi-agent configurations completed only a small fraction of tasks end-to-end.
- Item-level F1 scores for individual cells (with retries and aggregation) reached considerably higher values, but complete table success remained rare.
- Solo human annotators working under time constraints also fell well short of full success, underscoring the benchmark's difficulty even for skilled manual effort.
- Given unlimited time and cross-validation by multiple human annotators, near-complete success is achievable, reflecting true completeness when operational constraints are relaxed.
A plausible implication is that current agentic systems lack effective decomposition, verification, and completeness mechanisms required for operational reliability in wide-context search.
5. Technical Challenges Revealed
WideSearch exposes several key deficiencies in contemporary agentic search methods:
- Query Decomposition: Many agents fail to decompose complex queries into comprehensive sets of sub-queries, thereby omitting necessary attributes and constraints.
- Reflection and Refinement: There is minimal iterative adjustment or “reflection” when initial searches are incomplete or suboptimal.
- Evidence Attribution: Agents commonly mishandle attribution—retrieving relevant material but failing to validate, cite, or organize evidence for atomic verification.
- Hallucination and Recall: Absence of retrieved data triggers fallback to internal knowledge (hallucinations), undermining output reliability.
The operational significance is that bridging these gaps requires algorithmic advances in multi-agent teamwork, parallelized search, pipeline reflection, and rigorous evidence verification reminiscent of collaborative human workflows.
6. Comparative Analysis to Related Benchmarks
WideSearch differs fundamentally from other search and synthesis benchmarks:
| Benchmark | Focus | Outputs Evaluated |
|---|---|---|
| BrowseComp-Plus (Chen et al., 8 Aug 2025) | Deep research reasoning | Accurate citations, reasoning chains |
| HERB (Choubey et al., 29 Jun 2025) | Deep multi-hop retrieval | Complex, interconnected artifacts |
| WideSearch | Broad operational seeking | Exhaustive, complete fact tables |
WideSearch is not designed to assess deep reasoning or synthesis but rather the agent’s ability to operationalize wide-scale, fact-by-fact data collection with no gaps and no hallucinations. Related works in DeepSearch and DeepResearch focus on multi-hop reasoning and source aggregation, while WideSearch enforces strict completeness in atomic output.
7. Public Release and Community Implications
All WideSearch dataset resources, evaluation code, and benchmark results are publicly available at https://widesearch-seed.github.io/. The objective is to facilitate reproducibility, enable researchers to benchmark new systems, and encourage innovation in agentic architectures capable of overcoming the broad-seeking, high-fidelity operational challenges outlined in this framework.
This benchmark represents a rigorous evaluation standard for agent-based information seeking. It demarcates current system limitations and establishes a foundation for developing agents with scalable, reliable, and complete search capabilities suited to practical, real-world workflows.