- The paper introduces WideSearch, a benchmark for evaluating LLM-powered search agents on large-scale information gathering tasks.
- The benchmark is built with a rigorous five-stage curation pipeline that records human annotation effort (completion time, pages consulted) and uses iterative validation to keep automated scoring reliable.
- Results highlight low success rates and emphasize the need for advanced multi-agent architectures and iterative refinement strategies.
WideSearch: Benchmarking Agentic Broad Info-Seeking
Motivation and Problem Definition
WideSearch addresses a critical gap in the evaluation of LLM-powered search agents: their ability to perform large-scale, high-fidelity information gathering across diverse domains. Unlike DeepSearch (focused on locating specific, hard-to-find facts) and DeepResearch (centered on synthesizing complex narratives), WideSearch targets tasks characterized by operational scale rather than cognitive complexity. These tasks require agents to exhaustively collect atomic information for a set of entities and organize it into structured outputs, emulating real-world scenarios such as compiling sector-wide financial data or aggregating academic program requirements.
Figure 1: Conceptual comparison of manual and agent-based approaches for WideSearch tasks, highlighting operational workflows and failure modes.
Figure 2: Overview and comparison of DeepSearch, DeepResearch, and WideSearch paradigms across core tasks and evaluation methods.
Benchmark Construction and Methodology
WideSearch comprises 200 manually curated tasks (100 English, 100 Chinese) spanning 18 topics, each designed to require the collection of extensive, verifiable, and publicly accessible information. Benchmark construction follows a rigorous five-stage pipeline:
- Sourcing and Refinement: Real user queries are selected and refined for clarity and breadth.
- Gold Standard Annotation: Human annotators exhaustively search and compile ground-truth answers, recording metrics such as completion time and number of web pages consulted.
- Parametric Knowledge Filtering: Tasks solvable by LLMs without external tools are excluded.
- Difficulty-Based Pruning: Only tasks requiring significant human effort (≥10 minutes, ≥10 web pages consulted) are retained (see the filtering sketch after this list).
- Iterative Validation: Automated and human evaluations are aligned to ensure scoring reliability.
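Stages 3 and 4 amount to a filtering pass over candidate tasks. Below is a minimal sketch of that pass, assuming each candidate is a simple record with hypothetical fields such as `solvable_without_search`, `annotation_minutes`, and `pages_consulted`; the field names are illustrative, and only the thresholds (≥10 minutes, ≥10 web pages) come from the pipeline description above.

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the difficulty-based pruning stage.
MIN_MINUTES = 10
MIN_PAGES = 10

@dataclass
class CandidateTask:
    query: str
    solvable_without_search: bool  # flagged during parametric knowledge filtering
    annotation_minutes: float      # recorded during gold-standard annotation
    pages_consulted: int           # unique web pages visited by annotators

def keep_task(task: CandidateTask) -> bool:
    """Apply stages 3-4: drop tasks an LLM can answer from memory
    and tasks that did not require substantial human effort."""
    if task.solvable_without_search:
        return False
    return (task.annotation_minutes >= MIN_MINUTES
            and task.pages_consulted >= MIN_PAGES)

candidates = [
    CandidateTask("List 2024 tuition for all UC campuses", False, 140, 44),
    CandidateTask("Who wrote Hamlet?", True, 1, 1),
]
retained = [t for t in candidates if keep_task(t)]  # keeps only the first task
```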
Figure 3: Integrated data pipeline for WideSearch, detailing curation, validation, and automated evaluation stages.
The resulting tasks demand substantial procedural effort, with human annotators averaging 2.33 hours and consulting 44.1 unique web pages per task. The answer data volume per task ranges from hundreds to thousands of atomic facts, with an average of 2001.2 for Chinese and 938.6 for English tasks.
Figure 4: Distribution of 18 topics across the 200 WideSearch tasks, ensuring broad domain coverage.
Figure 5: Statistical distributions of completion time and breadth of research for Chinese and English tasks.
Evaluation Framework
WideSearch employs a hybrid automated evaluation pipeline combining deterministic rule-based checks and LLM-as-a-judge semantic scoring. Each agent output is parsed, normalized, and aligned with ground-truth tables using primary keys. Evaluation metrics include:
- Success Rate (SR): Binary, all-or-nothing measure of perfect table match.
- Row-level F1 Score: Measures precision and recall at the row (entity) level.
- Item-level F1 Score: Assesses fine-grained accuracy at the cell (atomic fact) level (a minimal scoring sketch follows this list).
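The sketch below illustrates these three metrics under simplifying assumptions: both the agent output and the gold standard are taken to be already parsed and normalized into dictionaries keyed by primary key, and cell comparison is reduced to exact string equality, whereas the actual pipeline also applies rule-based checks and LLM-as-a-judge semantic matching.

```python
from typing import Dict

Row = Dict[str, str]    # column name -> normalized cell value
Table = Dict[str, Row]  # primary key -> row

def f1(tp: int, n_pred: int, n_gold: int) -> float:
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def score(pred: Table, gold: Table) -> dict:
    # Item level: count matching cells across rows aligned by primary key.
    item_tp = sum(
        1
        for key, gold_row in gold.items()
        for col, val in gold_row.items()
        if pred.get(key, {}).get(col) == val
    )
    pred_items = sum(len(row) for row in pred.values())
    gold_items = sum(len(row) for row in gold.values())

    # Row level: a predicted row counts only if every one of its cells matches.
    row_tp = sum(1 for key, gold_row in gold.items() if pred.get(key) == gold_row)

    return {
        "item_f1": f1(item_tp, pred_items, gold_items),
        "row_f1": f1(row_tp, len(pred), len(gold)),
        "success": float(pred == gold),  # all-or-nothing exact table match
    }
```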
Multiple runs per task are aggregated using Avg@N, Pass@N, and Max@N strategies to capture both average and peak agent performance.
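The aggregation step can be read as follows (a sketch, assuming `runs` holds per-run scores such as those produced above): Avg@N averages a metric over the N runs, Pass@N asks whether any run achieved a full-table success, and Max@N takes the best single run.

```python
def avg_at_n(scores: list[float]) -> float:
    return sum(scores) / len(scores)

def pass_at_n(successes: list[float]) -> float:
    # 1.0 if any of the N runs produced a perfect table, else 0.0.
    return float(any(s == 1.0 for s in successes))

def max_at_n(scores: list[float]) -> float:
    return max(scores)

# Example: item-level F1 and success flags from N = 4 runs of one task.
runs = [0.62, 0.71, 0.55, 0.68]
print(avg_at_n(runs), max_at_n(runs), pass_at_n([0.0, 0.0, 1.0, 0.0]))
```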
Experimental Results
WideSearch benchmarks over 10 state-of-the-art agentic search systems, including single-agent, multi-agent, and commercial end-to-end frameworks. Across all of them, overall success rates remain low: even the strongest systems fall well short of the benchmark's all-or-nothing completeness requirement, with table-level success staying below 20% even under repeated attempts (see Test-Time Scaling below).
Error Analysis
Systematic analysis reveals four primary failure modes tied to advanced agentic capabilities:
- Incomplete Query Decomposition: Agents fail to generate comprehensive sub-queries, missing key constraints or attributes (a decomposition sketch follows this list).
Figure 7: Example of incomplete query decomposition—agent omits necessary sub-queries for required details.
- Lack of Reflection and Iterative Refinement: Agents do not adapt search strategies after initial failures, often abandoning tasks prematurely.
Figure 8: Example of lack of reflection—agent fails to refine search after receiving aggregated data.
- Failure in Evidence Utilization: Agents misattribute or misinterpret retrieved evidence, leading to incorrect outputs.
Figure 9: Example of evidence utilization failure—agent misattributes GPA requirement from wrong university.
- Knowledge Hallucination and Factual Inconsistency: Agents fabricate facts when external information is unavailable, resulting in factual errors.
Figure 10: Example of knowledge hallucination—agent invents entrance fee when no data is available.
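To make the first failure mode concrete, the sketch below shows what a comprehensive decomposition of a wide task looks like: every required attribute is expanded for every entity, with the task's global constraint carried into each sub-query. The entities, attributes, and constraint are invented for illustration; the point is only that dropping any element of this cross product produces the kind of incompleteness shown in Figure 7.

```python
from itertools import product

def decompose(entities: list[str], attributes: list[str], constraint: str) -> list[str]:
    """Exhaustive decomposition: one sub-query per (entity, attribute) pair,
    each restated with the task-level constraint so no detail is dropped."""
    return [f"{attr} of {ent} ({constraint})" for ent, attr in product(entities, attributes)]

# Hypothetical wide task: admission data for several universities.
sub_queries = decompose(
    entities=["University A", "University B", "University C"],
    attributes=["minimum GPA", "application deadline", "tuition"],
    constraint="2024 master's programs",
)
assert len(sub_queries) == 9  # |entities| x |attributes| sub-queries
```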
Beyond these, basic failure modes include tool invocation errors, output formatting errors, context-length exceedance, and response refusals.
Test-Time Scaling and Human Ceiling
Increasing the number of agent attempts (up to 128) improves item-level F1 scores but does not significantly raise table-level SR, which remains below 20%. This demonstrates that while individual fact retrieval is tractable, achieving exhaustive completeness and accuracy at scale is exceptionally difficult. The annotation of ground-truth tables itself requires multiple rounds of human cross-validation.
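For intuition, under the simplifying (and certainly imperfect) assumption that attempts succeed independently with a fixed per-attempt probability p, a Pass@N success rate still below 20% at N = 128 bounds p very tightly:

$$
\text{Pass@}N = 1 - (1 - p)^{N}, \qquad 1 - (1 - p)^{128} < 0.2 \;\Rightarrow\; p < 1 - 0.8^{1/128} \approx 1.7 \times 10^{-3}.
$$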
Implications and Future Directions
WideSearch exposes fundamental limitations in current LLM-agent architectures for broad information-seeking tasks. The primary bottleneck is not search capability per se, but the lack of advanced agentic skills: comprehensive planning, dynamic reflection, and rigorous evidence grounding. The benchmark sets a high bar for agent reliability, with strict success criteria that mirror real-world requirements for exhaustive and error-free data integration.
The results suggest that future progress will depend on:
- Sophisticated Multi-Agent Architectures: Parallel search and cross-validation, mimicking collaborative human workflows, are essential for scaling reliability.
- Enhanced Planning and Reflection Mechanisms: Agents must dynamically decompose queries and iteratively refine strategies in response to partial failures (a minimal loop is sketched after this list).
- Robust Evidence Attribution: Strict grounding in external sources is necessary to prevent hallucinations and misattributions.
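A minimal loop combining these ingredients might look like the sketch below. The `plan`, `search`, and `merge_evidence` callables are hypothetical placeholders for an agent's planner, search tool, and grounded table builder; the only structural claim is the control flow: decompose, search, check the partial table against the required schema, and re-plan only the missing cells.

```python
def run_wide_task(task, plan, search, merge_evidence, max_rounds: int = 3):
    """Sketch of a plan -> search -> check -> refine loop.

    `plan(task, missing)` returns sub-queries targeting the missing cells,
    `search(q)` returns (value, source_url) evidence for one sub-query, and
    `merge_evidence(table, results)` fills cells only when a source supports them.
    """
    table = {key: {col: None for col in task.columns} for key in task.entities}
    for _ in range(max_rounds):
        missing = [(k, c) for k, row in table.items() for c, v in row.items() if v is None]
        if not missing:
            break                                   # schema fully covered; stop refining
        sub_queries = plan(task, missing)           # re-plan only the gaps
        results = [search(q) for q in sub_queries]  # could be issued in parallel
        table = merge_evidence(table, results)      # keep only source-grounded cells
    return table
```

Multi-agent cross-validation would wrap this loop: run it independently in several agents and reconcile disagreements before accepting any cell into the final table.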
WideSearch provides a robust, objective testbed for driving research in these directions and for benchmarking future agentic systems.
Conclusion
WideSearch establishes a new standard for evaluating LLM-powered search agents on large-scale, high-fidelity information gathering tasks. The benchmark reveals that current systems, including advanced multi-agent frameworks and commercial solutions, are fundamentally challenged by the demands of completeness and accuracy at scale. The core deficiencies lie in advanced agentic capabilities rather than basic search or reasoning. WideSearch will serve as a critical resource for the development and assessment of next-generation agentic architectures, with multi-agent collaboration and dynamic planning identified as key avenues for future research.