WideSearch: Benchmarking Agentic Broad Info-Seeking (2508.07999v1)

Published 11 Aug 2025 in cs.CL

Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of LLMs, automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/

Summary

  • The paper introduces WideSearch, a benchmark for evaluating LLM-powered search agents on large-scale information gathering tasks.
  • It employs a rigorous five-stage curation pipeline, recording metrics such as human annotation time and search breadth, to ensure that every task is difficult, complete, and objectively verifiable.
  • Results highlight low success rates and emphasize the need for advanced multi-agent architectures and iterative refinement strategies.

WideSearch: Benchmarking Agentic Broad Info-Seeking

Motivation and Problem Definition

WideSearch addresses a critical gap in the evaluation of LLM-powered search agents: their ability to perform large-scale, high-fidelity information gathering across diverse domains. Unlike DeepSearch (focused on locating specific, hard-to-find facts) and DeepResearch (centered on synthesizing complex narratives), WideSearch targets tasks characterized by operational scale rather than cognitive complexity. These tasks require agents to exhaustively collect atomic information for a set of entities and organize it into structured outputs, emulating real-world scenarios such as compiling sector-wide financial data or aggregating academic program requirements (Figure 1).

Figure 1: Conceptual comparison of manual and agent-based approaches for WideSearch tasks, highlighting operational workflows and failure modes.

Figure 2: Overview and comparison of DeepSearch, DeepResearch, and WideSearch paradigms across core tasks and evaluation methods.

Benchmark Construction and Methodology

WideSearch comprises 200 manually curated tasks (100 English, 100 Chinese) spanning 18 topics, each designed to require extensive, verifiable, and publicly accessible information gathering. The benchmark construction follows a rigorous five-stage pipeline:

  1. Sourcing and Refinement: Real user queries are selected and refined for clarity and breadth.
  2. Gold Standard Annotation: Human annotators exhaustively search and compile ground-truth answers, recording metrics such as completion time and number of web pages consulted.
  3. Parametric Knowledge Filtering: Tasks solvable by LLMs without external tools are excluded.
  4. Difficulty-Based Pruning: Only tasks requiring significant effort (≥10 minutes, ≥10 web pages) are retained (a filtering sketch is given after this pipeline).
  5. Iterative Validation: Automated and human evaluations are aligned to ensure scoring reliability (Figure 3).

    Figure 3: Integrated data pipeline for WideSearch, detailing curation, validation, and automated evaluation stages.
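A minimal sketch of how pipeline stages 3 and 4 could be applied to candidate tasks, assuming each candidate carries the annotation metrics recorded in stage 2; the record fields and helper shown here are illustrative, not the authors' released tooling:

```python
from dataclasses import dataclass

@dataclass
class CandidateTask:
    query: str
    solvable_without_search: bool  # stage 3: answerable from parametric knowledge alone
    annotation_minutes: float      # stage 2 metric: human completion time
    pages_consulted: int           # stage 2 metric: unique web pages visited

def passes_filters(task: CandidateTask,
                   min_minutes: float = 10.0,
                   min_pages: int = 10) -> bool:
    """Stage 3: drop tasks solvable without external tools.
    Stage 4: keep only tasks demanding >= 10 minutes and >= 10 web pages."""
    if task.solvable_without_search:
        return False
    return task.annotation_minutes >= min_minutes and task.pages_consulted >= min_pages

# A task answered in 5 minutes from 3 pages would be pruned; one taking
# 140 minutes across 44 pages (roughly the benchmark average) is kept.
assert not passes_filters(CandidateTask("...", False, 5.0, 3))
assert passes_filters(CandidateTask("...", False, 140.0, 44))
```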

The resulting tasks demand substantial procedural effort, with human annotators averaging 2.33 hours and consulting 44.1 unique web pages per task. The answer data volume per task ranges from hundreds to thousands of atomic facts, with an average of 2001.2 for Chinese and 938.6 for English tasks (Figure 4).

Figure 4: Distribution of 18 topics across the 200 WideSearch tasks, ensuring broad domain coverage.

Figure 5: Statistical distributions of completion time and breadth of research for Chinese and English tasks.

Evaluation Framework

WideSearch employs a hybrid automated evaluation pipeline combining deterministic rule-based checks and LLM-as-a-judge semantic scoring. Each agent output is parsed, normalized, and aligned with ground-truth tables using primary keys. Evaluation metrics include the following (a computational sketch appears after the list):

  • Success Rate (SR): Binary, all-or-nothing measure of perfect table match.
  • Row-level F1 Score: Measures precision and recall at the row (entity) level.
  • Item-level F1 Score: Assesses fine-grained accuracy at the cell (atomic fact) level.
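A minimal sketch of how the row- and item-level scores above might be computed once outputs are aligned on primary keys; the table representation (a dict mapping primary key to a row of column/value pairs) and the exact-match cell comparison are simplifying assumptions, since the actual pipeline also applies LLM-as-a-judge semantic matching:

```python
def score_table(pred: dict, gold: dict):
    """pred/gold: primary_key -> {column_name: cell_value}."""
    matched_rows = matched_cells = 0
    gold_cells = sum(len(row) for row in gold.values())
    pred_cells = sum(len(row) for row in pred.values())

    for key, gold_row in gold.items():
        pred_row = pred.get(key)
        if pred_row is None:
            continue  # entity missing entirely
        # Item level: count agreeing cells (exact match stands in for the judge).
        hits = sum(1 for col, val in gold_row.items() if pred_row.get(col) == val)
        matched_cells += hits
        # Row level: the entity counts only if every one of its cells is correct.
        if hits == len(gold_row) == len(pred_row):
            matched_rows += 1

    def f1(tp, n_pred, n_gold):
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    row_f1 = f1(matched_rows, len(pred), len(gold))
    item_f1 = f1(matched_cells, pred_cells, gold_cells)
    success = item_f1 == 1.0 and len(pred) == len(gold)  # all-or-nothing SR
    return {"SR": success, "row_f1": row_f1, "item_f1": item_f1}
```

Under this all-or-nothing definition a single missing entity or wrong cell drops SR to zero, while the two F1 scores still award partial credit.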

Multiple runs per task are aggregated using Avg@N, Pass@N, and Max@N strategies to capture both average and peak agent performance.
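Assuming the conventional definitions of these aggregation strategies (the paper's exact formulas are not reproduced here), they can be sketched as:

```python
def aggregate(runs):
    """runs: list of (item_f1, success) pairs from N independent attempts at one task."""
    n = len(runs)
    return {
        "Avg@N":  sum(f1 for f1, _ in runs) / n,  # mean score across attempts
        "Pass@N": any(ok for _, ok in runs),      # at least one fully correct table
        "Max@N":  max(f1 for f1, _ in runs),      # best single attempt
    }

# e.g. aggregate([(0.71, False), (0.83, False), (1.00, True)])
#      -> {'Avg@N': ~0.85, 'Pass@N': True, 'Max@N': 1.0}
```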

Experimental Results

WideSearch benchmarks over 10 state-of-the-art agentic search systems, including single-agent, multi-agent, and commercial end-to-end frameworks. Key findings include:

  • Extremely Low Success Rates: Most systems achieve near 0% SR; the best performer (OpenAI o3, multi-agent) reaches only 5.1%. Even humans, given unlimited time and tools, achieve only 20% SR in single-attempt mode.
  • Partial Success at Item Level: Item-level F1 scores can approach 80% with sufficient retries, indicating that individual fact retrieval is not the bottleneck.
  • Multi-Agent Advantage: Multi-agent frameworks consistently outperform single-agent setups in F1 scores, leveraging parallelism and task decomposition.
  • Commercial System Limitations: Leading commercial models (Gemini 2.5 Pro, Claude Sonnet 4, OpenAI o3) are not optimized for large-scale, systematic information integration, often failing to produce well-structured outputs (Figure 6).

    Figure 6: Heatmap of row-level F1 scores across domains and languages for single-agent and multi-agent frameworks.

Error Analysis

Systematic analysis reveals four primary failure modes involving advanced agentic capabilities:

  1. Incomplete Query Decomposition: Agents fail to generate comprehensive sub-queries, missing key constraints or attributes (Figure 7).

    Figure 7: Example of incomplete query decomposition—agent omits necessary sub-queries for required details.

  2. Lack of Reflection and Iterative Refinement: Agents do not adapt search strategies after initial failures, often abandoning tasks prematurely (Figure 8).

    Figure 8: Example of lack of reflection—agent fails to refine search after receiving aggregated data.

  3. Failure in Evidence Utilization: Agents misattribute or misinterpret retrieved evidence, leading to incorrect outputs (Figure 9).

    Figure 9: Example of evidence utilization failure—agent misattributes GPA requirement from wrong university.

  4. Knowledge Hallucination and Factual Inconsistency: Agents fabricate facts when external information is unavailable, resulting in factual errors (Figure 10).

    Figure 10: Example of knowledge hallucination—agent invents entrance fee when no data is available.

In addition to these advanced failure modes, basic failures include tool invocation errors, output formatting errors, context-length exceedance, and response refusals.

Test-Time Scaling and Human Ceiling

Increasing the number of agent attempts (up to 128) improves item-level F1 scores but does not significantly raise table-level SR, which remains below 20%. This demonstrates that while individual fact retrieval is tractable, achieving exhaustive completeness and accuracy at scale is exceptionally difficult. The annotation of ground-truth tables itself requires multiple rounds of human cross-validation.
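A back-of-the-envelope calculation (an illustration assuming independent cell errors, not an analysis from the paper) makes the gap intuitive: if each of a task's K atomic cells is produced correctly with probability p, the chance of a perfect table is roughly SR ≈ p^K. Even p = 0.999 over K = 1000 cells gives SR ≈ 0.37, and p = 0.8, comparable to the observed item-level F1, gives an SR that is effectively zero, which is why strong item-level scores coexist with near-zero table-level success.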

Implications and Future Directions

WideSearch exposes fundamental limitations in current LLM-agent architectures for broad information-seeking tasks. The primary bottleneck is not search capability per se, but the lack of advanced agentic skills: comprehensive planning, dynamic reflection, and rigorous evidence grounding. The benchmark sets a high bar for agent reliability, with strict success criteria that mirror real-world requirements for exhaustive and error-free data integration.

The results suggest that future progress will depend on:

  • Sophisticated Multi-Agent Architectures: Parallel search and cross-validation, mimicking collaborative human workflows, are essential for scaling reliability.
  • Enhanced Planning and Reflection Mechanisms: Agents must dynamically decompose queries and iteratively refine strategies in response to partial failures.
  • Robust Evidence Attribution: Strict grounding in external sources is necessary to prevent hallucinations and misattributions.

WideSearch provides a robust, objective testbed for driving research in these directions and for benchmarking future agentic systems.

Conclusion

WideSearch establishes a new standard for evaluating LLM-powered search agents on large-scale, high-fidelity information gathering tasks. The benchmark reveals that current systems, including advanced multi-agent frameworks and commercial solutions, are fundamentally challenged by the demands of completeness and accuracy at scale. The core deficiencies lie in advanced agentic capabilities rather than basic search or reasoning. WideSearch will serve as a critical resource for the development and assessment of next-generation agentic architectures, with multi-agent collaboration and dynamic planning identified as key avenues for future research.
