WideSearch: Benchmarking Agentic Broad Info-Seeking (2508.07999v1)

Published 11 Aug 2025 in cs.CL

Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of LLMs, automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/

Summary

  • The paper introduces WideSearch, a benchmark for LLM-powered agents performing exhaustive information aggregation across 200 curated tasks in English and Chinese.
  • It details a five-stage curation and evaluation methodology using success rate, row-level F1, and item-level F1 metrics to assess agent performance.
  • Experiments reveal that multi-agent frameworks outperform single-agent ones, while highlighting persistent challenges in query decomposition, evidence grounding, and scalability.

WideSearch: Benchmarking Agentic Broad Info-Seeking

Motivation and Problem Definition

WideSearch addresses a critical gap in the evaluation of LLM-powered search agents: their ability to perform large-scale, high-fidelity information gathering across diverse domains. Unlike DeepSearch (vertical, multi-hop reasoning for hard-to-find facts) and DeepResearch (complex synthesis for report generation), WideSearch focuses on tasks characterized by operational scale rather than cognitive complexity. These tasks require agents to exhaustively collect atomic facts for a set of entities and organize them into structured outputs, emulating real-world scenarios such as compiling sector-wide financial data or aggregating university admissions requirements (Figure 1).

Figure 1: A conceptual comparison of manual and agent-based approaches for WideSearch tasks, highlighting the operational workflow and limitations of each methodology.

Figure 2: Overview and detailed comparison of DeepSearch, DeepResearch, and WideSearch, illustrating their distinct operational domains and evaluation paradigms.

Benchmark Construction and Methodology

WideSearch comprises 200 manually curated tasks (100 English, 100 Chinese) spanning 18 topics, each requiring the agent to populate a table with entity-attribute pairs sourced from the web. The benchmark design enforces six principles: high search volume, temporal/contextual invariance, objective verifiability, public accessibility, reliance on external tools, and scenario diversity. Tasks are sourced from real user queries, refined by domain experts, and subjected to a five-stage curation and validation pipeline to ensure complexity, verifiability, and resistance to solution from parametric knowledge alone (Figure 3).

Figure 3: Integrated data pipeline for WideSearch, detailing the five-stage curation and automated evaluation process.
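
To make the task format concrete, the sketch below shows what one such table-filling task might look like as a data record. The field names and the example question are hypothetical, chosen for illustration; they do not reflect the released dataset schema.

```python
# Illustrative sketch of a WideSearch-style task record. All field names are
# hypothetical and chosen for clarity; the released dataset may use a
# different schema.
from dataclasses import dataclass


@dataclass
class WideSearchTask:
    task_id: str
    language: str                 # "en" or "zh"
    question: str                 # natural-language collection request
    primary_key: str              # column used to align rows during evaluation
    required_columns: list[str]   # attributes to fill for every entity
    ground_truth: list[dict]      # one dict per entity, keyed by column name


example = WideSearchTask(
    task_id="demo-001",
    language="en",
    question=("List every university in region X with its minimum GPA "
              "requirement and application deadline for the 2025 intake."),
    primary_key="university",
    required_columns=["university", "min_gpa", "application_deadline"],
    ground_truth=[
        {"university": "Example University A", "min_gpa": "3.0",
         "application_deadline": "2025-01-15"},
        # ... one row per entity; each cell is an atomic, objectively verifiable fact
    ],
)
```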

The annotation protocol includes exhaustive web search, recording of procedural metrics (completion time, queries issued, web pages consulted), parametric knowledge filtering, and iterative validation to align automated and human scoring. The final dataset exhibits substantial complexity: average human completion time is 2.33 hours, with annotators consulting an average of 44.1 unique web pages per task (Figure 4).

Figure 4: Distribution of 18 topics across the 200 WideSearch tasks, ensuring broad domain coverage.

Figure 5: Statistical distributions of completion time and breadth of research for Chinese and English tasks, demonstrating the procedural depth required.

Evaluation Framework

WideSearch employs a hybrid automated evaluation pipeline combining deterministic rule-based checks and LLM-as-a-judge semantic scoring. Each agent output is parsed, normalized, and aligned with ground-truth tables using primary keys. Cell-wise evaluation leverages exact match, numerical/date/URL normalization, and LLM-based semantic equivalence for complex columns. Metrics include:

  • Success Rate (SR): Strict table-level match.
  • Row-level F1: Precision/recall for complete entity records.
  • Item-level F1: Fine-grained cell accuracy.

Multiple runs per task (N) are aggregated via Avg@N, Pass@N, and Max@N strategies to capture both average and peak agent performance.
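
Below is a minimal sketch of how these metrics could be computed, assuming agent outputs and ground truth have already been parsed into lists of row dictionaries aligned by a primary-key column. Cell comparison here is plain string equality; the actual pipeline additionally applies rule-based normalization and LLM-based semantic matching for complex columns.

```python
# Minimal metric sketch: strict success rate (SR), row-level F1, and item-level
# F1 over primary-key-aligned tables. Assumes cells are already normalized
# strings; the real pipeline also uses rule-based checks and an LLM judge.

def f1(precision: float, recall: float) -> float:
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def evaluate(pred_rows: list[dict], gold_rows: list[dict], key: str) -> dict:
    pred = {r[key]: r for r in pred_rows}
    gold = {r[key]: r for r in gold_rows}

    # Item level: count matching (entity, attribute, value) triples.
    pred_items = {(k, col, v) for k, r in pred.items() for col, v in r.items()}
    gold_items = {(k, col, v) for k, r in gold.items() for col, v in r.items()}
    item_tp = len(pred_items & gold_items)
    item_p = item_tp / len(pred_items) if pred_items else 0.0
    item_r = item_tp / len(gold_items) if gold_items else 0.0

    # Row level: a predicted row counts only if every cell matches its gold row.
    row_tp = sum(1 for k, r in pred.items() if k in gold and r == gold[k])
    row_p = row_tp / len(pred) if pred else 0.0
    row_r = row_tp / len(gold) if gold else 0.0

    return {
        "success": pred == gold,  # strict table-level match (SR)
        "row_f1": f1(row_p, row_r),
        "item_f1": f1(item_p, item_r),
    }


def aggregate(results: list[dict]) -> dict:
    # Avg@N / Pass@N / Max@N style aggregation over N runs of the same task.
    n = len(results)
    return {
        "avg_sr": sum(r["success"] for r in results) / n,
        "pass_at_n": any(r["success"] for r in results),
        "max_row_f1": max(r["row_f1"] for r in results),
    }
```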

Experimental Results

WideSearch benchmarks over 10 state-of-the-art agentic systems, including single-agent, multi-agent, and commercial end-to-end frameworks. All agents are equipped with standardized search and web reading tools. The multi-agent framework decomposes queries and executes parallel sub-tasks, mimicking collaborative human workflows.
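
The following schematic sketches the divide-and-conquer pattern described above. It is not the paper's implementation: `plan_subqueries`, `search_and_extract`, and `merge_rows` are hypothetical placeholders for LLM- and tool-backed components.

```python
# Schematic sketch of a multi-agent wide-search loop: a planner splits the broad
# question into independent sub-tasks, workers search them in parallel, and the
# partial rows are merged into one structured table. The helper callables are
# placeholders, not the paper's actual framework.
from concurrent.futures import ThreadPoolExecutor


def run_wide_search(question: str, plan_subqueries, search_and_extract, merge_rows):
    # 1. Decompose the broad request into independent sub-queries.
    subqueries = plan_subqueries(question)

    # 2. Execute sub-queries in parallel, each returning partial table rows.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partial_results = list(pool.map(search_and_extract, subqueries))

    # 3. Merge and deduplicate partial rows into the final table.
    return merge_rows(partial_results)
```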

Key findings:

  • Success rates are extremely low: Most systems achieve near 0% SR; the best performer (OpenAI o3, multi-agent) reaches only 5.1%.
  • Item-level F1 can be high: With sufficient retries, item-level F1 approaches 80%, indicating that atomic fact retrieval is not the bottleneck.
  • Human annotators also struggle: Single annotator SR is 20%, underscoring the inherent difficulty and the strictness of the evaluation criteria.
  • Multi-agent frameworks outperform single-agent ones: Consistent improvements in F1 scores across all domains and languages validate the divide-and-conquer approach (Figure 6).

    Figure 6: Model performance (Row-level F1 Score) across domains and languages, comparing single-agent and multi-agent frameworks.

Error Analysis

Systematic analysis reveals four primary failure modes in advanced agentic behavior:

  1. Incomplete Query Decomposition: Agents fail to generate comprehensive sub-queries, missing key constraints or attributes (Figure 7).

    Figure 7: Example of incomplete query decomposition, where the agent omits necessary sub-queries for additional attributes.

  2. Lack of Reflection and Iterative Refinement: Agents do not adapt search strategies after initial failures, often abandoning tasks prematurely (Figure 8).

    Figure 8: Example of lack of reflection, with the agent failing to refine its search after receiving aggregated data.

  3. Failure in Evidence Utilization: Agents misattribute or misinterpret retrieved evidence, leading to incorrect outputs (Figure 9).

    Figure 9: Example of evidence utilization failure, with incorrect attribution of GPA requirements.

  4. Knowledge Hallucination and Factual Inconsistency: Agents fabricate facts when external information is unavailable, violating grounding requirements (Figure 10).

    Figure 10: Example of knowledge hallucination, with the agent inventing a non-existent entrance fee.

Basic failure modes include tool invocation errors, output formatting errors, context length exceedance, and response refusals.

Test-Time Scaling and Human Consistency

Test-time scaling experiments (up to 128 attempts per task) show that while item-level F1 improves substantially, table-level SR remains low (<20%), confirming that exhaustive, error-free aggregation is the core challenge. The automated evaluation pipeline achieves >97.8% consistency with human judgment, validating its reliability.
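
A back-of-the-envelope illustration (assuming, purely for intuition, that cells are answered independently, which is not a claim from the paper) of why item-level accuracy can be high while strict table-level success stays low: with per-cell accuracy p, the chance of an entirely correct k-cell table is p^k, which collapses as tables grow wide.

```python
# Illustration only: independence across cells is an assumption for intuition.
# With per-cell accuracy p, an all-correct k-cell table has probability p**k.
for p in (0.80, 0.95, 0.99):
    for k in (20, 50, 100):
        print(f"per-cell accuracy {p:.2f}, {k:3d} cells -> "
              f"all-correct probability {p**k:.4f}")
```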

Implications and Future Directions

WideSearch demonstrates that current LLM-based agents are fundamentally limited in large-scale, high-fidelity information seeking. The bottleneck is not atomic fact retrieval but the orchestration of comprehensive, error-free aggregation and verification. The strict all-or-nothing SR metric exposes the fragility of agentic workflows: any omission, hallucination, or misattribution results in total task failure.

The results highlight the urgent need for:

  • Advanced agentic capabilities: Improved planning, reflection, evidence grounding, and iterative refinement.
  • Robust multi-agent architectures: Parallel search and cross-validation to mimic collaborative human annotation.
  • Domain-adaptive strategies: Tailored workflows for challenging domains (e.g., academia, transportation).
  • Evaluation at scale: Benchmarks like WideSearch are essential for driving progress in agent reliability and real-world applicability.

Conclusion

WideSearch establishes a rigorous benchmark for evaluating agentic broad info-seeking, revealing critical deficiencies in current LLM-powered search agents. The findings indicate that future progress depends on the development of sophisticated agent architectures, particularly multi-agent systems capable of parallel search and cross-validation. WideSearch provides a robust testbed for measuring and advancing the reliability, completeness, and fidelity of agentic information gathering at scale.
