
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation (2505.15872v2)

Published 21 May 2025 in cs.IR and cs.CL

Abstract: Retrieval-Augmented Generation (RAG) enhances LLMs by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.

Authors (13)
  1. Yunjia Xi (21 papers)
  2. Jianghao Lin (47 papers)
  3. Menghui Zhu (15 papers)
  4. Yongzhao Xiao (1 paper)
  5. Zhuoying Ou (1 paper)
  6. Jiaqi Liu (102 papers)
  7. Tong Wan (6 papers)
  8. Bo Chen (309 papers)
  9. Weiwen Liu (59 papers)
  10. Yasheng Wang (91 papers)
  11. Ruiming Tang (171 papers)
  12. Weinan Zhang (322 papers)
  13. Yong Yu (219 papers)

Summary

LLMs often struggle with factual accuracy, outdated knowledge, and real-time information access. Retrieval-Augmented Generation (RAG) helps address this by grounding LLM responses with external information. Agentic RAG further enhances this by employing autonomous LLM agents for dynamic information seeking, often involving multi-turn interactions with tools like search engines and browsers in live web environments, as seen in systems like OpenAI's Deep Research, Gemini, and Perplexity.

However, existing RAG benchmarks are insufficient for evaluating these agentic systems. They typically rely on static, limited corpora and simple queries that don't require complex agent behaviors. Their evaluation methods often depend on pre-defined gold document sets, unsuitable for the dynamic, open-ended nature of the web where information is constantly changing.

To address these limitations, the paper introduces InfoDeepSeek, a new benchmark designed specifically for evaluating agentic information seeking in real-world, dynamic web environments. The core contributions are:

  1. A systematic methodology for constructing challenging queries: Questions are curated to be determinate (clear, stable answers), difficult (requiring multi-turn search and reasoning, hard for single-turn models), and diverse (covering various domains, languages, and specific attributes like multi-hop reasoning, long-tail knowledge, freshness, time sensitivity, distracting information, and false premises).
  2. An Agentic RAG framework: A practical system is developed that integrates multiple search and browsing tools (Google, Bing, Yahoo, DuckDuckGo, Selenium browser) for information seeking in live web environments.
  3. Fine-grained evaluation metrics: A framework tailored for dynamic environments is proposed, including metrics beyond final answer accuracy (a computation sketch follows this list):
    • Answer Accuracy (ACC): Whether the final answer based on all observations matches the ground truth.
    • Information Accuracy (IA@k): Whether an answer generated from the top-k selected evidence is correct, assessing evidence quality.
    • Effective Evidence Utilization (EEU): The ratio of the best achievable IA@k (across all k) to ACC, measuring how well the agent's selection stage extracts useful information from retrieved observations.
    • Information Compactness (IC): Quantifies the density of the selected evidence compared to the necessary ground-truth sources, penalizing redundancy or over-retrieval.
  4. Empirical evaluation and insights: Extensive experiments using InfoDeepSeek reveal nuanced agent behaviors and performance bottlenecks across different LLMs, search engines, and question types, offering guidance for future research.
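
To make the relationships between these metrics concrete, here is a minimal Python sketch of how EEU and a simplified IC could be computed from per-question results. The names `ia_at_k`, `acc`, `num_selected`, and `num_gold_sources`, the zero-ACC convention, and the simplified IC ratio are illustrative assumptions, not the paper's exact formulation.

    def effective_evidence_utilization(ia_at_k: dict[int, float], acc: float) -> float:
        """EEU: the best achievable IA@k across all k, divided by ACC (as defined above)."""
        if acc == 0:
            return 0.0  # convention for the degenerate case; the paper may handle it differently
        return max(ia_at_k.values()) / acc

    def information_compactness(num_selected: int, num_gold_sources: int) -> float:
        """Simplified IC proxy: selected-evidence count over necessary ground-truth sources.
        Values above 1 suggest redundancy or over-retrieval; this is an illustrative
        simplification, not the paper's exact formula."""
        return num_selected / max(num_gold_sources, 1)

    # Illustrative numbers only: IA@1=0.15, IA@3=0.19, IA@5=0.18, ACC=0.22 -> EEU ~ 0.86 (< 1)
    eeu = effective_evidence_utilization({1: 0.15, 3: 0.19, 5: 0.18}, acc=0.22)
    ic = information_compactness(num_selected=5, num_gold_sources=2)   # 2.5 -> over-retrieval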

Agentic RAG Framework Implementation:

The framework follows the standard RAG stages: Retrieval, Augmentation, and Generation. A minimal end-to-end sketch appears after the stage descriptions below.

  • Retrieval Stage: An LLM agent takes the query $q$, initiates a plan $\pi_0$, and performs a sequence of up to $T$ steps. At each step $t$, it observes $o_t$, reflects on its memory $h_t$ and plan $\pi_t$ to update $\pi_{t+1}$, and then selects and executes an action $a_{t+1}$ using a tool (search engine, browser, time tool, termination) to obtain $o_{t+1}$. This loop generates raw observations $O = \{o_1, \dots, o_T\}$. The LLM is prompted with explicit instructions for planning, reflection, memory usage, and tool specifications.
    You are a {agent_name}, {agent_bio}, {agent_instructions}
    ...
    {tool_specification}
    {current_date_and_time}
    {memory}
    Given Query: {query}
    ...
    Based on the given question and existing tasks, plan a new Task (no repetitions), and you can only generate the Task in the following **JSON list** format:
    ...
    A new Task:
  • Augmentation Stage: The agent filters and distills the noisy observations $O$ into a concise set of evidence $C = \text{SelectRelevant}(q, O)$. This involves selecting relevant documents/passages and ranking them. The set $C$ has a dynamic size $n_q$ up to a maximum $n$ (default $n = 5$). This stage is crucial for noise reduction.
    
    You are a {agent_name}, {agent_bio}, {agent_instructions}
    The current stage is webpage ranking stage. In the previous interactions, you have already found several webpages in response to the user's query. Now, you need to consolidate this information and select the {max_webpage_num} most relevant webpages, then rank them.
    ...
    Given Query: {query}
    You must generate the list of webpages strictly in the following **JSON list** format:
    ...
    Relevant webpages (ranked by importance):
  • Generation Stage: The agent produces the final answer $\hat{y}_q = \text{Generate}(C, q)$ based on the selected evidence $C$ and the original query $q$.
    
    You are {agent_name}, {agent_bio}, {agent_instructions}
    Currently, you are in the question-answering stage. Based on your own knowledge and relevant webpages, answer the given query from the user.
    ...
    Given query: {query}
    Relevant webpages: {webpages}
    Generate a brief English answer to solve the user's query:
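
Read together, the three stages form a loop-then-distill-then-answer pipeline. Below is a minimal Python sketch of that control flow, assuming a generic `llm(prompt)` model call, a `TOOLS` registry of search/browse/time functions, and JSON-formatted agent outputs; all names, prompts, and signatures here are illustrative assumptions rather than the paper's actual implementation.

    import json
    from typing import Any, Callable

    # Placeholder interfaces -- to be wired to a real model client and real
    # search/browse/time tools. Names and signatures are illustrative only.
    def llm(prompt: str) -> str:
        raise NotImplementedError

    TOOLS: dict[str, Callable[..., str]] = {}   # e.g. "search", "browse", "time"

    def retrieve(query: str, max_steps: int = 10) -> list[dict]:
        """Retrieval stage: plan, act with a tool, observe, reflect -- up to T steps."""
        memory: list[dict[str, Any]] = []        # h_t: past tasks, actions, observations
        observations: list[dict[str, Any]] = []  # O = {o_1, ..., o_T}
        for _ in range(max_steps):
            # The agent plans the next task and picks a tool call (JSON output assumed).
            step = json.loads(llm(
                f"Memory: {memory}\nGiven Query: {query}\n"
                "Plan a new Task and its tool call as a JSON object."))
            if step.get("tool") == "terminate":
                break
            obs = TOOLS[step["tool"]](**step.get("args", {}))    # a_{t+1} -> o_{t+1}
            observations.append({"task": step.get("task"), "observation": obs})
            memory.append({**step, "observation": obs})          # reflection reads updated memory
        return observations

    def select_evidence(query: str, observations: list[dict], n: int = 5) -> list[dict]:
        """Augmentation stage: C = SelectRelevant(q, O), at most n ranked webpages."""
        ranked = json.loads(llm(
            f"Given Query: {query}\nObservations: {observations}\n"
            f"Select and rank the {n} most relevant webpages as a JSON list."))
        return ranked[:n]

    def generate_answer(query: str, evidence: list[dict]) -> str:
        """Generation stage: y_hat = Generate(C, q)."""
        return llm(
            f"Given query: {query}\nRelevant webpages: {evidence}\n"
            "Generate a brief English answer to solve the user's query:")

    def agentic_rag(query: str) -> str:
        observations = retrieve(query)
        evidence = select_evidence(query, observations)
        return generate_answer(query, evidence)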

Dataset Construction Methodology:

The 245 questions are manually curated through a multi-stage process involving seven annotators:

  1. Fact-Grounded Query Drafting: Starting from credible web sources (Wikipedia, news, academic sites), annotators identify "anchor knowledge," often long-tail or obscure facts, and formulate questions. A reverse construction approach (start from answer, form question) ensures verifiability. Temporal stability is checked and constrained with time anchors if needed.
  2. Expand from Anchor Knowledge: Complexity is increased by combining anchors with other facts (common or difficult) through multi-hop composition and linking to other difficulty attributes. This prevents single-turn search solutions.
  3. Diversification: Annotators proactively seek out questions covering underrepresented attributes, domains (14 covered like politics, science, arts, news), and predominant languages (19 covered, including English, Chinese, Japanese, Spanish, etc.).
  4. Determinacy and Difficulty Filtering: Drafts are checked against multiple sources for correctness and uniqueness. A key filter involves testing with GPT-4o and DeepSeek-R1 using single-turn web search; questions solvable by both are discarded (see the sketch after this list).
  5. Multi-Stage Validation: Each question is independently reviewed by two annotators for content, criteria, and metadata. A third adjudicator resolves disagreements, ensuring data quality and adherence to criteria. Annotations include ground-truth answers, source URLs, and metadata.
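
As a concrete reading of step 4, the difficulty filter can be sketched as below. The helpers `single_turn_answer` and `is_correct` are hypothetical stand-ins for a single-turn web-search call to each reference model and an answer check; they are assumptions, not the paper's tooling.

    def single_turn_answer(model: str, question: str) -> str:
        """Hypothetical helper: query `model` once, with web search enabled."""
        raise NotImplementedError

    def is_correct(answer: str, ground_truth: str) -> bool:
        """Hypothetical helper: judge whether `answer` matches `ground_truth`."""
        raise NotImplementedError

    def passes_difficulty_filter(question: str, ground_truth: str) -> bool:
        """Keep a question only if at least one reference model fails it
        under single-turn web search (step 4 above)."""
        solved_by_all = all(
            is_correct(single_turn_answer(model, question), ground_truth)
            for model in ("gpt-4o", "deepseek-r1"))
        return not solved_by_all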

Evaluation Framework:

Answer correctness is evaluated by both human annotators and LLM-based automatic evaluators (DeepSeek-V3 and Gemini-2.0-Flash as primary judges, with GPT-4o-mini as arbiter). For false-premise questions, where the query contains an incorrect assumption, ground-truth answers explicitly state the false premise, and specialized evaluation prompts are used for the LLM judges, significantly improving auto-evaluation accuracy (from 95.57% to 99.29%).
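
A minimal sketch of one plausible arbitration scheme is shown below, assuming the two primary judges vote first and the arbiter model is consulted only when they disagree; the `judge` helper and the exact arbitration rule are assumptions rather than the paper's documented protocol.

    def judge(model: str, question: str, answer: str, ground_truth: str) -> bool:
        """Hypothetical helper: ask `model` whether `answer` matches `ground_truth`,
        using a specialized prompt for false-premise questions."""
        raise NotImplementedError

    def auto_evaluate(question: str, answer: str, ground_truth: str) -> bool:
        """Two primary judges vote; an arbiter breaks ties on disagreement."""
        verdicts = [judge(m, question, answer, ground_truth)
                    for m in ("deepseek-v3", "gemini-2.0-flash")]
        if verdicts[0] == verdicts[1]:
            return verdicts[0]
        return judge("gpt-4o-mini", question, answer, ground_truth)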

Benchmarking Results & Practical Implications:

Experiments using InfoDeepSeek yield several key findings:

  • LLM Performance: Even state-of-the-art LLMs struggle on InfoDeepSeek (best ACC 22.45%). Models optimized for reasoning and search (Gemini-2.5-Pro, DeepSeek-R1, Gemini-2.5-Flash) perform better, suggesting the importance of these capabilities for agentic tasks. Most models show EEU below 1, indicating difficulty extracting useful evidence. High IC scores point to redundant retrieval.
  • Search Engine Impact: The choice of search engine significantly impacts performance. Google and Yahoo consistently outperform Bing and DuckDuckGo. Better search quality can partially compensate for weaker LLMs. General-purpose search engines are better starting points for agentic RAG.
  • Question Attributes: Agents perform better on simpler attributes (false premise, time sensitivity, freshness) and worse on multi-hop, long-tail, and distracting information questions. Improvements from stronger LLMs are less pronounced on harder attributes, highlighting bottlenecks in retrieval and noise handling.
  • Test-time Scaling: Performance (ACC, IA@k, IC) improves as the maximum number of retrieval steps $T$ increases, confirming that agentic systems can benefit from more computational resources for exploration.
  • Retrieval Interference: A significant issue where LLMs correctly answer questions internally but fail after retrieving external information (interference rates up to 80%). This suggests that noisy or tangentially relevant web content can degrade performance. Mitigation strategies are needed, such as improving models' confidence in internal knowledge, better evidence filtering, and knowledge consistency checks (a measurement sketch follows this list).
  • Language Impact: English queries generally outperform Chinese. Using a language-aware prompt instructing the agent to search in the query's predominant language yields the best results, especially benefiting LLMs with weaker inherent multilingual capabilities. This highlights the need for language-aware retrieval strategies in a multilingual web.
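
As a hedged sketch, retrieval interference could be measured as the fraction of questions a model answers correctly from internal knowledge alone but gets wrong after retrieval. The per-question fields assumed here (`correct_closed_book`, `correct_with_retrieval`) are illustrative; the paper's exact definition may differ.

    def interference_rate(results: list[dict]) -> float:
        """Share of closed-book-correct questions that become wrong after retrieval."""
        closed_book_correct = [r for r in results if r["correct_closed_book"]]
        if not closed_book_correct:
            return 0.0
        harmed = [r for r in closed_book_correct if not r["correct_with_retrieval"]]
        return len(harmed) / len(closed_book_correct)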

Implementation Considerations:

Deploying Agentic RAG systems based on these findings requires:

  • Selecting capable base LLMs with strong reasoning and generation abilities.
  • Integrating high-quality, general-purpose search tools.
  • Developing robust evidence filtering and distillation mechanisms to handle noisy web content.
  • Implementing strategies to mitigate retrieval interference.
  • Incorporating language awareness to leverage multilingual web resources effectively.
  • Designing flexible control flow (planning, reflection) to enable multi-turn interaction and adaptation.
  • Considering computational budgets ($T$) and their impact on performance vs. cost.
  • Using LLM-based evaluation, carefully designed to handle complex cases like false premises.

Limitations:

The dataset construction is currently manual, which is costly and time-consuming. Future work aims to automate this process with manual verification.

Broader Impacts:

The research has the potential to improve factual accuracy and reliability in AI applications, benefiting fields like healthcare and research. However, it also highlights risks like misinformation amplification and potential misuse if systems retrieve or generate biased/misleading content, emphasizing the need for careful evaluation and mitigation strategies.