LLMs often struggle with factual accuracy, outdated knowledge, and real-time information access. Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM responses in external information. Agentic RAG goes further, employing autonomous LLM agents for dynamic information seeking, often through multi-turn interactions with tools such as search engines and browsers in live web environments, as seen in systems like OpenAI's Deep Research, Gemini, and Perplexity.
However, existing RAG benchmarks are insufficient for evaluating these agentic systems. They typically rely on static, limited corpora and simple queries that don't require complex agent behaviors. Their evaluation methods often depend on pre-defined gold document sets, unsuitable for the dynamic, open-ended nature of the web where information is constantly changing.
To address these limitations, the paper introduces InfoDeepSeek, a new benchmark designed specifically for evaluating agentic information seeking in real-world, dynamic web environments. The core contributions are:
- A systematic methodology for constructing challenging queries: Questions are curated to be determinate (clear, stable answers), difficult (requiring multi-turn search and reasoning, hard for single-turn models), and diverse (covering various domains, languages, and specific attributes like multi-hop reasoning, long-tail knowledge, freshness, time sensitivity, distracting information, and false premises).
- An Agentic RAG framework: A practical system is developed that integrates multiple search and browsing tools (Google, Bing, Yahoo, DuckDuckGo, Selenium browser) for information seeking in live web environments.
- Fine-grained evaluation metrics: An evaluation framework tailored to dynamic environments is proposed, with metrics that go beyond final-answer accuracy (a computational sketch follows this list):
- Answer Accuracy (ACC): Whether the final answer based on all observations matches the ground truth.
- Information Accuracy (IA@k): Whether an answer generated from the top-k selected evidence is correct, assessing evidence quality.
- Effective Evidence Utilization (EEU): The ratio of the best achievable IA@k (across all k) to ACC, measuring how well the agent's selection stage extracts useful information from retrieved observations.
- Information Compactness (IC): Quantifies the density of the selected evidence compared to the necessary ground-truth sources, penalizing redundancy or over-retrieval.
- Empirical evaluation and insights: Extensive experiments using InfoDeepSeek reveal nuanced agent behaviors and performance bottlenecks across different LLMs, search engines, and question types, offering guidance for future research.
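To make the metric definitions above concrete, here is a minimal sketch of how ACC, IA@k, EEU, and IC might be computed from per-question evaluation records. The record fields and the fixed penalty assigned when no evidence-based answer is correct are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class QuestionRecord:
    """Hypothetical per-question evaluation record (field names are assumptions)."""
    answer_correct: bool      # final answer from all observations matches the ground truth
    topk_correct: list[bool]  # topk_correct[k-1]: answer from top-k selected evidence is correct
    num_selected: int         # n_q: size of the selected evidence set C
    num_required: int         # number of ground-truth source pages actually needed

def accuracy(records: list[QuestionRecord]) -> float:
    """ACC: fraction of questions answered correctly from all observations."""
    return sum(r.answer_correct for r in records) / len(records)

def information_accuracy(records: list[QuestionRecord], k: int) -> float:
    """IA@k: fraction of questions answered correctly from the top-k selected evidence."""
    return sum(len(r.topk_correct) >= k and r.topk_correct[k - 1] for r in records) / len(records)

def effective_evidence_utilization(records: list[QuestionRecord], max_k: int = 5) -> float:
    """EEU: best achievable IA@k over all k, divided by ACC."""
    acc = accuracy(records)
    best_ia = max(information_accuracy(records, k) for k in range(1, max_k + 1))
    return best_ia / acc if acc > 0 else 0.0

def information_compactness(records: list[QuestionRecord], penalty: float = 2.0) -> float:
    """IC: average ratio of selected-evidence size to the evidence actually needed
    (lower is better). The fixed penalty for questions whose evidence never yields
    a correct answer is an assumption for illustration."""
    scores = []
    for r in records:
        if any(r.topk_correct):
            scores.append(r.num_selected / max(r.num_required, 1))
        else:
            scores.append(penalty)
    return sum(scores) / len(scores)
```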
Agentic RAG Framework Implementation:
The framework follows the standard RAG stages: Retrieval, Augmentation, and Generation.
- Retrieval Stage: An LLM agent takes the query q, initializes a plan π_0, and performs a sequence of up to T steps. At each step t, it observes o_t, reflects on its memory h_t and current plan π_t to produce an updated plan π_{t+1}, then selects and executes an action a_{t+1} with a tool (search engine, browser, time tool, or termination) to obtain o_{t+1}. The loop yields the raw observations O = {o_1, …, o_T}. The LLM is prompted with explicit instructions for planning, reflection, memory usage, and tool specifications (prompt template below; a code sketch of the loop follows it).
```
You are a {agent_name}, {agent_bio}, {agent_instructions}
...
{tool_specification}
{current_date_and_time}
{memory}
Given Query: {query}
...
Based on the given question and existing tasks, plan a new Task (no repetitions), and you can only generate the Task in the following **JSON list** format:
...
A new Task:
```
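As a rough illustration of the retrieval loop described above, the sketch below wires planning, reflection, memory, and tool calls together. The helper functions `call_llm` and `run_tool`, the JSON keys in the LLM's reply, and the exact tool names are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical stand-ins: an LLM call that returns parsed JSON, and concrete tool backends.
def call_llm(prompt: str) -> dict: ...
def run_tool(name: str, arguments: dict) -> str: ...

TOOLS = ["google_search", "bing_search", "yahoo_search", "duckduckgo_search",
         "browse_url", "current_time", "finish"]

def retrieval_stage(query: str, max_steps: int = 10) -> list[str]:
    """Multi-turn retrieval: plan, act with a tool, observe, reflect, repeat."""
    memory: list[str] = []        # h_t: running record of actions and observations
    observations: list[str] = []  # O = {o_1, ..., o_T}
    plan = call_llm(f"Given Query: {query}\nPropose an initial search plan.")["plan"]

    for step in range(max_steps):
        # Reflect on progress, update the plan, and choose the next action.
        decision = call_llm(
            f"Given Query: {query}\nCurrent plan: {plan}\nMemory: {memory}\n"
            f"Available tools: {TOOLS}\n"
            "Reflect on progress so far, update the plan, and return the next action as JSON."
        )
        plan = decision.get("updated_plan", plan)
        if decision["tool"] == "finish":
            break
        observation = run_tool(decision["tool"], decision.get("arguments", {}))
        observations.append(observation)
        memory.append(f"step {step}: {decision['tool']} -> {observation[:200]}")
    return observations
```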
- Augmentation Stage: The agent filters and distills the noisy observations O into a concise evidence set C = SelectRelevant(q, O), selecting relevant documents or passages and ranking them. The set C has a dynamic size n_q up to a maximum n (default n = 5). This stage is crucial for noise reduction (prompt template below; a code sketch follows it).
```
You are a {agent_name}, {agent_bio}, {agent_instructions}
The current stage is webpage ranking stage. In the previous interactions, you have already found several webpages in response to the user's query. Now, you need to consolidate this information and select the {max_webpage_num} most relevant webpages, then rank them.
...
Given Query: {query}
You must generate the list of webpages strictly in the following **JSON list** format:
...
Relevant webpages (ranked by importance):
```
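A minimal sketch of the augmentation stage under the same assumptions, reusing the hypothetical `call_llm` helper from the retrieval sketch: the agent is asked to pick and rank at most n webpages from the raw observations.

```python
def select_relevant(query: str, observations: list[str], n: int = 5) -> list[str]:
    """Augmentation stage: distill noisy observations into at most n ranked evidence items."""
    prompt = (
        f"Given Query: {query}\n"
        "Webpages found so far:\n" + "\n---\n".join(observations) + "\n"
        f"Select the {n} most relevant webpages and rank them by importance. "
        "Return strictly a JSON list, most important first."
    )
    ranked = call_llm(prompt).get("webpages", [])  # assumed JSON key
    return ranked[:n]  # C: dynamic size n_q <= n
```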
- Generation Stage: The agent produces the final answer ŷ_q = Generate(C, q) from the selected evidence C and the original query q (prompt template below; an end-to-end sketch follows it).
```
You are {agent_name}, {agent_bio}, {agent_instructions}
Currently, you are in the question-answering stage. Based on your own knowledge and relevant webpages, answer the given query from the user.
...
Given query: {query}
Relevant webpages: {webpages}
Generate a brief English answer to solve the user's query:
```
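Putting the three stages together, an end-to-end pipeline might look roughly like the sketch below, which continues the hypothetical helpers (`call_llm`, `retrieval_stage`, `select_relevant`) from the earlier sketches.

```python
def generate_answer(query: str, evidence: list[str]) -> str:
    """Generation stage: answer the query from the selected evidence C."""
    prompt = (
        f"Given query: {query}\n"
        f"Relevant webpages: {evidence}\n"
        "Generate a brief English answer to solve the user's query."
    )
    return call_llm(prompt)["answer"]  # assumed JSON key

def agentic_rag(query: str, max_steps: int = 10, n: int = 5) -> str:
    observations = retrieval_stage(query, max_steps=max_steps)  # Retrieval
    evidence = select_relevant(query, observations, n=n)        # Augmentation
    return generate_answer(query, evidence)                     # Generation
```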
Dataset Construction Methodology:
The 245 questions are manually curated through a multi-stage process involving seven annotators:
- Fact-Grounded Query Drafting: Starting from credible web sources (Wikipedia, news, academic sites), annotators identify "anchor knowledge," often long-tail or obscure facts, and formulate questions. A reverse construction approach (start from answer, form question) ensures verifiability. Temporal stability is checked and constrained with time anchors if needed.
- Expand from Anchor Knowledge: Complexity is increased by combining anchors with other facts (common or difficult) through multi-hop composition and linking to other difficulty attributes. This prevents single-turn search solutions.
- Diversification: Annotators proactively seek out questions covering underrepresented attributes, domains (14 covered, e.g., politics, science, arts, and news), and predominant languages (19 covered, including English, Chinese, Japanese, and Spanish).
- Determinacy and Difficulty Filtering: Drafts are checked against multiple sources for correctness and uniqueness. A key filter tests each question with GPT-4o and DeepSeek-R1 using single-turn web search; questions solvable by both are discarded (see the sketch after this list).
- Multi-Stage Validation: Each question is independently reviewed by two annotators for content, criteria, and metadata. A third adjudicator resolves disagreements, ensuring data quality and adherence to criteria. Annotations include ground-truth answers, source URLs, and metadata.
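The determinacy-and-difficulty filter in particular is straightforward to picture in code. The sketch below assumes hypothetical helpers for a single-turn, search-augmented answer from each model and for answer checking; it is not the authors' tooling.

```python
def single_turn_answer(model: str, question: str) -> str:
    """Hypothetical: one web search plus one generation pass with the given model."""
    ...

def is_correct(answer: str, ground_truth: str) -> bool:
    """Hypothetical correctness check (human or LLM judge)."""
    ...

def difficulty_filter(drafts: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only (question, ground_truth) drafts that at least one single-turn baseline fails."""
    kept = []
    for question, ground_truth in drafts:
        solved_by_all = all(
            is_correct(single_turn_answer(model, question), ground_truth)
            for model in ("gpt-4o", "deepseek-r1")
        )
        if not solved_by_all:
            kept.append((question, ground_truth))
    return kept
```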
Evaluation Framework:
Answer correctness is evaluated using both human annotators and LLM-based automatic evaluators (DeepSeek-V3, Gemini-2.0-Flash, GPT-4o-mini as arbiter). To handle false premise questions, where the query contains an incorrect assumption, ground-truth answers explicitly state the false premise, and specialized evaluation prompts are used for LLMs, significantly improving auto-evaluation accuracy (from 95.57% to 99.29%).
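A minimal sketch of this judging setup is shown below. The `ask` helper, the prompt wording, and the yes/no parsing are assumptions for illustration; only the judge/arbiter roles and the false-premise handling follow the description above.

```python
def ask(model: str, prompt: str) -> str:
    """Hypothetical: send a prompt to the named model and return its text reply."""
    ...

def judge(model: str, question: str, prediction: str, ground_truth: str,
          false_premise: bool) -> bool:
    """Single LLM judge; false-premise questions get a specialized instruction."""
    instruction = (
        "The question contains a false premise; count the prediction as correct only "
        "if it identifies that premise as false."
        if false_premise else
        "Judge whether the prediction conveys the same answer as the ground truth."
    )
    reply = ask(model, f"{instruction}\nQuestion: {question}\n"
                       f"Prediction: {prediction}\nGround truth: {ground_truth}\n"
                       "Reply with yes or no.")
    return reply.strip().lower().startswith("yes")

def auto_evaluate(question: str, prediction: str, ground_truth: str,
                  false_premise: bool = False) -> bool:
    """Two primary judges; the arbiter is consulted only when they disagree."""
    v1 = judge("deepseek-v3", question, prediction, ground_truth, false_premise)
    v2 = judge("gemini-2.0-flash", question, prediction, ground_truth, false_premise)
    return v1 if v1 == v2 else judge("gpt-4o-mini", question, prediction, ground_truth, false_premise)
```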
Benchmarking Results & Practical Implications:
Experiments using InfoDeepSeek yield several key findings:
- LLM Performance: Even state-of-the-art LLMs struggle on InfoDeepSeek (best ACC 22.45%). Models optimized for reasoning and search (Gemini-2.5-Pro, DeepSeek-R1, Gemini-2.5-Flash) perform better, suggesting the importance of these capabilities for agentic tasks. Most models show EEU below 1, indicating difficulty extracting useful evidence. High IC scores point to redundant retrieval.
- Search Engine Impact: The choice of search engine significantly impacts performance. Google and Yahoo consistently outperform Bing and DuckDuckGo. Better search quality can partially compensate for weaker LLMs. General-purpose search engines are better starting points for agentic RAG.
- Question Attributes: Agents perform better on simpler attributes (false premise, time sensitivity, freshness) and worse on multi-hop, long-tail, and distracting information questions. Improvements from stronger LLMs are less pronounced on harder attributes, highlighting bottlenecks in retrieval and noise handling.
- Test-time Scaling: Performance (ACC, IA@k, IC) improves as the maximum number of retrieval steps (T) increases, confirming that agentic systems can benefit from more computational resources for exploration.
- Retrieval Interference: A significant failure mode in which an LLM answers a question correctly from its internal knowledge but gets it wrong after retrieving external information (interference rates reach up to 80%). This suggests that noisy or tangentially relevant web content can degrade performance. Mitigation strategies are needed, such as strengthening models' confidence in their internal knowledge, better evidence filtering, and knowledge-consistency checks.
- Language Impact: English queries generally outperform Chinese. Using a language-aware prompt instructing the agent to search in the query's predominant language yields the best results, especially benefiting LLMs with weaker inherent multilingual capabilities. This highlights the need for language-aware retrieval strategies in a multilingual web.
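As a rough illustration of the language-aware strategy described in the last point, the snippet below shows one way such an instruction could be appended to the retrieval prompt; the wording is hypothetical, not the paper's template.

```python
LANGUAGE_AWARE_INSTRUCTION = (
    "Identify the predominant language of the query and of the entities it mentions, "
    "and issue your search queries in that language. Translate the retrieved evidence "
    "as needed when composing the final answer."
)

def with_language_awareness(retrieval_prompt: str) -> str:
    """Append the language-aware instruction to the agent's retrieval prompt."""
    return retrieval_prompt + "\n" + LANGUAGE_AWARE_INSTRUCTION
```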
Implementation Considerations:
Deploying Agentic RAG systems based on these findings requires:
- Selecting capable base LLMs with strong reasoning and generation abilities.
- Integrating high-quality, general-purpose search tools.
- Developing robust evidence filtering and distillation mechanisms to handle noisy web content.
- Implementing strategies to mitigate retrieval interference.
- Incorporating language awareness to leverage multilingual web resources effectively.
- Designing flexible control flow (planning, reflection) to enable multi-turn interaction and adaptation.
- Considering computational budgets (T) and their impact on performance vs. cost.
- Using LLM-based evaluation, carefully designed to handle complex cases like false premises.
Limitations:
The dataset construction is currently manual, which is costly and time-consuming. Future work aims to automate this process with manual verification.
Broader Impacts:
The research has the potential to improve factual accuracy and reliability in AI applications, benefiting fields like healthcare and research. However, it also highlights risks like misinformation amplification and potential misuse if systems retrieve or generate biased/misleading content, emphasizing the need for careful evaluation and mitigation strategies.