DeepResearcher addresses the limitations of current methods for developing LLM-based deep research agents. Existing approaches typically fall into two categories: prompt engineering-based methods, which rely on manually crafted prompts and often exhibit brittle performance, and RAG-based RL methods, which train agents within controlled environments using static document corpora. RAG-based approaches, while enabling RL training, fail to capture the complexities and dynamics of real-world information retrieval, where agents must navigate the noisy, unstructured, and constantly evolving open web. DeepResearcher posits that interaction with the authentic web environment during training is crucial for developing robust and generalizable research capabilities.
Framework Overview and Methodology
DeepResearcher introduces a framework for the end-to-end training of deep research agents using Reinforcement Learning (RL) directly within real-world web environments. The core idea is to train an LLM agent (using Qwen2.5-7B-Instruct as the backbone) to perform iterative reasoning and tool use (web search and web browsing) to answer open-domain questions. Unlike RAG-based systems that operate on fixed local corpora, DeepResearcher agents interact with live search engine APIs (e.g., Google Search API, Serper) and crawl actual webpages, learning to handle the inherent challenges of the open web.
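For concreteness, a live search call might look roughly like the sketch below. This is a minimal illustration assuming the Serper HTTP API mentioned above; the endpoint, header, and response fields are assumptions based on Serper's public interface, not code from the DeepResearcher release.

```python
# Hypothetical sketch of a live web search call against the Serper API.
# Endpoint, header, and response fields are assumptions about Serper's public
# HTTP interface, not code taken from the DeepResearcher codebase.
import os
import requests

def live_search(query: str, top_k: int = 10) -> list[dict]:
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query, "num": top_k},
        timeout=10,
    )
    resp.raise_for_status()
    # Each organic hit carries a title, URL, and snippet, mirroring the
    # structured observation the agent receives after a web_search call.
    return [
        {"title": hit.get("title"), "url": hit.get("link"), "snippet": hit.get("snippet")}
        for hit in resp.json().get("organic", [])[:top_k]
    ]
```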
The training employs Group Relative Policy Optimization (GRPO), an RL algorithm suitable for LLM fine-tuning. The agent follows an iterative reasoning loop: it generates internal thought processes, selects a tool (`web_search` or `web_browse`), receives observations (search snippets or browsed content), and continues this cycle until it decides to produce a final answer inside `<answer>` tags. The reward function is based solely on the outcome: the word-level F1 score between the generated answer and the ground truth, with a penalty (-1) for incorrectly formatted output. This outcome-based reward allows the agent to learn complex strategies for searching, browsing, filtering, and synthesizing information without explicit supervision on intermediate steps. During backpropagation, observations from tool use are masked so that learning focuses on the agent's reasoning and decision-making policies.
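As a concrete illustration, a minimal sketch of this outcome-only reward and of GRPO's group-relative advantage normalization might look as follows. Tokenization, answer extraction, and normalization details are assumptions, not specifics from the paper.

```python
# Minimal sketch of the outcome-only reward: word-level F1 between the
# extracted answer and the ground truth, with a -1 penalty when no
# well-formed <answer> block is produced. Details are assumptions.
from collections import Counter

def word_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if not pred_tokens or not ref_tokens or overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def outcome_reward(extracted_answer: str | None, reference: str) -> float:
    if extracted_answer is None:       # malformed output: no parsable <answer> block
        return -1.0
    return word_f1(extracted_answer, reference)

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    # GRPO computes advantages relative to the group of rollouts sampled for
    # the same question: reward minus group mean, scaled by group std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```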
Architecture and Implementation Details
The DeepResearcher architecture comprises several key components:
- Backbone LLM: Qwen2.5-7B-Instruct serves as the base model for the agent.
- RL Algorithm: GRPO is utilized for policy optimization, chosen for its effectiveness in LLM training contexts.
- Tool Integration:
  - `web_search`: Interacts with external search APIs via JSON requests, returning structured results (title, URL, snippet) for the top-k retrieved pages.
  - `web_browse`: Implemented as a multi-agent system (see the sketch following this list). When the main agent invokes `web_browse` with a URL, a dedicated browsing/reading agent takes control. This sub-agent sequentially processes webpage content segments, maintaining short-term memory specific to the query. It decides whether to continue reading or stop based on the relevance of the content to the query and memory. Upon stopping, it returns only the newly extracted relevant information to the main agent, efficiently handling long or irrelevant pages.
- Training Infrastructure: Leveraging the `verl` framework, training involves significant parallel processing to handle the high volume of tool interactions during RL rollouts. A distributed cluster of 50 CPU nodes manages the concurrent search and browse requests generated by GRPO sampling. Robust retry mechanisms and caching (a 7-day TTL for identical search queries) mitigate API rate limits, reduce costs, and handle transient web crawling failures.
- Data Quality: A two-stage filtering process is applied to the training data. First, low-quality questions (time-sensitive, subjective, or harmful) are removed. Second, contamination detection tests the base model's ability to answer each question without search (a pass@10 check); questions it answers correctly are excluded so that the agent learns research skills rather than relying on memorized knowledge.
Quantitative Performance Analysis
DeepResearcher was evaluated on a range of open-domain question-answering benchmarks, including NQ, TQ, HotpotQA, 2Wiki (in-domain), MuSiQue, Bamboogle, and PopQA (out-of-domain), using F1 score and Model-Based Evaluation (MBE) with GPT-4o-mini.
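Model-based evaluation of this kind is typically a short judge call. The sketch below assumes the official OpenAI Python client and a simple yes/no judging prompt; neither the prompt nor the parsing logic is taken from the paper.

```python
# Hypothetical sketch of a model-based evaluation (MBE) call with GPT-4o-mini
# as the judge. The judging prompt and parsing are assumptions, not the
# evaluation protocol used in the paper.
from openai import OpenAI

client = OpenAI()

def mbe_judge(question: str, prediction: str, reference: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Does the model answer convey the same meaning as the reference? Reply yes or no."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```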
- Comparison with Prompt Engineering: DeepResearcher demonstrated substantial improvements, achieving gains of up to 28.9 MBE points over prompt engineering baselines (e.g., Search-o1 + Web Search). On the Bamboogle dataset, DeepResearcher achieved an MBE score of 72.8, compared to 53.6 for the baseline.
- Comparison with RAG-based RL: The framework outperformed RAG-based RL agents (e.g., Search-r1-base, R1-Searcher) by up to 7.2 MBE points, even when RAG agents were granted web access during inference. On Bamboogle, DeepResearcher (72.8 MBE) significantly surpassed Search-r1-base (57.6 MBE) and R1-Searcher (65.6 MBE).
- In-Domain Performance: DeepResearcher outperformed or performed comparably to baselines across NQ, TQ, HotpotQA, and 2Wiki. Notably, it achieved strong results on TQ and 2Wiki and matched the performance of Search-r1-base on NQ/HotpotQA despite the latter operating in a simpler, controlled RAG environment.
- Out-of-Domain Generalization: The framework consistently outperformed all baselines on MuSiQue, Bamboogle, and PopQA, demonstrating superior generalization attributed to training in the real-world web environment. Its strong performance on Bamboogle, which requires accessing information beyond Wikipedia, particularly highlights the benefits over RAG-based methods trained on limited corpora.
Emergent Cognitive Behaviors
Qualitative analysis revealed that the end-to-end RL training in the authentic web environment fostered the emergence of several complex cognitive behaviors without explicit programming:
- Planning: Agents developed the ability to formulate multi-step research plans for complex questions and dynamically adjust these plans based on intermediate findings.
- Cross-Validation: Agents frequently performed additional searches or browsing actions even after finding a plausible answer, seeking corroborating evidence from multiple sources before finalizing their response.
- Reflection: Agents demonstrated the capacity to recognize when retrieved information was insufficient or misaligned with the query, allowing them to refine search strategies and avoid unproductive paths.
- Honesty: Agents learned to acknowledge when a definitive answer could not be found within the available information, declining to answer rather than generating potentially inaccurate or hallucinated responses.
Implications and Conclusion
The results presented in the DeepResearcher paper strongly suggest that training LLM-based research agents via RL directly within real-world web environments is not merely an implementation choice but potentially a fundamental requirement for building robust, generalizable, and practically applicable agents. The limitations of simulated RAG environments in capturing the complexity of the open web are highlighted by the performance gap between DeepResearcher and RAG-based RL baselines, especially on out-of-domain tasks requiring diverse information sources.
The work demonstrates the viability of scaling RL for complex, multi-step reasoning and tool-use tasks involving real-world interaction. The emergence of sophisticated behaviors like cross-validation and honesty underscores the potential of end-to-end RL to instill desirable cognitive traits. However, it also points to the need for more nuanced evaluation metrics and reward functions that can assess not just answer accuracy but also the reliability and reasoning process of the agent, particularly for long-form answers and situations requiring acknowledged uncertainty.
In conclusion, DeepResearcher presents a significant step towards building more capable AI research assistants by emphasizing the necessity of training within the target operational environment: the real-world web. Its architecture, training methodology, and demonstrated performance provide a strong foundation and an open-source framework for future research in scaling RL for complex, interactive AI systems. Future directions may involve developing adaptive tool parameters and more sophisticated reward modeling.