- The paper introduces R1-Searcher, an innovative reinforcement learning framework that enables LLMs to effectively use external search systems without distillation or cold starts.
- R1-Searcher employs a two-stage outcome-supervised RL process, first optimizing retrieval invocation, then refining the use of results for accurate problem-solving.
- Experimental results show R1-Searcher significantly improves performance on multi-hop QA benchmarks, outperforming strong RAG baselines, including closed-source GPT-4o-mini, by up to 48.22% on HotpotQA.
R1-Searcher: Incentivizing Search Capabilities in LLMs via Reinforcement Learning
The paper introduces R1-Searcher, an innovative reinforcement learning (RL) framework aimed at enhancing the search capabilities of LLMs. By allowing LLMs to autonomously invoke external search systems during reasoning, R1-Searcher addresses a key limitation of conventional Large Reasoning Models (LRMs): they rely primarily on internal knowledge, which often leads to inaccuracies on time-sensitive or knowledge-intensive questions. The distinctive feature of the framework is its exclusive reliance on outcome-based RL, eschewing the distillation or cold-start supervised data typically used when applying RL to LLMs.
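A minimal sketch of how such autonomous retrieval can be interleaved with generation is shown below. The special tag strings and both callables are illustrative assumptions, not the paper's exact interface.

```python
# A minimal sketch of the agentic rollout: the model pauses generation to emit a
# search query, the external retriever is called, and the retrieved documents are
# appended so reasoning can continue. Tag strings and callables are assumptions.
from typing import Callable, List

QUERY_START, QUERY_END = "<|begin_of_query|>", "<|end_of_query|>"
DOCS_START, DOCS_END = "<|begin_of_documents|>", "<|end_of_documents|>"
EOS = "<|endoftext|>"

def rollout(
    generate_until: Callable[[str, List[str]], str],  # generate text until a stop string
    search: Callable[[str], List[str]],               # external search system
    prompt: str,
    max_turns: int = 8,
) -> str:
    context = prompt
    for _ in range(max_turns):
        # Generate until the model either closes a search query or finishes its answer.
        chunk = generate_until(context, [QUERY_END, EOS])
        context += chunk
        if QUERY_START in chunk:
            # Extract the query and invoke the external search system.
            query = chunk.split(QUERY_START)[-1].strip()
            docs = search(query)
            # Insert the retrieved passages so the model can reason over them.
            context += f"{QUERY_END}\n{DOCS_START}\n" + "\n".join(docs) + f"\n{DOCS_END}\n"
        else:
            break  # No retrieval requested; the trajectory is complete.
    return context
```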
The methodology in R1-Searcher involves a two-stage outcome-supervised RL process. The first stage optimizes the model's ability to perform retrieval operations independently of final answer correctness, employing a retrieve-reward that incentivizes the model to conform to the correct retrieval-invocation format. In the second stage, an answer reward is introduced, guiding the model towards effective use of the external retrieval system for accurate problem-solving.
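The two-stage reward shaping can be pictured as follows; the constants, the token-level F1 helper, and the function signatures are assumptions for illustration, not the paper's exact reward definitions.

```python
# Hedged sketch of the two-stage outcome rewards. Constants and signatures are
# illustrative assumptions, not the paper's exact reward definitions.

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def stage1_reward(num_retrievals: int, format_ok: bool) -> float:
    # Stage 1: reward invoking retrieval in the correct format,
    # independent of whether the final answer is right.
    retrieve_reward = 0.5 if num_retrievals >= 1 else 0.0
    format_reward = 0.5 if format_ok else 0.0
    return retrieve_reward + format_reward

def stage2_reward(prediction: str, gold: str, format_ok: bool) -> float:
    # Stage 2: add an answer reward so the model learns to use the
    # retrieved evidence to actually solve the problem.
    answer_reward = token_f1(prediction, gold)
    return answer_reward if format_ok else answer_reward - 2.0
```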
The experimental validation of R1-Searcher uses four multi-hop QA benchmarks, with results demonstrating significant improvements over existing methods. The method outperforms strong RAG baselines, including those built on the closed-source GPT-4o-mini, with gains of up to 48.22% on HotpotQA and 21.72% on 2WikiMultiHopQA. The analysis underscores R1-Searcher's ability to generalize across both in-domain and out-of-domain datasets.
The practical implications of this research are substantial, offering a more responsive approach to real-time information retrieval that is well suited to dynamic environments where up-to-date information matters. Theoretically, this work pushes the boundary of how RL can be leveraged to enhance complex reasoning capabilities in LLMs, with the potential to extend these capabilities across broader AI applications.
Looking ahead, further developments could include more sophisticated data curriculum designs and scaling to larger models, such as 32B, to more extensively test the robustness and applicability of R1-Searcher. Additionally, exploring the impact of different RL algorithms and reward structures could provide deeper insights and potentially unlock further performance gains in search-augmented LLMs.