
R1-Searcher Framework

Updated 1 July 2025
  • R1-Searcher is a family of reinforcement learning frameworks enabling large language models (LLMs) to autonomously search for and integrate external knowledge for improved reasoning.
  • It employs a two-stage outcome-based RL training protocol to teach LLMs to invoke search and incorporate retrieved information without needing supervised fine-tuning.
  • R1-Searcher models demonstrate strong empirical results, achieving significant accuracy gains (e.g., a 48% relative improvement on HotpotQA) and showing better generalization and reduced hallucination on out-of-domain tasks.

R1-Searcher refers to a family of reinforcement learning–based frameworks designed to improve the autonomous search capability of LLMs, with a focus on retrieval-augmented reasoning and integration of external knowledge sources during problem solving. The central goal is to incentivize LLMs not only to recognize the need for external information, but also to invoke search operations, use external evidence in step-by-step reasoning, and deliver accurate, up-to-date answers—especially on knowledge-intensive or time-sensitive queries. R1-Searcher models are trained entirely via outcome-based reinforcement learning (RL), eschewing traditional supervised fine-tuning or process-level supervision, and demonstrate strong out-of-domain generalization and reductions in hallucinations relative to prior retrieval-augmented methods.

1. Two-Stage Outcome-Based Reinforcement Learning Framework

The R1-Searcher approach is characterized by a two-stage RL protocol that enables search invocation and reasoning integration without the need for process rewards or distillation.

  • Stage 1: The LLM is trained to master the invocation of an external search API. The reward function is strictly format-based, providing a positive signal if the model emits a correctly formatted search query (e.g., with explicit tags such as <|begin_of_query|>...<|end_of_query|>), regardless of whether the answer is correct. This stage ensures that the LLM robustly internalizes the act of tool-calling as part of its generation repertoire.
  • Stage 2: The model is incentivized to use search system responses during reasoning to generate correct answers. The reward function is a composite of (i) answer correctness (measured, e.g., by F1 score with respect to gold answers), and (ii) format compliance. The model is expected both to invoke search when appropriate and integrate retrieved knowledge into its final answer.

Training does not involve supervised traces or process-level interventions; initial behavior is shaped entirely by RL signals on final outcome and format. This exclusively outcome-based design enables cold-start training and supports both base and instruction-tuned foundation models.
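
At a high level, the two stages share the same sampling and policy-update machinery and differ only in how a finished rollout is scored. The sketch below is a minimal illustration of that structure, assuming hypothetical callables `sample_rollout`, `policy_update`, and per-stage reward functions; none of these names come from the released R1-Searcher code.

    # Minimal sketch of the two-stage outcome-based RL protocol.
    # All callables (sample_rollout, policy_update, reward_stage1, reward_stage2)
    # are hypothetical stand-ins, not the released implementation.

    def train_two_stage(policy, sample_rollout, policy_update,
                        stage1_data, stage2_data,
                        reward_stage1, reward_stage2):
        # Stage 1: learn to emit well-formed search calls; reward is format/retrieval only.
        for question in stage1_data:
            traj = sample_rollout(policy, question)            # may include search calls
            policy_update(policy, traj, reward_stage1(traj))   # no answer-correctness term

        # Stage 2: learn to answer correctly with retrieved evidence;
        # reward combines answer F1 with a format-compliance term.
        for question in stage2_data:
            traj = sample_rollout(policy, question)
            policy_update(policy, traj, reward_stage2(traj, question["gold_answer"]))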

2. Mechanism for Autonomous Search Invocation and Reasoning

R1-Searcher models are trained to recognize uncertainty or knowledge gaps and to autonomously trigger search events mid-reasoning. This is operationalized as follows:

  • Search Invocation: The LLM generates a special markup block (e.g., <|begin_of_query|> search string <|end_of_query|>) indicating a search request. Search execution is implemented externally: when the end tag is produced, the query is dispatched to a retrieval engine and the resulting documents are injected back into the dialogue (e.g., within <|begin_of_documents|> ... <|end_of_documents|> tags).
  • Alternating Reasoning and Retrieval: In a full rollout, the model alternates between generation, search invocation, and processing retrieved content. Retrieved tokens are inserted into the generation process as new context following each query.
  • Retrieval-Masked Loss: Critically, only tokens generated by the LLM itself (not those copied from the retrieved text) receive gradient updates during RL. This prevents reward hacking by copying retrieved content verbatim, ensuring that the model is rewarded for its own reasoning behavior.
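
The retrieval-masked loss in the last bullet can be realized as a per-token mask applied before the policy-gradient loss is averaged, so that only model-generated tokens carry gradient. The snippet below is one plausible PyTorch implementation, not the authors' code; the tensor shapes and per-token advantage broadcasting are assumptions.

    import torch

    def retrieval_masked_pg_loss(logprobs: torch.Tensor,
                                 advantages: torch.Tensor,
                                 retrieved_mask: torch.Tensor) -> torch.Tensor:
        """REINFORCE-style loss with retrieved tokens excluded from the gradient.

        logprobs:       (batch, seq) log-probs of the sampled tokens under the policy
        advantages:     (batch, seq) per-token advantages (outcome reward broadcast back)
        retrieved_mask: (batch, seq) 1.0 where a token was injected from retrieval,
                        0.0 where the model generated it itself
        """
        own_tokens = 1.0 - retrieved_mask                  # keep only self-generated tokens
        loss = -(logprobs * advantages * own_tokens).sum()
        return loss / own_tokens.sum().clamp(min=1.0)      # normalize over unmasked tokens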

The prompt format explicitly separates reasoning steps (<think>), answers (<answer>), queries, and retrieved information, supporting fine-grained control and transparency in deployment.
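
In deployment, this tag protocol can be driven by a simple outer loop that stops generation at the end-of-query tag, calls the retriever, and appends the documents before resuming. The following sketch assumes hypothetical `generate_until` and `retrieve` callables; only the tag strings come from the paper's format.

    # Simplified rollout driver that alternates generation with retrieval.
    # `generate_until(context, stop)` is assumed to return new text up to and including
    # `stop` (or up to end-of-sequence); `retrieve(query)` returns a list of doc strings.

    QUERY_START = "<|begin_of_query|>"
    QUERY_END = "<|end_of_query|>"
    DOCS_BLOCK = "<|begin_of_documents|>\n{docs}\n<|end_of_documents|>\n"

    def run_with_search(generate_until, retrieve, prompt: str, max_searches: int = 5) -> str:
        context = prompt
        for _ in range(max_searches):
            chunk = generate_until(context, stop=QUERY_END)
            context += chunk
            if not chunk.endswith(QUERY_END):
                break                                      # model finished with its answer
            # Extract the search string between the query tags and dispatch it.
            query = chunk.rsplit(QUERY_START, 1)[-1].removesuffix(QUERY_END).strip()
            context += DOCS_BLOCK.format(docs="\n".join(retrieve(query)))
        return context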

3. Empirical Results and Comparison to Predecessor Methods

R1-Searcher demonstrates compelling improvements over competitive retrieval-augmented generation (RAG) and RL-based methods.

  • On multi-hop QA datasets (HotpotQA, 2WikiMultiHopQA, Bamboogle, Musique), R1-Searcher (Qwen-2.5-7B-Base) achieved LLM-as-Judge accuracies of 0.750 on HotpotQA and 0.650 on 2WikiMultiHopQA, outperforming the prior state of the art (e.g., ReARTeR with GPT-4o-mini at 0.506 and 0.534, respectively), corresponding to relative improvements of 48% and 21%.
  • On knowledge-intensive benchmarks outside the training domain (e.g., Bamboogle and Musique, and online search settings using the Google API), R1-Searcher maintained high accuracy and outperformed much larger models (including Search-o1-32B) by up to 11.4% on Bamboogle.
  • Experimental comparisons indicate that R1-Searcher is more efficient and generalizable than both supervised fine-tuned models and strong test-time search planners (e.g., Monte Carlo tree search), and it supports both Qwen-2.5-7B-Base and Llama-3.1-8B-Instruct backbones.

A key finding is that base models (not instruction-tuned) outperformed their instruction-tuned variants under R1-Searcher RL training, suggesting less overfitting to static knowledge and greater adaptability to retrieval-augmented reasoning.

4. Generalization and Model Support

R1-Searcher is designed for robust generalization, both in terms of model architecture and domain distribution:

  • Out-of-Domain Generalization: Evaluations on datasets and search domains unseen during RL training demonstrate strong generalization, including real-time search environments. The framework adapts without additional SFT or retraining.
  • Model Agnosticism: The RL protocol is applicable out-of-the-box to both base LLMs and instruction-tuned variants, without modification to reward structure or training regimen.
  • Zero-Shot Performance: The framework achieves strong zero-shot performance on knowledge-intensive and multi-hop tasks without requiring large-scale supervised data.

This suggests that outcome-based RL on structured search/reasoning formats can yield models capable of broader problem-solving beyond benchmark-specific adaptation.

5. Technical Formulations and Learning Algorithms

The training of R1-Searcher employs specific RL algorithms and token-level formulations for stable optimization:

  • Reward Functions (per stage):

    • Stage 1 (Format and Retrieval), where n is the number of search invocations in the rollout:

      R_{\text{retrieval}} = \begin{cases} 0.5, & n \geq 1 \\ 0, & n = 0 \end{cases}

      R_{\text{format}} = \begin{cases} 0.5, & \text{if format correct} \\ 0, & \text{otherwise} \end{cases}

    • Stage 2 (Answer and Format):

      R'_{\text{format}} = \begin{cases} 0, & \text{if format correct} \\ -2, & \text{otherwise} \end{cases}

      \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

  • Training Scheme: Modified REINFORCE++ algorithm with RAG rollouts and retrieval-masked loss; retrieved tokens do not contribute to the gradient, preventing reward exploits.
  • Prompt Format (structured example):
    
    <think> ...chain-of-thought... </think>
    <answer> ...short answer... </answer>
    <|begin_of_query|> ...search query... <|end_of_query|>
    <|begin_of_documents|> ...retrieved docs... <|end_of_documents|>

This structure enables effective alternation between internal reasoning and explicit search actions.
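
Read as code, the stage rewards above might look like the following, assuming the number of issued queries, the format check, and the predicted answer have already been extracted from the decoded rollout (helper and argument names are illustrative):

    from collections import Counter

    def answer_f1(prediction: str, gold: str) -> float:
        """Token-level F1 between the predicted and gold answers."""
        pred_tokens, gold_tokens = prediction.split(), gold.split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def stage1_reward(num_queries: int, format_ok: bool) -> float:
        # R_retrieval (0.5 if at least one query was issued) + R_format (0.5 if well formed).
        return (0.5 if num_queries >= 1 else 0.0) + (0.5 if format_ok else 0.0)

    def stage2_reward(predicted_answer: str, gold_answer: str, format_ok: bool) -> float:
        # Answer F1 plus R'_format, which is 0 when the format is correct and -2 otherwise.
        return answer_f1(predicted_answer, gold_answer) + (0.0 if format_ok else -2.0)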

6. Avoiding Hallucination and Improving Real-World Factuality

A core advantage of the R1-Searcher RL paradigm is its explicit incentive structure for seeking external information only when internal knowledge is insufficient:

  • Minimizing Hallucination: The model learns to invoke retrieval in contexts where internal knowledge is likely unreliable, decreasing incorrect generations.
  • Factual Updates and Knowledge Gaps: Retrieval during reasoning allows the LLM to supply up-to-date answers on current events, obscure facts, or private/unseen domains, as opposed to relying solely on its training data.

Empirical evidence shows that R1-Searcher models will use search adaptively (e.g., when the model's uncertainty is high), whereas supervised and instruction-tuned LLMs tend to make unfounded guesses.

7. Practical Implications and Extensions

The R1-Searcher framework opens pathways for more robust, agentic, and tool-using LLM systems:

  • Retrieval-Augmented Agents: LLMs equipped via R1-Searcher can serve as automated research assistants, enterprise QA bots, and domain-expert conversational tools capable of real-time fact retrieval.
  • Data-Efficient and Adaptable Reasoning: By unifying search and reasoning in a token-level RL training loop, these models reduce dependency on large supervised datasets or search traces for each application domain.
  • Foundation for Further Research: R1-Searcher's approach—incentivizing search invocation as part of reasoning, not just retrieval as a post-hoc fix—can serve as the basis for multi-modal, multi-tool LLM agents, and inspire further methods in RL for knowledge-intensive LLM applications.

| Aspect | R1-Searcher Result or Principle |
| --- | --- |
| Training Approach | Two-stage, outcome-based RL: search invocation, then reasoning + retrieval |
| Mechanism | Autonomous, tag-based search triggering; retrieval-masked loss |
| Empirical Gains | +48% relative (HotpotQA) vs. best prior; strong out-of-domain performance |
| Hallucination Handling | RL incentives for search under uncertainty; reduction in unsupported answers |
| Model Generality | Supports both base and instruct variants; adapts to online sources |
| Deployment Implications | Up-to-date QA, multi-hop reasoning, agentic search in real time |

In summary, R1-Searcher establishes a new reference point for retrieval-augmented LLMs, using RL to yield models that deliberate over their information needs and autonomously seek, incorporate, and reason with external evidence during generation, resulting in improved factuality, generalization, and efficiency over previous paradigms.