An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents (2505.15117v1)

Published 21 May 2025 in cs.CL, cs.AI, and cs.IR

Abstract: Reinforcement learning (RL) has demonstrated strong potential in training LLMs capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.

Summary

  • The paper shows that incorporating format rewards boosts RL training performance and accelerates convergence for LLM agents using general-purpose models.
  • It finds that general-purpose LLM backbones enable more stable search interactions compared to reasoning-specialized models.
  • The study reveals that high-quality search engines are essential for achieving stable and efficient agent performance during training and inference.

This paper, "An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents" (2505.15117), explores the use of Reinforcement Learning (RL) to train LLMs to act as agents that can interleave reasoning steps with calls to external search engines. This approach aims to overcome limitations of purely prompt-based or Supervised Fine-Tuning (SFT) methods for building such agents, particularly the need for costly, manually annotated intermediate trajectories in SFT.

The core concept involves training an LLM policy (π_θ) to interact with a search engine (R) to solve problems. The agent follows a loop: reasoning → search (query formulation) → context (retrieved information) → reasoning → ... → final answer (a minimal sketch of this loop appears after the list below). Training uses an RL objective that maximizes a reward based on the final outcome. The paper conducts empirical studies to understand how three key factors influence this RL training process:

  1. Reward Formulation: How different types of rewards beyond just the final outcome affect training.
  2. Underlying LLM Backbone: The impact of the base LLM's characteristics (type and scale).
  3. Role of the Search Engine: How the search engine used during training and inference influences the agent.
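
To make the interaction loop concrete, the following is a minimal sketch of a reasoning-search interleaved rollout under the tag-based format described in the paper. The generate and search callables and the extract_between helper are hypothetical stand-ins, not code from the Search-R1 repository.

```python
import re

def extract_between(text: str, tag: str) -> str | None:
    """Return the content of the last <tag>...</tag> span in text, if any."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def rollout(question: str, generate, search, max_turns: int = 4) -> str:
    """Interleave LLM reasoning with search-engine calls until an <answer> is produced."""
    trajectory = question
    for _ in range(max_turns):
        # The policy emits reasoning plus either <search>query</search> or
        # <answer>final answer</answer>; generation is assumed to stop there.
        step = generate(trajectory)
        trajectory += step
        answer = extract_between(step, "answer")
        if answer is not None:
            return answer
        query = extract_between(step, "search")
        if query is None:
            break  # malformed output; the format reward is meant to discourage this
        # Retrieved passages are fed back to the policy inside <information> tags.
        trajectory += f"\n<information>{search(query)}</information>\n"
    return ""  # no final answer within the turn budget
```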

Key Findings and Practical Implications

The paper provides several actionable insights for practitioners implementing RL-based search agents:

1. Reward Formulation:

  • Format Reward: The paper investigates adding a reward, weighted by λ_f, for adhering to the required interaction format (e.g., wrapping output in tags such as <think>, <search>, <information>, and <answer>).
    • Finding: Incorporating a format reward significantly improves final performance, especially when training from a base LLM (vs. an instruction-tuned one), and accelerates RL convergence.
    • Implication: For agents trained from models that are less proficient at following complex instructions, explicitly rewarding correct interaction format is highly beneficial. Tuning λ_f is important: too small a value is ineffective, while too large a value can lead to overfitting to the format at the expense of the task outcome.

  • Intermediate Retrieval Reward: The paper also explores a reward component, weighted by λ_r, for retrieving documents that contain the ground-truth answer even when the final answer is incorrect, aiming to incentivize good search queries on top of the format reward.
    • Finding: Intermediate retrieval rewards did not yield consistent performance improvements and could even degrade performance.
    • Implication: The outcome reward appears sufficient to implicitly encourage effective query formulation, since retrieving relevant information is necessary for producing the correct final answer. Explicit intermediate retrieval rewards based on simple metrics like substring match may overly constrain learning.
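
The exact reward formulas used in the paper are not reproduced here; purely as an illustration, one plausible additive combination of an outcome reward with a λ_f-weighted format bonus and a λ_r-weighted retrieval bonus is sketched below. The tag list, exact-match check, and substring-based retrieval check are assumptions made for the sketch, not the paper's definitions (the paper's appendix pseudocode is the authoritative format check).

```python
import re

# Assumed interaction tags; the paper's format may differ in detail.
TAGS = ("think", "search", "information", "answer")

def format_ok(trajectory: str) -> bool:
    """Simplified structure check: every tag is balanced and an <answer> block exists."""
    for tag in TAGS:
        if len(re.findall(f"<{tag}>", trajectory)) != len(re.findall(f"</{tag}>", trajectory)):
            return False
    return re.search(r"<answer>.*?</answer>", trajectory, flags=re.DOTALL) is not None

def reward(trajectory: str, pred: str, gold: str, retrieved: list[str],
           lambda_f: float = 0.2, lambda_r: float = 0.0) -> float:
    """Outcome reward (exact match) plus a format bonus and an optional retrieval bonus."""
    outcome = float(pred.strip().lower() == gold.strip().lower())
    fmt_bonus = lambda_f * float(format_ok(trajectory))
    # Substring-match retrieval bonus; the paper found this term often unhelpful,
    # hence lambda_r defaults to 0.
    retr_bonus = lambda_r * float(any(gold.lower() in doc.lower() for doc in retrieved))
    return outcome + fmt_bonus + retr_bonus
```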

2. Underlying LLM Backbone:

  • LLM Type (General-Purpose vs. Reasoning-Specialized): The paper compares agents trained from general-purpose LLMs (Qwen2.5-Base) and from a reasoning-specialized model (DeepSeek-R1-Distill).
    • Finding: RL training was more stable and effective when initialized with general-purpose LLMs. Reasoning-specialized LLMs struggled more in the early stages, particularly in learning to initiate search calls.
    • Implication: General-purpose LLMs appear to possess sufficient inherent capabilities (including the instruction following needed for tool use) to serve as effective starting points for RL-based agent training, potentially outperforming models specifically fine-tuned for reasoning on tasks requiring external interaction. Instruction-following ability seems crucial for learning to use tools via RL.

  • LLM Scale: Experiments cover Qwen2.5 models at 3B, 7B, 14B, and 32B parameters.
    • Finding: Performance generally improves with model scale, but with diminishing returns on more challenging datasets.
    • Implication: While larger models generally make better agents, the gain from scaling up is less pronounced than in tasks relying purely on parametric knowledge; effective external retrieval reduces the sole reliance on model size.

3. Role of the Search Engine:

  • Training with Different Search Engines: The paper trains agents with search engines of varying quality: random retrieval, BM25 (sparse), and E5 (dense, with HNSW or exact search).
    • Finding: The quality of the search engine during training strongly shapes learning dynamics. Stronger engines lead to more stable training and higher final performance; weaker engines yield suboptimal behavior, with the agent either avoiding retrieval (random) or making excessive calls to compensate for poor results (BM25).
    • Implication: Using a high-quality search engine during RL training is crucial for developing capable and efficient search agents.

  • Inference with Different Search Engines: Agents trained with one search engine are evaluated with various engines at inference time, including the Google Search API.
    • Finding: LLM search agents trained with a specific engine generalize well to different engines at inference time. Critically, using a stronger search engine during inference consistently improves downstream performance, regardless of the training engine.
    • Implication: For practical deployment, it is beneficial to use the best available search engine at inference time, even if training constraints required a different, potentially weaker, engine; the learned policy can effectively leverage higher-quality retrieved information (see the evaluation sketch at the end of this summary).

Additional Insights from the Appendices:

  • Long-form QA: Rule-based outcome rewards combined with format rewards are also effective for training agents on long-form, open-ended QA tasks (such as ASQA and ELI5), achieving performance competitive with RAG or purely reasoning-based models.
  • Data Scaling: While performance generally improves with training-data size, learning stable search behavior requires a minimum amount of data (e.g., 100-1000 samples in the paper's setting). Extremely small datasets can lead to overfitting and a failure to learn agentic behaviors.

Implementation Considerations:

  • Implementing the format reward requires careful parsing of the LLM's output sequence to check for correct tags and structure; the pseudocode in the paper's appendix serves as a template for this validation.
  • The choice of base LLM should weigh instruction-following capability alongside reasoning strength; general-purpose models may be better initializers for tool-use learning via RL.
  • The quality of the search-engine infrastructure used during training is critical for efficient and stable learning; dense retrieval methods such as E5 are recommended.
  • At inference time, prioritizing a higher-quality retriever (even a costly API such as Google Search) is likely to yield better results.
  • Training requires significant computational resources (the paper's experiments use 8 NVIDIA H100 GPUs).
  • The framework relies on defining a clear reward function that reflects the task objective. While outcome rewards are effective, designing intermediate rewards is complex and may not be universally beneficial.

In summary, the paper provides valuable empirical evidence and practical guidance for building RL-trained LLM search agents, highlighting the importance of format rewards, starting from capable base LLMs, and leveraging high-quality search engines during both training and inference.
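
To make the inference-time retriever swap referenced above concrete, here is a minimal evaluation sketch that scores one trained policy against several retrieval backends. It reuses the rollout helper sketched earlier; exact_match and the backend names are hypothetical and not part of the Search-R1 codebase.

```python
def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate_backends(generate, search_backends: dict, dataset: list[dict]) -> dict[str, float]:
    """Score the same trained policy with each candidate search engine.
    Assumes rollout() from the earlier sketch is in scope."""
    scores = {}
    for name, search in search_backends.items():
        hits = [
            exact_match(rollout(ex["question"], generate, search), ex["answer"])
            for ex in dataset
        ]
        scores[name] = sum(hits) / max(len(hits), 1)
    return scores

# Example (hypothetical backends): a policy trained with a local E5 index can be
# evaluated with a stronger engine at inference time.
# evaluate_backends(policy_generate,
#                   {"bm25": bm25_search, "e5": e5_search, "google": google_search_api},
#                   dev_set)
```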