- The paper shows that incorporating a format reward boosts RL training performance and accelerates convergence, especially when starting from base (non-instruction-tuned) general-purpose models.
- It finds that general-purpose LLM backbones enable more stable search interactions compared to reasoning-specialized models.
- The study reveals that high-quality search engines are essential for achieving stable and efficient agent performance during training and inference.
This paper, "An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents" (2505.15117), explores the use of Reinforcement Learning (RL) to train LLMs to act as agents that can interleave reasoning steps with calls to external search engines. This approach aims to overcome limitations of purely prompt-based or Supervised Fine-Tuning (SFT) methods for building such agents, particularly the need for costly, manually annotated intermediate trajectories in SFT.
The core concept involves training an LLM policy (πθ) to interact with a search engine (R) to solve problems. The agent follows a loop: reasoning → search (query formulation) → context (retrieved information) → reasoning → ... → final answer (a minimal sketch of this loop is given after the list below). Training uses an RL objective that maximizes an outcome-based reward function. The paper conducts empirical studies to understand how three key factors influence this RL training process:
- Reward Formulation: How different types of rewards beyond just the final outcome affect training.
- Underlying LLM Backbone: The impact of the base LLM's characteristics (type and scale).
- Role of the Search Engine: How the search engine used during training and inference influences the agent.
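For concreteness, the sketch below shows what one such reasoning-search rollout loop might look like in Python. The `llm.generate` and `search_engine.retrieve` interfaces, the stop-token handling, and the `<think>` tag are assumptions for illustration only; the paper's actual rollouts are implemented inside its RL training framework.

```python
import re

def rollout(llm, search_engine, question: str, max_turns: int = 4):
    """Sketch of one reasoning-search interleaved episode.

    The policy alternates reasoning and <search> steps; retrieved passages are
    fed back inside <information> tags until it emits a final <answer>.
    """
    trajectory = f"Question: {question}\n"
    for _ in range(max_turns):
        # Generation is assumed to stop just before a closing </search> or </answer> tag.
        step = llm.generate(trajectory, stop=["</search>", "</answer>"])
        trajectory += step
        query = _extract(step, "search")
        if query is not None:
            docs = search_engine.retrieve(query, top_k=3)
            # Close the search tag and append retrieved context for the next turn.
            trajectory += "</search>\n<information>" + " ".join(docs) + "</information>\n"
            continue
        answer = _extract(step, "answer")
        if answer is not None:
            return trajectory + "</answer>", answer
    return trajectory, None  # no final answer within the turn budget

def _extract(text: str, tag: str):
    """Return the content of the last <tag>...</tag> block in `text`, if any."""
    matches = re.findall(rf"<{tag}>(.*?)(?:</{tag}>|$)", text, re.DOTALL)
    return matches[-1].strip() if matches else None
```

During RL training, the finished trajectory and extracted answer would then be scored by the reward function discussed in the next section.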
Key Findings and Practical Implications
The paper provides several actionable insights for practitioners implementing RL-based search agents:
1. Reward Formulation:
- Format Reward: The paper investigates adding a reward for adhering to the required interaction format (e.g., using specific tokens such as <think>, <search>, <information>, and <answer>). The format reward is combined with the outcome reward via a weighting coefficient λf (a sketch of the combined reward appears after this section).
  - Finding: Incorporating a format reward significantly improves final performance, especially when training from a base LLM (vs. an instruction-tuned one), and accelerates RL convergence.
  - Implication: For training agents, particularly from models less proficient at following complex instructions, explicitly rewarding correct interaction format is highly beneficial. Tuning the λf parameter is important: too small a value is ineffective, and too large a value can lead to overfitting to the format over the task outcome.
- Intermediate Retrieval Reward: The paper explores adding a reward component (weighted by λr) for retrieving documents containing the ground-truth answer, even if the final answer is incorrect. This aims to incentivize generating good search queries; the reward function extends the format reward.
  - Finding: Intermediate retrieval rewards did not yield consistent performance improvements and could even degrade performance.
  - Implication: The outcome reward appears sufficient to implicitly encourage effective query formulation, since retrieving relevant information is necessary for finding the correct final answer. Explicit intermediate retrieval rewards based on simple metrics such as substring match may overly constrain the learning process.
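As a concrete illustration of the two reward variants above, here is a minimal reward sketch. The additive combination, the coefficient names `lambda_f` and `lambda_r`, and the exact-match outcome check are assumptions based on the description above, not the paper's exact formula.

```python
def _normalize(s: str) -> str:
    return " ".join(s.lower().strip().split())

def exact_match(predicted: str, gold: str) -> bool:
    """Rule-based outcome check (a simple stand-in for the paper's outcome metric)."""
    return _normalize(predicted) == _normalize(gold)

def retrieval_hit(retrieved_docs: list[str], gold: str) -> bool:
    """Intermediate signal: did any retrieved passage contain the gold answer?"""
    return any(gold.lower() in doc.lower() for doc in retrieved_docs)

def compute_reward(predicted: str, gold: str, retrieved_docs: list[str],
                   format_correct: bool,
                   lambda_f: float = 0.2, lambda_r: float = 0.0) -> float:
    """Outcome reward plus a format term weighted by lambda_f and an optional
    retrieval term weighted by lambda_r (lambda_r = 0 disables it, which is
    what the paper's findings suggest doing)."""
    reward = float(exact_match(predicted, gold))
    reward += lambda_f * float(format_correct)
    reward += lambda_r * float(retrieval_hit(retrieved_docs, gold))
    return reward
```

Here `format_correct` would come from a structure check over the trajectory (see the validator sketch under Implementation Considerations), and the default `lambda_f = 0.2` is purely illustrative.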
2. Underlying LLM Backbone:
- LLM Type (General vs. Reasoning-Specialized): The paper compared training agents using general-purpose LLMs (Qwen2.5-Base) and a reasoning-specialized model (DeepSeek-R1-Distill).
  - Finding: RL training was more stable and effective when initialized with general-purpose LLMs. Reasoning-specialized LLMs struggled more in the early stages, particularly in learning to initiate search calls.
  - Implication: General-purpose LLMs seem to possess sufficient inherent capabilities (including instruction following for tool use) to serve as effective starting points for RL-based agent training, potentially outperforming models specifically fine-tuned for reasoning in tasks requiring external interaction. Instruction-following capability seems crucial for learning to use tools via RL.
- LLM Scale: Experiments were conducted with Qwen2.5 models at 3B, 7B, 14B, and 32B parameters.
  - Finding: Performance generally improves with increased model scale, but with diminishing returns on more challenging datasets.
  - Implication: While larger models are generally better agents, the gain from scaling up is less pronounced compared to tasks relying purely on parametric knowledge. Effective external information retrieval plays a significant role, reducing the sole reliance on model size.
3. Role of the Search Engine:
- Training with Different Search Engines: The paper trained agents using search engines of varying quality: a random retriever, BM25 (sparse), and E5 (dense, with HNSW or exact search).
  - Finding: The quality of the search engine during training strongly impacts learning dynamics. Stronger engines lead to more stable training and higher final performance. Weaker engines result in suboptimal performance, with the agent either avoiding retrieval (random) or making excessive calls to compensate for poor results (BM25).
  - Implication: Using a high-quality search engine during RL training is crucial for developing capable and efficient search agents.
- Inference with Different Search Engines: Agents trained with different search engines were evaluated using various engines, including the Google Search API (a pluggable-retriever sketch follows this section).
  - Finding: LLM search agents trained with a specific engine generalize well to different engines at inference time. Critically, using a stronger search engine during inference consistently leads to improved downstream performance, regardless of the training engine.
  - Implication: For practical deployment, it is beneficial to use the best available search engine at inference time, even if training constraints required using a different, potentially weaker, engine. The learned agent policy can effectively leverage higher-quality retrieved information.
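To illustrate the practical upshot (train with whatever engine is affordable, deploy with the strongest one available), the agent can depend only on a thin retriever interface so the search engine is swappable. The interface and the BM25 example below are illustrative sketches, not the paper's implementation; `rank_bm25` is an assumed third-party dependency.

```python
from typing import Protocol

class Retriever(Protocol):
    """Minimal interface the agent's rollout loop needs from any search engine."""
    def retrieve(self, query: str, top_k: int = 3) -> list[str]: ...

class BM25Retriever:
    """Sparse retrieval over a local corpus (using the rank_bm25 package)."""
    def __init__(self, corpus: list[str]):
        from rank_bm25 import BM25Okapi
        self._corpus = corpus
        self._bm25 = BM25Okapi([doc.split() for doc in corpus])

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        scores = self._bm25.get_scores(query.split())
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self._corpus[i] for i in ranked[:top_k]]

# A dense E5 index or a web-search API wrapper can implement the same interface,
# so an agent trained against BM25 can be pointed at a stronger engine at
# inference time without any change to the policy or rollout code.
```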
Additional Insights from Appendices:
- Long-form QA: Rule-based outcome rewards combined with format rewards are also effective for training agents on long-form, open-ended QA tasks (such as ASQA and ELI5), achieving performance competitive with RAG pipelines and purely reasoning models.
- Data Scaling: While performance generally improves with training data size, learning stable search behavior requires a minimum amount of data (e.g., 100-1000 samples in the paper's setting). Extremely small datasets can lead to overfitting and failure to learn agentic behaviors.
Implementation Considerations:
- Implementing the format reward requires careful parsing of the LLM's output sequence to check for correct tags and structure. The pseudocode in the paper's appendix serves as a template for this validation; a regex-based sketch follows this list.
- The choice of base LLM should consider its instruction-following capabilities alongside its reasoning strength; general-purpose models may be better initializers for tool-use learning via RL.
- The quality of the search engine infrastructure used during training is critical for efficient and stable learning; dense retrieval methods such as E5 are recommended.
- At inference time, prioritizing a higher-quality retriever (even a costly API like Google Search, as used for evaluation in the paper) is likely to yield better results.
- Training requires significant computational resources (the paper's experiments used 8 NVIDIA H100 GPUs).
- The framework relies on defining a clear reward function that reflects the task objective. While outcome rewards are effective, designing intermediate rewards is complex and might not be universally beneficial.
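For the tag-structure validation mentioned in the first bullet above, a regex over the interaction grammar can serve as a starting point. The grammar assumed below (an initial reasoning block, zero or more search/information/reasoning rounds, then one final answer) is modeled on the tags named earlier in this post, not on the appendix's pseudocode.

```python
import re

# Assumed grammar: <think> ... </think>, then zero or more
# <search>/<information>/<think> rounds, then a single <answer> block.
_TRAJECTORY_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*"
    r"(?:<search>.*?</search>\s*<information>.*?</information>\s*<think>.*?</think>\s*)*"
    r"<answer>.*?</answer>\s*$",
    re.DOTALL,
)

def format_correct(trajectory: str) -> bool:
    """Return True if the rollout matches the expected tag structure."""
    return _TRAJECTORY_PATTERN.match(trajectory) is not None

# Example: a well-formed single-round trajectory passes the check.
demo = ("<think>Need the capital of France.</think>"
        "<search>capital of France</search>"
        "<information>Paris is the capital of France.</information>"
        "<think>The answer is Paris.</think>"
        "<answer>Paris</answer>")
assert format_correct(demo)
```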
In summary, the paper provides valuable empirical evidence and practical guidance for building RL-trained LLM search agents, highlighting the importance of format rewards, starting with capable general-purpose base LLMs, and leveraging high-quality search engines during both training and inference.