s3: You Don't Need That Much Data to Train a Search Agent via RL (2505.14146v1)

Published 20 May 2025 in cs.AI and cs.CL

Abstract: Retrieval-augmented generation (RAG) systems empower LLMs to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.

Summary

  • The paper introduces s3, an RL-driven framework that decouples search and generation to enhance retrieval quality with minimal training data.
  • It employs a novel Gain Beyond RAG reward and a selective document retrieval strategy to improve generation accuracy over baseline methods.
  • Experiments demonstrate that s3 achieves superior zero-shot and data-efficient performance on both general and medical QA benchmarks.

The paper "s3: You Don't Need That Much Data to Train a Search Agent via RL" (2505.14146) introduces s3 (Optimized Search-Select-Serve), a lightweight, model-agnostic framework for training a search-only agent in Retrieval-Augmented Generation (RAG) systems using Reinforcement Learning (RL). The core idea is to decouple the searcher LLM from the generator LLM, training only the searcher to improve information retrieval quality for a frozen, potentially black-box, generator. This approach aims to overcome limitations of existing methods that either optimize retrieval using search-only metrics disconnected from downstream utility or fine-tune the entire LLM, entangling retrieval with generation and hindering compatibility.

s3 Framework and Methodology

The s3 framework consists of a tunable searcher LLM (policy $\pi_{s3}$), a standard search engine ($\mathcal{R}$), and a frozen generator LLM ($\mathcal{G}$). The process operates as follows:

  1. Initialization: For a given question $Q$, s3 first performs a naive RAG step, retrieving the top-$k$ documents $\mathcal{D}_0 = \mathcal{R}(Q)$ using the original question as the query $q_0 = Q$. A subset $\mathcal{D}_0^{\text{sel}} \subseteq \mathcal{D}_0$ is selected to form the initial context.
  2. Multi-Turn Search-Select Loop: The searcher then iteratively refines the context over a series of search rounds ($t = 1, 2, \dots, T$):
    • Query Generation: The searcher LLM emits a query $q_t$ formatted as <query>...</query>.
    • Search: The search engine retrieves documents $\mathcal{D}_t = \mathcal{R}(q_t)$, presented to the searcher as <information>...</information>.
    • Select: The searcher selects a subset of useful documents $\mathcal{D}_t^{\text{sel}} \subseteq \mathcal{D}_t$, indicated by <important_info>...</important_info> (e.g., <important_info>[1, 3]</important_info>).
    • Stop Decision: The searcher decides whether to continue searching by outputting <search_complete>[1/0]</search_complete>. The loop terminates when search_complete is true or a turn limit is reached.
  3. Serve: The final accumulated context, $\mathcal{D}_{s3} = \bigcup_{t=0}^{T} \mathcal{D}_t^{\text{sel}}$, is passed to the frozen generator LLM $\mathcal{G}$ to produce the final answer $\hat{A} = \mathcal{G}(Q, \mathcal{D}_{s3})$.
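
Taken together, the loop can be sketched roughly as follows (a minimal Python sketch assuming generic searcher, search_engine, and generator interfaces; all names, signatures, and default values are illustrative rather than the paper's implementation):

    # Rough sketch of the full Search-Select-Serve loop around a frozen generator (hypothetical interfaces).
    def s3_answer(question, searcher, search_engine, generator, top_k=8, max_turns=4):
        # Initialization: naive RAG with the original question as the first query.
        docs = search_engine.search(question, top_k=top_k)
        context = searcher.select(question, docs)             # D_0^sel
        # Multi-turn search-select loop.
        for _ in range(max_turns):
            query, search_complete = searcher.next_query(question, context)
            if search_complete:                               # <search_complete>1</search_complete>
                break
            new_docs = search_engine.search(query, top_k=top_k)
            context += searcher.select(question, new_docs)    # accumulate D_t^sel
        # Serve: the frozen generator answers from the accumulated context D_s3.
        return generator.generate(question, context)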

Training with Gain Beyond RAG (GBR) Reward

The searcher policy $\pi_{s3}$ is trained using RL with a novel reward signal called Gain Beyond RAG (GBR). GBR measures the improvement in generation accuracy achieved by s3's retrieved context over the context from naive top-$k$ RAG:

$$\text{GBR}(Q) = \text{Acc}(\mathcal{G}(Q, \mathcal{D}_{s3}), A) - \text{Acc}(\mathcal{G}(Q, \mathcal{D}_{\text{RAG}}), A)$$

where $A$ is the gold-standard answer, $\mathcal{D}_{\text{RAG}}$ is the set of documents retrieved by a simple top-$k$ RAG using the original question, and $\text{Acc}(\cdot)$ is a generation accuracy metric.

To improve training efficiency, the baseline accuracy term $\text{Acc}(\mathcal{G}(Q, \mathcal{D}_{\text{RAG}}), A)$ is precomputed. Training focuses on examples where this baseline accuracy is 0, so s3 learns on "hard" queries where naive RAG fails.
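
In code, the reward and the hard-example filtering could look roughly like this (a sketch only; generator, search_engine, and gen_acc are placeholder interfaces, and the paper's implementation may differ):

    # Sketch of the Gain Beyond RAG (GBR) reward; all interfaces are placeholders.
    def gbr_reward(question, gold_answer, s3_docs, rag_docs, generator, gen_acc):
        acc_s3 = gen_acc(generator.generate(question, s3_docs), gold_answer)
        acc_rag = gen_acc(generator.generate(question, rag_docs), gold_answer)  # precomputed in practice
        return acc_s3 - acc_rag

    def select_hard_examples(examples, search_engine, generator, gen_acc, top_k=8):
        # Keep only questions where naive top-k RAG fails (baseline accuracy 0), so the
        # precomputed baseline term vanishes and GBR reduces to Acc(G(Q, D_s3), A).
        hard = []
        for question, gold_answer in examples:
            rag_docs = search_engine.search(question, top_k=top_k)
            if gen_acc(generator.generate(question, rag_docs), gold_answer) == 0:
                hard.append((question, gold_answer))
        return hard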

The search policy is optimized using Proximal Policy Optimization (PPO). During training, the generator LLM $\mathcal{G}$ remains frozen, and gradients are backpropagated only through the searcher policy $\pi_{s3}$.

Evaluation Metric: Generation Accuracy (GenAcc)

The paper employs a custom metric, Generation Accuracy (GenAcc), to evaluate answer quality. For a prediction $p$ and gold answers $\mathcal{A}$:

$$\text{GenAcc} = \text{span\_check} \lor \text{judge\_check}$$

  1. span_check: This performs a normalized string comparison. If any gold answer (after lowercasing and removing punctuation and articles) appears as a token span within the normalized prediction, GenAcc is 1.
  2. judge_check: If span_check fails, an LLM is prompted: "Does $p$ contain any of $\mathcal{A}$? Directly answer with 'yes' or 'no'." If the LLM answers 'yes', GenAcc is 1. This metric is designed to be more robust than exact match (EM) and to align better with human judgment (96.4% agreement reported in the paper).
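
A simplified sketch of this metric, with the normalization and judge prompt paraphrased from the description above (judge_llm is a placeholder interface; the paper's exact rules and prompt wording may differ):

    # Simplified sketch of GenAcc = span_check OR judge_check.
    import re
    import string

    def norm_tokens(text):
        # Lowercase, strip punctuation and articles, then tokenize on whitespace.
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return text.split()

    def span_check(prediction, gold_answers):
        pred = norm_tokens(prediction)
        for answer in gold_answers:
            gold = norm_tokens(answer)
            if gold and any(pred[i:i + len(gold)] == gold for i in range(len(pred) - len(gold) + 1)):
                return True
        return False

    def judge_check(prediction, gold_answers, judge_llm):
        prompt = (f"Does the following text contain any of the answers {gold_answers}? "
                  f"Text: {prediction}\nDirectly answer with 'yes' or 'no'.")
        return judge_llm.generate(prompt).strip().lower().startswith("yes")

    def gen_acc(prediction, gold_answers, judge_llm):
        # The judge is consulted only if the cheap span check fails.
        if span_check(prediction, gold_answers):
            return 1
        return 1 if judge_check(prediction, gold_answers, judge_llm) else 0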

Experimental Setup and Implementation Details

  • Searcher LLM: Qwen2.5-7B-Instruct is used for training the s3 searcher.
  • Generator LLM (for GBR computation & final answer): Qwen2.5-14B-Instruct (GPTQ-Int4 version) is used during training for reward calculation and answer generation. For evaluation, experiments test with Qwen2.5-7B/14B-Instruct and Claude-3-Haiku as frozen generators.
  • Judge LLM (for GenAcc): Qwen2.5-14B-Instruct for training, Claude-3-Haiku for evaluation.
  • Retriever: E5-base-v2 with Wikipedia-2018 as the primary corpus (an illustrative retrieval sketch follows this list). Medical QA also uses a Wikipedia+PubMed+Textbook corpus.
  • Training Data: 2.4k samples derived from NQ and HotpotQA, specifically focusing on examples where naive RAG fails. This is significantly less than baselines like Search-R1 (170k) and DeepRetrieval (70k).
  • RL Training: Implemented using the VERL framework and RAGEN architecture, on 5 NVIDIA A100 80GB GPUs. PPO is used with specific hyperparameters (batch size 120, actor/critic LRs, etc.). vLLM is used for efficient LLM inference.
  • Baselines: Compared against end-to-end fine-tuned models (Search-R1), static retrieval methods (RAG-BM25, RAG-E5), and active retrieval methods (IRCoT, Search-o1, DeepRetrieval).
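
The E5 retriever listed above can be served as a simple dense search backend. A rough sketch, assuming the intfloat/e5-base-v2 checkpoint and the sentence-transformers library (corpus chunking, indexing at scale, and the paper's actual serving stack are not shown):

    # Illustrative dense retrieval with an E5 encoder; usage conventions assumed, not taken from the paper.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("intfloat/e5-base-v2")

    def build_index(passages):
        # E5 expects a "passage: " prefix for documents.
        return encoder.encode([f"passage: {p}" for p in passages], normalize_embeddings=True)

    def search(query, passages, index, top_k=8):
        # ...and a "query: " prefix for queries; rank by dot product of normalized embeddings.
        q = encoder.encode([f"query: {query}"], normalize_embeddings=True)[0]
        scores = index @ q
        top = np.argsort(-scores)[:top_k]
        return [passages[i] for i in top]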

Key Results and Findings

  1. High Performance with Data Efficiency: s3, trained on only 2.4k examples, consistently outperformed baselines (including those trained on over 70x more data) across six general QA and five medical QA benchmarks. For instance, with Claude-3-Haiku as the generator, s3 achieved an average GenAcc of 58.9% on general QA, surpassing Search-R1-7B (Ret) (57.8%) and IRCoT-14B (54.7%).
  2. Superiority of Searcher-Only Optimization: The results suggest that decoupling the searcher and generator and focusing RL on the searcher (Takeaway #1) is more effective for RAG than end-to-end fine-tuning or optimizing for retrieval metrics alone. s3's search quality improvements directly translate to better downstream generation.
  3. Domain Transferability: s3, trained solely on general domain QA data (NQ, HotpotQA), demonstrated strong zero-shot performance on medical QA datasets (MIRAGE benchmark), outperforming methods specifically designed or prompted for complex reasoning (Takeaway #2). For example, on the Wikipedia+PubMed+Textbook corpus, s3 achieved 76.6% average GenAcc.
  4. Training Efficiency: s3 converges rapidly, requiring only ~20 PPO steps (2.4k examples). While per-step time is higher due to LLM-based reward computation (about 5.7 minutes per step), the total training time of roughly 114 minutes is far lower than Search-R1's (~3,780 minutes).
  5. Importance of Reward Function: The GenAcc metric proved crucial (Takeaway #3). Compared to using EM or simple span-based rewards for GBR, GenAcc led to better search policies and final RAG performance, aligning closely with human evaluation.
  6. Ablation Studies:
    • "Begin with Search" (initializing with QQ): This component is critical. Removing it consistently led to a significant performance drop.
    • Document Selection: While removing the selection step (i.e., passing all retrieved documents) sometimes yielded slightly better results on specific datasets, the full s3 with selection drastically reduced input token usage (2.6x ~ 4.2x fewer tokens), improving overall efficiency.

Practical Implementation Considerations

  • Modular Design: s3's decoupling allows practitioners to use their preferred frozen generator LLMs (including proprietary ones) without needing to fine-tune them. The searcher can be a smaller, specialized LLM.
  • Computational Cost of GBR: The main computational cost during training is the LLM inference required for GBR calculation (generating an answer with $\mathcal{G}$ and potentially invoking judge_check). However, the extreme data efficiency (few training steps) mitigates this.
  • Prompt Engineering: The specific XML-like tags (<query>, <information>, <important_info>, <search_complete>) define the interaction protocol for the searcher LLM. The prompts provided in the paper's appendix (the search prompt, answer-generation prompt, and answer-check prompt) are key to implementing the system; an illustrative exchange is shown after this list.
    # Pseudocode (Python-style) for one turn of the s3 search-select loop.
    # format_prompt, format_docs_for_llm, and parse_indices are assumed helper functions.
    import re

    def extract_between_tags(text, open_tag, close_tag):
        # Return the content of the first open_tag ... close_tag span, or "" if absent.
        match = re.search(re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), text, re.DOTALL)
        return match.group(1).strip() if match else ""

    def s3_search_turn(current_context, question, searcher_policy, search_engine):
        # 1. Query generation: the searcher emits <query>...</query> for this turn.
        prompt = format_prompt(current_context, question)
        searcher_output = searcher_policy.generate(prompt)
        query_str = extract_between_tags(searcher_output, "<query>", "</query>")

        # 2. Search: retrieve documents and feed them back to the searcher
        #    wrapped in <information>...</information>.
        retrieved_docs = search_engine.search(query_str)
        docs_formatted = format_docs_for_llm(retrieved_docs)  # e.g., "Doc 1 (Title: X): Content Y"
        follow_up = searcher_policy.generate(
            prompt + searcher_output + "<information>" + docs_formatted + "</information>"
        )

        # 3. Select: keep only the documents marked in <important_info>, e.g., [1, 3].
        indices = parse_indices(extract_between_tags(follow_up, "<important_info>", "</important_info>"))
        selected_docs = [retrieved_docs[i - 1] for i in indices]  # tags use 1-based document numbering

        # 4. Stop decision: <search_complete>1</search_complete> (or "True") ends the loop.
        stop_str = extract_between_tags(follow_up, "<search_complete>", "</search_complete>")
        stop_decision = stop_str.strip().lower() in ("1", "true")

        return selected_docs, stop_decision
  • Resource Requirements: Training requires access to multiple GPUs (5x A100s in the paper) and a robust RL training framework (like VERL/RAGEN). Inference with the trained searcher and frozen generator is less demanding but still involves LLM calls.
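
To make the tag protocol concrete, a single (entirely invented) searcher turn might look like the exchange below; only the tags follow the paper, while the question and documents are illustrative:

    searcher:    <query>who composed the opera Norma</query>
    environment: <information>Doc 1 (Title: Norma (opera)): ... opera composed by Vincenzo Bellini ...
                 Doc 2 (Title: Vincenzo Bellini): ... Italian opera composer ...</information>
    searcher:    <important_info>[1, 2]</important_info>
    searcher:    <search_complete>1</search_complete>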

Limitations

  • Dependency on Frozen Generator Quality: The effectiveness of s3 relies on the chosen frozen generator's ability to utilize the improved context.
  • Reward Estimation Bottleneck: LLM-based reward calculation (GenAcc) is computationally more expensive per step than simpler rewards, though overall training is fast due to few steps.
  • Bias Propagation: Like other RAG systems, s3 can inherit biases from the searcher, generator, and the retrieval corpus.

In conclusion, s3 offers a promising and highly data-efficient method for improving the retrieval component of RAG systems by training a specialized search agent with a novel, generation-aware reward signal. Its modularity and compatibility with frozen LLMs make it a practical approach for enhancing RAG performance without extensive LLM fine-tuning.
