RAG-Gym: Unified Optimization for Agentic RAG

Updated 22 June 2026
  • RAG-Gym is a comprehensive platform for optimizing agentic RAG by unifying retrieval, reasoning, and fine-grained process reward supervision.
  • It employs a nested Markov decision process that integrates iterative query formulation and answer generation with advanced prompt engineering and actor tuning techniques.
  • Empirical evaluations with Llama-3.1-8B-Instruct demonstrate significant F1 improvements, highlighting its practical impact on multi-hop question answering tasks.

Retrieval-augmented generation (RAG) systems enhance LLMs by integrating external knowledge through retrieval mechanisms; however, conventional RAG approaches often employ a single, static retrieval step and lack adaptive strategies for complex, multi-hop reasoning. Agentic RAG advances this paradigm by enabling language agents to engage in multi-round, interleaved interactions with external knowledge sources, but existing systems generally rely on heuristic-driven prompt engineering and lack a unified optimization framework. RAG-Gym addresses these limitations by introducing a comprehensive platform for systematic optimization of agentic RAG, centered on a unified Markov decision process (MDP) formulation, fine-grained process reward supervision, and three orthogonal optimization dimensions: prompt engineering, actor tuning, and critic training (Xiong et al., 19 Feb 2025).

1. Formal Structure: Nested Markov Decision Process Design

RAG-Gym formulates knowledge-intensive question answering (QA) as a two-level nested MDP. The outer MDP’s state at time $t$, $s_t=(\mathcal{Q}, \mathcal{H}_t)$, encapsulates the original question $\mathcal{Q}$ and a retrieval history $\mathcal{H}_t=\{(q_i, D_i)\}_{i<t}$ comprising all prior search queries and their corresponding documents. The action space $\mathcal{A} = \mathcal{A}_q \cup \mathcal{A}_p$ includes both the issuance of new search queries $q_t \in \mathcal{A}_q$ and submission of a final answer $p \in \mathcal{A}_p$. On querying, the retrieval environment $\mathrm{IR}(q_t)$ transitions the system state by returning $D_t$; outputting an answer terminates the episode.
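The outer-level state and transition described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `State`, `Query`, `Answer`, and `step`, and the `retrieve` callable standing in for the retrieval environment, are assumptions made for the example and are not taken from the RAG-Gym codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class Query:
    """A search action q_t (hypothetical name)."""
    text: str


@dataclass(frozen=True)
class Answer:
    """A final-answer action p (hypothetical name)."""
    text: str


Action = Union[Query, Answer]


@dataclass
class State:
    """Outer-MDP state s_t = (Q, H_t)."""
    question: str
    # H_t: list of (query, retrieved documents) pairs from earlier steps
    history: List[Tuple[str, List[str]]] = field(default_factory=list)


def step(state: State, action: Action,
         retrieve: Callable[[str], List[str]]) -> Tuple[State, bool]:
    """Apply one outer-MDP transition and return (next_state, done)."""
    if isinstance(action, Answer):
        # Submitting an answer terminates the episode; the state is unchanged.
        return state, True
    # A query calls the retrieval environment and extends the history.
    docs = retrieve(action.text)
    next_history = state.history + [(action.text, docs)]
    return State(state.question, next_history), False
```

Building a new `State` on each query, rather than mutating the old one, keeps earlier states available for later annotation of individual state–action pairs.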

RAG-Gym’s reward structure combines a sparse outcome reward (typically zero for intermediate steps and either F1 or accuracy for the final answer) with dense process rewards at each outer-MDP step. Process rewards assess each high-level agent action for sufficiency (necessity), utility (precision), and novelty (non-redundancy), enabling fine-grained, stepwise supervision. State–action pairs $(s_t, a_t)$ are annotated via a ranking process leveraging GPT-4o, shown to provide human-level annotation fidelity.
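One way a ranking over candidate actions at a state could be turned into preference data is sketched below. The function name `ranking_to_pairs` is hypothetical, and pairing every higher-ranked action with every lower-ranked one is an assumption made for illustration; the text above does not say whether all pairs or only best-versus-rest pairs are used.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked_actions: List[str]) -> List[Tuple[str, str]]:
    """Convert a best-first ranking into (preferred, rejected) pairs.

    Assumption: every higher-ranked action is preferred over every
    lower-ranked one at the same state.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return list(combinations(ranked_actions, 2))


# Example: three candidate queries ranked by an annotator, best first.
pairs = ranking_to_pairs(["query A", "query B", "query C"])
```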

2. Prompt Engineering: The ReSearch and Re$^2$Search Agents

A central innovation in RAG-Gym is the ReSearch agent architecture, which tightly interleaves reasoning and retrieval at every decision. Each step in ReSearch consists of:

  1. History summarization: Retrieved documents $D_i$ are distilled into concise summaries to form a summarized retrieval history.
  2. Structured reasoning: The agent generates a partial answer, explicitly identifying any “unverified claims,” i.e., factual assertions not yet supported by the current retrieval corpus.
  3. Query formulation: The first unverified claim is directly translated into a new targeted query for retrieval.

This “reason-then-reflect” prompting conditions the agent to generate retrieval steps that are explicitly motivated by identified knowledge deficits, rather than by ad hoc turn-taking heuristics. Empirical evaluation in zero-shot settings (Llama-3.1-8B-Instruct) shows that ReSearch outperforms Search-o1 by 2.9 F1 points (54.7 vs. 51.8), suggesting that coupling reasoning with query generation is beneficial.
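The three steps of a ReSearch decision can be sketched as a single function. This is a simplified illustration, not the authors' prompt or code: `summarize` and `reason` are hypothetical callables standing in for LLM calls, the unverified claim is used verbatim as the query text, and answering once no unverified claims remain is an assumption of the sketch.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, List[str]]]       # (query, retrieved documents)
Summaries = List[Tuple[str, str]]           # (query, summary of its documents)


def research_step(
    question: str,
    history: History,
    summarize: Callable[[str, List[str]], str],
    reason: Callable[[str, Summaries], Tuple[str, List[str]]],
) -> Tuple[str, str]:
    """Return ("query", text) or ("answer", text) for one decision step."""
    # 1. History summarization: condense each retrieved document set.
    summaries = [(q, summarize(q, docs)) for q, docs in history]
    # 2. Structured reasoning: draft an answer and list unverified claims.
    partial_answer, unverified_claims = reason(question, summaries)
    # 3. Query formulation: target the first unverified claim, if any.
    if unverified_claims:
        return ("query", unverified_claims[0])
    # Assumption: with nothing left to verify, the draft becomes the answer.
    return ("answer", partial_answer)
```

Keeping the two LLM-backed steps as injected callables makes the control flow testable without a model.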

Building on ReSearch, RAG-Gym introduces further enhancements in Re$^2$Search++ through systematic optimization along all three available dimensions, achieving consistently higher downstream performance than contemporary methods such as Search-R1.

3. Actor Tuning: Supervised Fine-Tuning and Direct Preference Optimization

RAG-Gym supports two principal approaches for post-training policy improvement:

  • Supervised Fine-Tuning (SFT): The policy $\pi_\theta$ is optimized on a process-annotated dataset by minimizing the negative log-likelihood loss $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(s_t, a_t^+) \sim \mathcal{D}}\left[\log \pi_\theta(a_t^+ \mid s_t)\right]$, where only preferred actions $a_t^+$ receive loss credit.
  • Direct Preference Optimization (DPO): Preference pairs $(a_t^+, a_t^-)$, representing superior and inferior actions at state $s_t$, are used to train the policy via the contrastive loss:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(s_t, a_t^+, a_t^-) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(a_t^+ \mid s_t)}{\pi_{\mathrm{ref}}(a_t^+ \mid s_t)} - \beta \log \frac{\pi_\theta(a_t^- \mid s_t)}{\pi_{\mathrm{ref}}(a_t^- \mid s_t)}\right)\right]$$

with temperature $\beta$ and fixed reference policy $\pi_{\mathrm{ref}}$.
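For a single preference pair, the standard DPO loss can be computed from four sequence log-probabilities. The sketch below uses plain Python; the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values reported for RAG-Gym.

```python
import math


def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) action pair.

    logp_*     : log-probability of the action under the trained policy
    ref_logp_* : log-probability of the same action under the reference policy
    beta       : temperature (0.1 is an assumed placeholder value)
    """
    # Scaled difference of log-ratios between preferred and rejected actions.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the loss is log(2) for any pair.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
```

The loss falls below log 2 only when the policy raises the preferred action's log-ratio above the rejected action's, which is the mechanism by which the rejected action is penalized directly.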

Empirical findings reveal that DPO delivers greater improvements, particularly for agents that must alternate between reasoning and querying (e.g., ReAct, Search-o1, ReSearch). DPO’s contrastive structure not only reinforces desirable behaviors but directly penalizes undesired ones, effectively suppressing low-utility or redundant queries—a crucial property for agentic retrieval scenarios.

4. Critic Training: Process Reward Modeling and Inference-Time Best-of-N Selection

RAG-Gym also enables training a separate reward model (critic) $r_\phi(s_t, a_t)$ using process-annotated preference pairs. The critic is optimized via the pairwise cross-entropy loss:

$$\mathcal{L}_{\mathrm{PRM}}(\phi) = -\mathbb{E}_{(s_t, a_t^+, a_t^-) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(s_t, a_t^+) - r_\phi(s_t, a_t^-)\right)\right]$$

During inference, candidate actions $\{a_t^{(1)}, \ldots, a_t^{(N)}\}$ are sampled from the base LLM policy, scored by the critic, and the action maximizing $r_\phi(s_t, a_t^{(i)})$ is selected. This “best-of-$N$” inference strategy does not require access to agent model weights and is compatible with black-box or proprietary LLM systems.

Algorithmically, PRM-guided inference consists of repeated candidate generation and critic-based ranking per decision step, terminating when an answer action is selected.
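The loop just described can be sketched as follows. All names here (`prm_guided_inference`, `sample_actions`, `critic`, `retrieve`) are hypothetical stand-ins, and the `max_steps` cap is an assumption added so the sketch always terminates; the text does not specify a step limit.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]                    # ("query", text) or ("answer", text)
History = List[Tuple[str, List[str]]]       # (query, retrieved documents)


def prm_guided_inference(
    question: str,
    sample_actions: Callable[[str, History, int], List[Action]],
    critic: Callable[[str, History, Action], float],
    retrieve: Callable[[str], List[str]],
    n: int = 10,
    max_steps: int = 8,
) -> Optional[str]:
    """Best-of-N action selection at every step, guided by a critic."""
    history: History = []
    for _ in range(max_steps):
        # Sample N candidate actions from the (possibly black-box) policy.
        candidates = sample_actions(question, history, n)
        # Keep the candidate the critic scores highest.
        kind, text = max(candidates, key=lambda a: critic(question, history, a))
        if kind == "answer":
            return text
        # Otherwise execute the query and extend the retrieval history.
        history.append((text, retrieve(text)))
    return None  # no answer action was selected within max_steps
```

Because the policy is only ever called through `sample_actions`, the same loop applies to an API-only model whose weights are not accessible.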

5. Experimental Evaluation and Scaling Laws

All benchmark evaluations deploy Llama-3.1-8B-Instruct, Wikipedia and medical corpora (for MedQA) as retrieval backends, and Reciprocal Rank Fusion using BM25 and BGE embeddings. The following table summarizes core results (average F1 across HotpotQA, 2Wiki, Bamboogle, MedQA):
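Reciprocal Rank Fusion merges several ranked lists by summing 1/(k + rank) for each document across the lists. A minimal sketch follows; the function name is illustrative, and `k=60` is the constant commonly used with RRF, assumed here because the text does not state the value used in these experiments.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse best-first rankings of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy example: one lexical (BM25-style) and one dense (embedding-style) ranking.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

Since only ranks enter the score, the lexical and dense retrievers do not need comparable score scales.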

| Method | Average F1 | Relative gain vs. zero-shot Search-o1 |
|---|---|---|
| Zero-Shot Search-o1 | 51.8 | - |
| RAG-Gym + SFT (Search-o1) | 55.2 | +6.6% |
| RAG-Gym + DPO (Search-o1) | 58.2 | +12.4% |
| RAG-Gym + PRM (ReSearch) | 62.4 | +20.6% |

Improvements via DPO tuning range from 3.2% to 11.6% relative to Search-o1, while PRM applied to ReSearch achieves an F1 of 62.41%, representing a 25.6% relative gain over the zero-shot ReAct baseline.
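As a worked check, the relative gains in the table are consistent with the usual definition of relative improvement over the zero-shot Search-o1 baseline (51.8 F1). The SFT and DPO rows are reproduced here; recomputing from rounded table values may differ from reported figures in the last digit.

```latex
\text{relative gain} = \frac{F1_{\text{method}} - F1_{\text{baseline}}}{F1_{\text{baseline}}},
\qquad
\frac{55.2 - 51.8}{51.8} \approx 6.6\%,
\qquad
\frac{58.2 - 51.8}{51.8} \approx 12.4\%.
```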

On MedQA, PRM annotation sources are compared:

  • Outcome reward (final accuracy): 66.8%
  • Rollout-based PRM: 68.3% (expert agreement: 71.0%)
  • GPT-4o annotated PRM: 71.96% (expert agreement: 85.9%)

GPT-4o-based process reward annotations provide the highest expert alignment and downstream QA accuracy. Training efficiency experiments show that PRM requires as few as 250 annotated trajectories to achieve most of the attainable improvement, with additional gains for MedQA up to 1000 samples. Best-of-$N$ sampling improves F1 by 4–6 points for $N$ up to 10, with diminishing returns beyond that.

6. Applied Recommendations and Practical Implications

RAG-Gym’s systematic optimization framework yields several actionable guidelines for practitioners:

  • Explicitly couple reasoning and retrieval within agent prompts in the style of ReSearch to maximize query utility.
  • Leverage advanced LLM annotators such as GPT-4o to obtain high-fidelity process rewards for a few hundred trajectories to drive process-level supervision.
  • Apply DPO for actor tuning when access to model weights is available; otherwise, deploy PRM-based critic selection with best-of-$N$ inference for black-box systems.

This systematic methodology transforms previously ad hoc agentic RAG systems into unified, high-performance agents suitable for open-domain and domain-intensive tasks. Empirically, RAG-Gym demonstrates up to 25.6% relative F1 improvements on multi-hop QA, while also establishing clear scaling laws and annotation cost–effectiveness (Xiong et al., 19 Feb 2025).
