REX-RAG: Policy Correction in RAG

Updated 4 July 2026

REX-RAG is a reinforcement learning framework for retrieval-augmented generation that overcomes dead-end reasoning through mixed sampling and prompt-guided probes.
The approach incorporates a policy correction mechanism using importance sampling to adjust for off-policy data and maintain gradient accuracy.
Empirical evaluations show significant gains in multi-hop QA, improving average exact match scores by up to 5.1 points on benchmark datasets.

REX-RAG, short for Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation, is a reinforcement-learning framework for retrieval-augmented generation that targets a specific failure mode in RL-trained reasoning agents: the tendency to become trapped in dead-end reasoning trajectories. It introduces a Mixed Sampling Strategy that explores alternative reasoning continuations by inserting prompt-guided probes into failed rollouts, and a Policy Correction Mechanism that uses importance sampling to compensate for the resulting off-policy distribution shift during training. The framework is built on top of GRPO-style RL for search-enabled question answering and was evaluated on seven open-domain QA benchmarks with reported average gains of 5.1 points on Qwen2.5-3B and 3.6 points on Qwen2.5-7B over strong baselines (Jiang et al., 11 Aug 2025).

1. Problem formulation and motivation

REX-RAG is motivated by the observation that, during policy-driven rollout collection, LLMs often commit too early to incorrect reasoning paths, yielding trajectories that repeatedly end in the wrong answer. The paper calls these failure modes dead ends: situations where, across multiple rollouts for a question, the model remains stuck in unproductive solution paths and fails to discover a correct alternative. On Qwen2.5-3B, the reported dead-end incidence under self-reflection baselines exceeds 85% early in RL training, which the paper presents as direct evidence that ordinary on-policy exploration is pathologically narrow (Jiang et al., 11 Aug 2025).

The underlying setting is RL with verifiable rewards for search-enabled question answering. For each question $q$, the model alternates between reasoning, issuing search queries, receiving retrieved evidence, continuing reasoning, and generating a final answer. A trajectory therefore contains reasoning tokens, tool-call tokens, retrieved-information insertions, and answer tokens. The environment includes a dataset $\mathcal{D}=\{(q_i,a_i)\}_{i=1}^n$, an external retriever $\mathcal{R}$, and structured interaction tokens for search and reasoning. The reward is binary exact match,

$r = \mathrm{EM}(\text{ans}_{\mathrm{pred}}, \text{ans}_{\mathrm{gold}}).$
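A binary exact-match reward can be sketched as follows. The source only states that the reward is binary EM; the normalization steps here (lowercasing, stripping punctuation and English articles) follow a common QA evaluation convention and are an assumption, as are the function names.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    This normalization is an assumed convention, not taken from the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(pred: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred_norm = normalize_answer(pred)
    return float(any(pred_norm == normalize_answer(g) for g in gold_answers))
```

Because the reward takes only the values 0 and 1, a group in which every rollout fails carries no information about which partial reasoning steps were useful.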

REX-RAG’s core claim is that RL-enhanced RAG fails not only because rewards are sparse, but because exploration repeatedly revisits locally coherent but incorrect reasoning modes. A plausible implication is that, in search-enabled RAG, exploration quality becomes a first-order optimization problem rather than a secondary sampling detail.

2. Mixed Sampling Strategy

The first major component is the Mixed Sampling Strategy. Instead of collecting only on-policy rollouts from the current policy $\pi_\theta$, REX-RAG mixes ordinary rollouts with additional probe trajectories obtained by modifying failed trajectories. The mixed behavior policy is written as

$\mu = \{\pi_\theta, \pi_\varepsilon\}.$

For each question, the method first samples $n$ trajectories with rewards $\{r_1,\dots,r_n\}$. Each trajectory is then resampled with probability

$p(1-r_i),$

where $p \in [0,1]$ controls exploration intensity. Because the reward is exact match, correct trajectories are never resampled and incorrect trajectories are resampled with probability $p$. The expected number of resampled trajectories is

$\sum_{i=1}^{n} p(1-r_i) = p\left(n - \sum_{i=1}^{n} r_i\right).$
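The resampling rule can be illustrated with a short sketch. The function names and the value `p=0.5` are illustrative assumptions; only the rule "resample trajectory i with probability p(1 - r_i)" comes from the text.

```python
import random
from typing import List


def select_for_resampling(rewards: List[float], p: float, rng: random.Random) -> List[int]:
    """Return indices chosen for resampling; index i is chosen with probability p * (1 - r_i)."""
    return [i for i, r in enumerate(rewards) if rng.random() < p * (1.0 - r)]


def expected_resamples(rewards: List[float], p: float) -> float:
    """Expected number of resampled trajectories: sum over i of p * (1 - r_i)."""
    return sum(p * (1.0 - r) for r in rewards)


# Toy group: two correct rollouts (reward 1) and three failed ones (reward 0).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
chosen = select_for_resampling(rewards, p=0.5, rng=random.Random(0))
expected = expected_resamples(rewards, p=0.5)  # three failures, each with probability 0.5
```

Correct rollouts have selection probability zero, so `chosen` can only ever contain indices of failed trajectories.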

A probe trajectory is constructed by preserving the useful prefix of a failed rollout, inserting an exploratory prompt, and then continuing generation with the target model:

$o' = o_{\mathrm{origin}} \oplus o_{\mathrm{prompt}} \oplus o_{\mathrm{probe}}.$

Here, $o_{\mathrm{origin}}$ is the original rollout prefix up to the failed point, $o_{\mathrm{prompt}}$ is an exploratory prompt drawn from a prompt pool $\mathcal{P}$, and $o_{\mathrm{probe}}$ is a new continuation generated by $\pi_\theta$.
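The construction of a probe trajectory can be sketched as a three-segment concatenation. Everything named here is hypothetical: `build_probe`, the toy prompts (which are not the prompts listed in the paper's appendix), and `toy_continue`, a stand-in for sampling a continuation from the policy. How the truncation point of the prefix is chosen is not modeled; the prefix is taken as given.

```python
import random
from typing import Callable, Dict, List

Tokens = List[str]


def build_probe(
    prefix: Tokens,
    prompt_pool: List[Tokens],
    continue_fn: Callable[[Tokens], Tokens],
    rng: random.Random,
) -> Dict[str, Tokens]:
    """Assemble one probe trajectory as origin, prompt, and probe segments."""
    prompt = rng.choice(prompt_pool)              # exploratory cue drawn from the pool
    continuation = continue_fn(prefix + prompt)   # model continues after the inserted cue
    return {"origin": prefix, "prompt": prompt, "probe": continuation}


# Hypothetical prompts for illustration only.
toy_pool = [
    ["wait", "let", "me", "reconsider"],
    ["maybe", "another", "search", "helps"],
]


def toy_continue(context: Tokens) -> Tokens:
    """Stand-in for policy sampling; returns a fixed continuation."""
    return ["<search>", "new", "query", "</search>"]


probe = build_probe(["<think>", "first", "guess"], toy_pool, toy_continue, random.Random(0))
full_sequence = probe["origin"] + probe["prompt"] + probe["probe"]
```

Keeping the three segments separate, rather than storing only the concatenation, matters later because each segment is assigned probabilities by a different rule during policy correction.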

The exploratory prompt pool is built by rephrasing a reflection prompt into multiple chain-of-thought-like fragments using GPT-4.5. The appendix lists 30 such prompts. The paper’s interpretation is that probe sampling is more forceful than ordinary self-reflection because it preserves useful context but deliberately inserts a cue that increases the probability of branching into a qualitatively different continuation. In the appendix, REX-RAG is reported to obtain substantial gains with only about 12% additional sampling, whereas simply increasing Search-R1 rollouts by 20% yields almost no improvement (Jiang et al., 11 Aug 2025).

3. Policy correction and corrected GRPO optimization

Because some training trajectories now come from the probe policy $\pi_\varepsilon$, rollout data are no longer purely on-policy. REX-RAG therefore introduces a Policy Correction Mechanism to reduce gradient bias caused by mixed sampling. The paper identifies the main failure mode of naive optimization as a distribution mismatch that can overweight inserted prompt spans and distort token-level gradients (Jiang et al., 11 Aug 2025).

The first correction stage is trajectory filtering. Probe trajectories are retained only if they remain reasonably compatible with the current policy. For question $q$, the size of the retained set is governed by a retention-ratio hyperparameter that limits the number of kept probe trajectories relative to the original group size $n$.

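The second stage is multiple importance sampling with the balance heuristic. If the fractions of trajectories from $\pi_\theta$ and $\pi_\varepsilon$ are $\lambda_\theta$ and $\lambda_\varepsilon$, respectively, then the effective behavior distribution is the mixture

$\mu(o_t \mid q, o_{<t}) = \lambda_\theta\,\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t}) + \lambda_\varepsilon\,\pi_\varepsilon(o_t \mid q, o_{<t}),$

where $\pi_{\theta_{\mathrm{old}}}$ is the snapshot of $\pi_\theta$ that generated the rollouts. The per-token importance ratio is

$\rho_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\mu(o_t \mid q, o_{<t})}.$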
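A balance-heuristic importance ratio can be sketched for a single token. This assumes the behavior density is the fraction-weighted mixture of the rollout-time policy and the probe policy, which is the standard form of the balance heuristic; the function name, argument names, and numbers are illustrative and not taken from the paper.

```python
def balance_heuristic_ratio(
    p_new: float,
    p_old: float,
    p_probe: float,
    frac_policy: float,
    frac_probe: float,
) -> float:
    """Ratio of the current policy probability to a mixture of behavior probabilities.

    p_new:   token probability under the policy being optimized
    p_old:   token probability under the rollout-time policy snapshot
    p_probe: token probability under the probe policy
    """
    if abs(frac_policy + frac_probe - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    mixture = frac_policy * p_old + frac_probe * p_probe
    return p_new / mixture


# Illustrative numbers only.
on_policy = balance_heuristic_ratio(0.30, 0.25, 0.25, 1.0, 0.0)  # reduces to p_new / p_old
mixed = balance_heuristic_ratio(0.30, 0.25, 0.75, 0.8, 0.2)      # probe policy favors this token
```

When no probe trajectories are present the ratio reduces to the ordinary on-policy ratio, and a token that the probe policy makes much more likely than the model does (such as an inserted prompt token) is down-weighted.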

The probe policy itself is defined segmentwise. Origin tokens are treated as sampled from a truncated version of $\pi_\theta$, prompt tokens are assigned probabilities from an empirical prefix-conditioned PMF built from the prompt pool, and probe-continuation tokens are sampled directly from $\pi_\theta$. The PMF is constructed by tokenizing every prompt in the prompt pool $\mathcal{P}$, collecting next-token frequencies for each prefix, and defining

$P_{\mathcal{P}}(x \mid s) = \frac{\mathrm{count}(s \oplus x)}{\sum_{x'} \mathrm{count}(s \oplus x')},$

where $\mathrm{count}(s \oplus x)$ is the number of prompts in $\mathcal{P}$ whose token sequence begins with prefix $s$ followed by token $x$.
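The prefix-conditioned PMF over prompt tokens can be sketched by counting next tokens after every prefix in the pool. Whitespace tokenization and the three toy prompts are assumptions made for illustration; a real setup would use the model's tokenizer and the actual prompt pool.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


def build_prefix_pmf(prompts: List[List[str]]) -> Dict[Prefix, Dict[str, float]]:
    """Map each observed prefix to the empirical distribution of its next token."""
    counts: Dict[Prefix, Counter] = defaultdict(Counter)
    for tokens in prompts:
        for i, tok in enumerate(tokens):
            counts[tuple(tokens[:i])][tok] += 1
    pmf: Dict[Prefix, Dict[str, float]] = {}
    for prefix, next_counts in counts.items():
        total = sum(next_counts.values())
        pmf[prefix] = {tok: c / total for tok, c in next_counts.items()}
    return pmf


def prompt_token_probs(pmf: Dict[Prefix, Dict[str, float]], tokens: List[str]) -> List[float]:
    """Per-token probabilities of one pool prompt under the prefix-conditioned PMF."""
    return [pmf[tuple(tokens[:i])][tok] for i, tok in enumerate(tokens)]


# Hypothetical pool, whitespace-tokenized for illustration.
toy_pool = [p.split() for p in ["wait let me", "wait maybe", "hmm let"]]
pmf = build_prefix_pmf(toy_pool)
probs = prompt_token_probs(pmf, ["wait", "maybe"])  # P(wait | start), P(maybe | wait)
```

In this toy pool two of three prompts start with "wait", and after "wait" the two observed continuations are equally frequent, so the inserted tokens receive probabilities 2/3 and 1/2 rather than whatever the model itself would have assigned.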

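REX-RAG then replaces the ordinary GRPO ratio with the per-token importance ratio $\rho_{i,t}$ in a clipped objective:

$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\left(\rho_{i,t}\,\hat{A}_{i,t},\ \mathrm{clip}\left(\rho_{i,t},\, 1-\epsilon_{\mathrm{clip}},\, 1+\epsilon_{\mathrm{clip}}\right)\hat{A}_{i,t}\right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right].$

Here $G$ is the number of trajectories in the group, $\hat{A}_{i,t}$ is the advantage, $\epsilon_{\mathrm{clip}}$ is the clip ratio, and $\beta$ is the KL coefficient. The paper gives a theoretical justification via off-policy importance weighting and balance-heuristic MIS, but it does not present a formal convergence theorem (Jiang et al., 11 Aug 2025).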
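The clipped surrogate for one trajectory can be sketched as below, assuming the per-token importance ratios and a shared advantage have already been computed. This is the standard PPO/GRPO-style clipped term with the corrected ratio plugged in; the KL penalty is omitted, and the function names are illustrative. The default clip value 0.2 matches the clip ratio reported in the experimental setting.

```python
from typing import List


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def trajectory_surrogate(ratios: List[float], advantage: float, clip_eps: float = 0.2) -> float:
    """Average clipped surrogate over the tokens of one trajectory (KL term omitted)."""
    return sum(clipped_term(r, advantage, clip_eps) for r in ratios) / len(ratios)


# Illustrative values: a large ratio stops being rewarded once it passes 1 + eps.
capped = clipped_term(1.5, 1.0)          # positive advantage, ratio above the clip range
penalized = clipped_term(1.5, -1.0)      # negative advantage keeps the unclipped value
average = trajectory_surrogate([0.9, 1.0, 1.5], 1.0)
```

The `min` makes the bound one-sided: with a positive advantage, increasing the ratio beyond the clip range yields no further gain, while with a negative advantage the unclipped, more pessimistic value is kept.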

4. Retrieval protocol and experimental setting

REX-RAG is instantiated as a search-enabled RAG agent. The model uses the structured interaction tokens

<think> ... </think> for internal reasoning,
<search> ... </search> for query generation,
<information> ... </information> for returned evidence,
<answer> ... </answer> for final answers.
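Splitting a rollout into its tagged segments can be sketched with a regular expression. This assumes the reasoning tag is `<think>`, that tags are not nested, and that each tag is properly closed; `parse_segments` is a hypothetical helper, not part of any published codebase.

```python
import re
from typing import List, Tuple

# One alternation per interaction tag; \1 forces the closing tag to match the opening one.
TAG_PATTERN = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_segments(text: str) -> List[Tuple[str, str]]:
    """Return (tag, content) pairs in the order they appear in the rollout text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


rollout = (
    "<think>I need the capital.</think>"
    "<search>capital of France</search>"
    "<information>Paris is the capital of France.</information>"
    "<answer>Paris</answer>"
)
segments = parse_segments(rollout)
```

Segment boundaries of this kind are what allow retrieved `<information>` spans to be distinguished from tokens the model generated itself.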

Retrieval is fixed rather than learned end to end. The system uses the December 2018 Wikipedia dump as its knowledge source, E5-base-v2 as the retriever, and FAISS as the retrieval backend. Each search step returns the top-3 documents. The maximum number of search turns is 5, increased from 2 used by Search-R1, and the response length is 500 tokens (Jiang et al., 11 Aug 2025).
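The interaction protocol with a search budget can be sketched as a loop. The limits of 5 search turns and top-3 documents come from the text; `run_episode`, `toy_generate`, and `toy_retrieve` are hypothetical stand-ins for the policy and the E5/FAISS retriever, and the 500-token response limit is not modeled.

```python
import re
from typing import Callable, List, Optional, Tuple

MAX_SEARCH_TURNS = 5  # search budget stated in the text
TOP_K = 3             # documents returned per search step


def run_episode(
    question: str,
    generate_fn: Callable[[str], str],
    retrieve_fn: Callable[[str, int], List[str]],
) -> Tuple[Optional[str], str]:
    """Alternate generation and retrieval until an answer appears or the budget is spent."""
    context = question
    searches = 0
    while True:
        step = generate_fn(context)
        context += step
        answer = re.search(r"<answer>(.*?)</answer>", step, re.DOTALL)
        if answer:
            return answer.group(1).strip(), context
        query = re.search(r"<search>(.*?)</search>", step, re.DOTALL)
        if query is None or searches >= MAX_SEARCH_TURNS:
            return None, context  # no answer produced within the budget
        docs = retrieve_fn(query.group(1).strip(), TOP_K)
        context += "<information>" + "\n".join(docs) + "</information>"
        searches += 1


def toy_generate(context: str) -> str:
    """Stand-in policy: search once, then answer."""
    if "<information>" in context:
        return "<think>found it</think><answer>Paris</answer>"
    return "<think>look it up</think><search>capital of France</search>"


def toy_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever returning the first k entries of a fixed corpus."""
    corpus = ["Paris is the capital of France.", "Lyon is in France.",
              "Berlin is in Germany.", "Rome is in Italy."]
    return corpus[:k]


answer, transcript = run_episode("What is the capital of France?", toy_generate, toy_retrieve)
```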

Training is performed on a merged training set of NQ and HotpotQA, while evaluation covers seven benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. The backbone models are Qwen2.5-3B and Qwen2.5-7B. The RL framework is VERL, and the reported hardware is 8× NVIDIA A800 80GB. Key reported hyperparameters include batch size 512, mini-batch size 256, maximum token length 24,000, clip ratio 0.2, and KL coefficient 0.001, alongside the actor learning rate and the default REX hyperparameters (Jiang et al., 11 Aug 2025).

The advantage signal follows standard GRPO group normalization. For a group of $G$ outputs with rewards $\{r_1,\dots,r_G\}$,

$\tilde{r}_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})},$

$\hat{A}_{i,t} = \tilde{r}_i.$

All tokens in one output therefore share the same advantage.
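Group-normalized advantages can be sketched in a few lines. The use of the population standard deviation and the small constant added to the denominator are assumptions made so the example is well defined when all rewards are equal; the function name is illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / (std + eps).

    eps is an assumed stabilizer for groups whose rewards are all identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed_group = group_advantages([1.0, 0.0, 0.0, 0.0])  # one success among four rollouts
dead_end_group = group_advantages([0.0, 0.0, 0.0, 0.0])  # every rollout failed
```

With binary rewards, a group in which every rollout fails yields all-zero advantages and therefore no learning signal for that question, which is one way to see why dead ends are costly under this normalization.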

5. Empirical performance

The headline empirical result is an improvement over Search-R1-instruct on both model scales. On Qwen2.5-3B, REX-RAG reaches 38.7 average EM versus 33.6 for Search-R1-instruct, a gain of 5.1 points. On Qwen2.5-7B, it reaches 43.2 versus 39.6, a gain of 3.6 points (Jiang et al., 11 Aug 2025).

The gains are especially pronounced on multi-hop reasoning tasks. On Qwen2.5-3B, REX-RAG improves 2Wiki from 31.0 to 39.7 and HotpotQA from 33.1 to 37.4. On Qwen2.5-7B, it raises 2Wiki from 34.6 to 43.7, HotpotQA from 38.6 to 42.2, MuSiQue from 16.2 to 19.7, and Bamboogle from 40.0 to 44.8. The paper interprets this pattern as evidence that escaping dead-end reasoning is particularly valuable for compositional and multi-step QA (Jiang et al., 11 Aug 2025).

Setting	Baseline (EM)	REX-RAG (EM)
Qwen2.5-3B, average	33.6	38.7
Qwen2.5-7B, average	39.6	43.2
Qwen2.5-3B, 2Wiki	31.0	39.7
Qwen2.5-7B, 2Wiki	34.6	43.7

On general QA tasks such as TriviaQA and NQ, gains remain positive but are smaller. This suggests that the failure mode targeted by REX-RAG is more acute when successful answering depends on multiple retrieval and reasoning steps rather than a single retrieval-supported factual lookup.

6. Position within agentic RAG research

REX-RAG belongs to the broader family of agentic RAG systems that interleave retrieval and reasoning, but its emphasis is specifically on RL trajectory exploration rather than on specialist-tool orchestration or offline process supervision. In adjacent work, DecEx-RAG models agentic RAG as an MDP with explicitly defined states and actions, separating decision optimization from execution optimization and constructing process supervision through rollout-based search before applying SFT + DPO (Leng et al., 7 Oct 2025). By contrast, REX-RAG retains a GRPO-style RL formulation and addresses the narrower problem of dead-end exploration through mixed sampling plus off-policy correction.

A different neighboring design appears in CyberRAG, where a central LLM agent orchestrates specialized attack-family classifiers, dense retrieval over curated cyber knowledge bases, and iterative retrieval-and-reason loops for semantic validation and reporting (Blefari et al., 3 Jul 2025). Relative to that pattern, REX-RAG is not a specialist routing system. Its retriever is fixed, its tool space is centered on search-enabled QA, and its novelty lies in how rollout data are generated and corrected during RL rather than in how heterogeneous tools are orchestrated.

A plausible implication is that REX-RAG occupies a distinct subspace of agentic RAG research: it is best understood as a policy-learning framework for search trajectories, not as a general-purpose orchestration architecture.

7. Ablations, limitations, and interpretation

The ablation study identifies the Policy Correction Mechanism as essential. On Qwen2.5-3B, full REX-RAG reaches 38.7 average EM. Replacing the probe-policy definition with Coarse PPD lowers performance to 36.4. Removing importance sampling lowers it to 33.4. Removing trajectory filtering causes the largest collapse, to 28.2. Removing the full correction pipeline yields 29.1. These results support the paper’s claim that exploratory probe trajectories are useful only when accompanied by principled filtering and reweighting (Jiang et al., 11 Aug 2025).
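The size of each ablation effect follows from simple subtraction of the average EM scores quoted above. The dictionary keys are shorthand labels chosen here for illustration.

```python
# Average EM on Qwen2.5-3B, as quoted in the ablation discussion.
full_score = 38.7
ablations = {
    "coarse probe-policy definition": 36.4,
    "without importance sampling": 33.4,
    "without trajectory filtering": 28.2,
    "without full correction pipeline": 29.1,
}

# Drop relative to full REX-RAG, rounded to one decimal like the reported scores.
drops = {name: round(full_score - score, 1) for name, score in ablations.items()}
largest_drop = max(drops, key=drops.get)
```

The drops are 2.3, 5.3, 10.5, and 9.6 points respectively, so removing trajectory filtering alone costs slightly more than removing the whole correction pipeline.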

Prompt diversity also matters. The appendix reports 31.2 for Search-R1, 32.1 for REX-RAG with 5 prompts, and 38.7 for REX-RAG with 30 prompts. The method also appears partially optimizer-agnostic: with DAPO, Search-R1 reaches 34.8 and REX-RAG reaches 38.4; with GRPO, the corresponding values are 31.2 and 38.7 (Jiang et al., 11 Aug 2025).

The paper states three main limitations. First, exploration uses a fixed prompt pool, not dynamically learned prompts. Second, it introduces computational overhead from two-stage sampling and token-level importance correction. Third, it is only validated on RAG-style QA, not on broader agentic tasks. The authors also note that no strong formal convergence guarantee is provided. Taken together, these caveats position REX-RAG as a methodologically specific advance: it demonstrates that explicit exploration engineering, combined with principled off-policy correction, can materially improve RL-trained retrieval-augmented reasoning, especially on multi-hop QA (Jiang et al., 11 Aug 2025).