R3: Rollout Routing Replay in Reinforcement Learning
- Rollout Routing Replay (R3) is a mechanism that records and replays inference routing masks to align training and inference in reinforcement learning.
- R3 enforces consistent expert selections, significantly reducing KL divergence and extreme token probability deviations to stabilize training.
- Related replay-based R3 methods accelerate policy learning and improve sample efficiency in LLM fine-tuning and on-policy control.
Rollout Routing Replay (R3) denotes a class of mechanisms for synchronizing decision pathways or leveraging prior successful experiences in reinforcement learning and combinatorial optimization. While the abbreviation R3 has appeared in several disparate contexts, a recent and prominent application within Mixture-of-Experts (MoE) reinforcement learning is to stabilize training by ensuring consistent expert selections between inference and training phases. R3 also describes replay-based methods that improve policy learning efficiency and reduce computational overhead in LLM fine-tuning and on-policy control. R3 techniques share a unifying design principle: they record and reuse critical routing or trajectory data to align behavior or enhance sample efficiency under policy drift or nonstationary conditions.
1. Definition and Motivations
In MoE reinforcement learning, R3 refers specifically to the method that records the routing (expert selection) made by the inference engine during rollout and replays these same routing decisions in the training forward pass (Ma et al., 13 Oct 2025). This approach directly addresses a documented inconsistency between router outputs generated in inference versus training, where nondeterminism and architectural idiosyncrasies can cause dramatically divergent expert selections for identical inputs. The resulting policy and value distributions exhibit measurable KL divergence and “extreme” token probability ratios, destabilizing RL and in some instances causing catastrophic collapse of training.
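This mismatch can be quantified directly from rollout data. The sketch below is illustrative only (the function name and the `ratio_cap` band are assumptions, not the paper's evaluation code); it estimates the per-token KL divergence and the fraction of extreme tokens from the log-probabilities that the inference engine and the training forward pass assign to the same sampled tokens.

```python
import torch

def mismatch_metrics(train_logprobs: torch.Tensor,
                     infer_logprobs: torch.Tensor,
                     ratio_cap: float = 2.0):
    """Both inputs hold log-probabilities of the *sampled* tokens, one value
    per token, under the training and inference forward passes respectively."""
    # Per-token probability ratio pi_train / pi_infer.
    ratio = (train_logprobs - infer_logprobs).exp()
    # Monte-Carlo estimate of KL(pi_infer || pi_train) over the rollout tokens.
    kl_estimate = (infer_logprobs - train_logprobs).mean()
    # Fraction of "extreme" tokens whose ratio falls outside [1/cap, cap].
    extreme = ((ratio > ratio_cap) | (ratio < 1.0 / ratio_cap)).float().mean()
    return kl_estimate.item(), extreme.item()
```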
Elsewhere, R3 encompasses replay-buffer methodologies in tabular and neural policy learning. These use cyclic or FIFO buffers of high-reward trajectories or routing records to accelerate learning and stabilize optimization (notably in PPO, GRPO, and their derivatives) (Li et al., 26 May 2024, Sun et al., 5 Jun 2025).
2. Mechanism of R3 in MoE Reinforcement Learning
In MoE models, each input token is routed (via a trainable router) to a subset of $K$ experts out of $N$. Standard training computes the router logits $s \in \mathbb{R}^{N}$, applies a TopKMask to produce a binary selection $M = \mathrm{TopKMask}(s)$, and then forms the expert-weighted output via a softmax restricted to the selected experts, $g_i = \frac{M_i \exp(s_i)}{\sum_j M_j \exp(s_j)}$, with $y = \sum_i g_i E_i(x)$. However, inference engines (such as SGLang), or even distinct forward passes in the same engine, may yield different masks for the same input $x$ due to minute numerical differences or stochastic effects.
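For concreteness, a minimal PyTorch-style sketch of this standard gating path (names and shapes are illustrative, not taken from any particular MoE implementation):

```python
import torch
import torch.nn.functional as F

def topk_gate(router_logits: torch.Tensor, k: int):
    """Standard MoE gating: select the top-k experts per token and softmax the
    logits restricted to that selection. router_logits: [tokens, num_experts]."""
    topk_idx = router_logits.topk(k, dim=-1).indices
    mask = torch.zeros_like(router_logits).scatter_(-1, topk_idx, 1.0).bool()
    gated = router_logits.masked_fill(~mask, float('-inf'))
    weights = F.softmax(gated, dim=-1)  # zero weight on unselected experts
    return weights, mask                # output: y = sum_i weights[:, i] * E_i(x)
```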
R3 mandates that, during training, the gating weights are computed using the inference-generated mask $M^{\mathrm{inf}}$, i.e. $\tilde{g}_i = \frac{M^{\mathrm{inf}}_i \exp(s_i)}{\sum_j M^{\mathrm{inf}}_j \exp(s_j)}$, and the output is aggregated as $\tilde{y} = \sum_i \tilde{g}_i E_i(x)$. This preserves router gradient flow, since the logits $s$ still enter the softmax, while enforcing exact agreement with inference-time expert selection. Empirically, this replay mechanism roughly halves the KL divergence between training and inference probability distributions, matching the divergence levels observed in non-MoE (dense) models, and lowers the occurrence of extreme token probability discrepancies by an order of magnitude.
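A corresponding sketch of the replayed gating step, assuming the rollout engine has exported a boolean mask of shape [tokens, num_experts] for the batch (again illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def topk_gate_replayed(router_logits: torch.Tensor,
                       inference_mask: torch.Tensor):
    """R3-style gating: reuse the expert-selection mask recorded during rollout
    instead of recomputing TopK from the training logits. Gradients still reach
    the router because its own logits feed the softmax."""
    gated = router_logits.masked_fill(~inference_mask.bool(), float('-inf'))
    weights = F.softmax(gated, dim=-1)
    return weights  # aggregate as before: y = sum_i weights[:, i] * E_i(x)
```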
3. Contrast with Related Routing and Replay Methods
Previous efforts to address the training/inference divergence in RL for MoE architectures include Group Sequence Policy Optimization (GSPO) and Truncated Importance Sampling (TIS) (Ma et al., 13 Oct 2025). GSPO applies sequence-level importance sampling corrections, while TIS clips extreme policy ratios to constrain update variance. Both act downstream of the routing decisions and attempt to regularize mismatched probability distributions. R3, by contrast, targets the source of the mismatch, routing variability, by enforcing routing mask alignment. This improves stability and mitigates RL collapse without heavy off-policy corrections.
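To make the contrast concrete, the following caricature shows the two styles of downstream correction (simplified illustrations of the ratio computations only, not the exact GSPO or TIS objectives; both assume token log-probabilities of shape [batch, seq_len]):

```python
import torch

def tis_style_ratios(train_logprobs: torch.Tensor,
                     rollout_logprobs: torch.Tensor,
                     cap: float = 2.0) -> torch.Tensor:
    """TIS-flavoured: truncate per-token importance ratios so that extreme
    train/rollout mismatches cannot dominate the update."""
    return (train_logprobs - rollout_logprobs).exp().clamp(max=cap)

def gspo_style_ratio(train_logprobs: torch.Tensor,
                     rollout_logprobs: torch.Tensor) -> torch.Tensor:
    """GSPO-flavoured: a single length-normalised, sequence-level importance
    ratio per sequence, shape [batch]."""
    return (train_logprobs - rollout_logprobs).mean(dim=-1).exp()
```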
Replay mechanisms outside MoE RL (rewarded region replay, rollout replay, etc.) use buffers of high-reward trajectories, replaying them under the current policy and correcting for off-policy drift via importance sampling (Li et al., 26 May 2024, Sun et al., 5 Jun 2025). These methods do not address routing nondeterminism; rather, they aim to improve sample efficiency and learning speed by reusing the most informative experiences. Truncation of importance ratios and buffer selection heuristics are further applied to control update variance.
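A minimal sketch of how replayed trajectories typically enter the update, assuming stored per-token log-probabilities and advantages (the function name and truncation constant are assumptions, not any specific paper's objective):

```python
import torch

def replayed_surrogate_loss(new_logprobs: torch.Tensor,
                            old_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            trunc: float = 2.0) -> torch.Tensor:
    """PPO/GRPO-style surrogate on a replayed batch: the importance ratio
    pi_new / pi_old corrects for off-policy drift, and truncating it from
    above keeps stale high-reward samples from dominating the gradient."""
    ratios = (new_logprobs - old_logprobs).exp()
    ratios = torch.clamp(ratios, max=trunc)
    return -(ratios * advantages).mean()
```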
4. Implementation Considerations
The standard R3 approach in MoE RL requires capturing routing masks from the inference engine for each token and replaying them synchronously in the corresponding training step. This demands infrastructure for mask transfer between rollout and training engines (e.g., SGLang to Megatron) and careful handling to ensure per-token alignment. The replayed mask is used in the softmax gating calculation but does not prevent gradient flow through the standard router weights, ensuring that the router remains trainable. Batch-level synchronization is paramount; divergence between masks and input batches can corrupt updates or reintroduce instability.
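As an illustration of the bookkeeping involved, a minimal per-token cache might look like the following; the class, keys, and method names are hypothetical and do not correspond to SGLang or Megatron APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import torch

@dataclass
class RoutingReplayCache:
    """Hypothetical container for shipping routing masks from the rollout
    engine to the trainer: one boolean [num_layers, num_experts] mask per
    generated token, keyed by (sequence id, token position)."""
    masks: Dict[Tuple[int, int], torch.Tensor] = field(default_factory=dict)

    def record(self, seq_id: int, position: int, mask: torch.Tensor) -> None:
        # Called from the rollout engine's MoE layers for each sampled token.
        self.masks[(seq_id, position)] = mask.bool().cpu()

    def fetch(self, seq_id: int, positions: List[int]) -> torch.Tensor:
        # Called by the trainer when rebuilding the batch; a missing key means
        # masks and inputs are misaligned and should fail loudly rather than
        # silently fall back to recomputed TopK routing.
        return torch.stack([self.masks[(seq_id, p)] for p in positions])
```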
For replay-based buffer methods, cyclic storage and threshold-based inclusion criteria are used. Off-policy corrections via truncated importance sampling are required to preserve update fidelity.
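A sketch of such a buffer (capacity, threshold, and names are assumptions); sampled trajectories would then be reweighted with truncated importance ratios as in the surrogate-loss sketch above.

```python
import collections
import random

class RewardedReplayBuffer:
    """Cyclic (FIFO) buffer that only admits trajectories whose return clears
    a threshold; the bounded deque evicts the oldest entries automatically."""

    def __init__(self, capacity: int = 512, reward_threshold: float = 0.0):
        self.buffer = collections.deque(maxlen=capacity)
        self.reward_threshold = reward_threshold

    def maybe_add(self, trajectory, total_reward: float) -> None:
        if total_reward >= self.reward_threshold:
            self.buffer.append((trajectory, total_reward))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```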
5. Empirical Impacts and Benchmark Results
Experimental studies demonstrate that R3 achieves significant performance and stability improvements in MoE RL. KL divergence between training and inference is reduced by almost 50%, reaching parity with dense models on this metric (Ma et al., 13 Oct 2025). The frequency of extreme token probability ratios drops by an order of magnitude, indicating far fewer catastrophic policy mismatches. Training curves show smoother progression, reduced gradient-norm variability, steadier entropy increase, and consistent sequence generation lengths. Notably, R3 outperforms both GSPO and TIS on stability and solution quality metrics and, when combined with GSPO, can yield further improvements (although the same does not hold for TIS).
Replay-based approaches in discrete action policy learning and LLM fine-tuning demonstrate marked gains in sample efficiency, wall-clock time, and convergence rate—between 25% and 65% improvement in RL fine-tuning time has been reported (Sun et al., 5 Jun 2025, Li et al., 26 May 2024). These results are contingent on judicious buffer size selection and truncation criteria.
6. Practical Applications and Broader Implications
R3 enables robust RL in large-scale MoE LLMs—a critical capability for domains where policy collapse or unreliable expert participation can undermine downstream performance (e.g., dialogue generation, agent planning, reasoning tasks). The routing replay principle can generalize to any setting where nondeterministic or stateful decision gates can cause phase-level behavioral drift. It offers a scalable and portable methodology for aligning discrete selection paths across system boundaries and thus stabilizing learning. By extension, replay buffer techniques accelerate on-policy learning in both control and LLM settings, particularly under sparse reward or high-dimensional search constraints.
7. Directions for Future Research
The explicit mask replay scheme introduced by R3 points toward broader research trajectories: the design of synchronization protocols between heterogeneous engines, alignment schemes for other nondeterministic policy modules, and analysis of replay-augmented learning under dynamic environment distributions. A plausible implication is that further extending R3 from expert routing to other forms of stochastic phase-level decisions (such as activation gating, memory retrieval, or modular network selection) may provide similarly strong improvements in RL stability and efficiency. Systematic benchmarking across architectures and RL settings will clarify its generality and inform best practices for integration at scale.