- The paper introduces Rollout Routing Replay (R3) to align training and inference routers in MoE RL, significantly reducing instability.
- It demonstrates reduced KL divergence and fewer extreme token-probability discrepancies between training and inference, validated on models such as Qwen3-30B-A3B.
- The R3 technique stabilizes both on-policy and off-policy RL training without extra computational overhead, paving the way for robust MoE models.
Stabilizing Reinforcement Learning with Mixture-of-Experts by Aligning Routers
The paper "Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers" explores stabilizing reinforcement learning (RL) in Mixture-of-Experts (MoE) models by proposing a technique known as Rollout Routing Replay (R3). This approach resolves significant discrepancies in expert routing behavior between training and inference phases, which can lead to RL training instability and potential collapse.
Introduction
Reinforcement learning is a key means of enhancing LLMs through post-training on complex tasks such as advanced mathematical reasoning and practical coding. A central challenge in this setting is keeping RL training stable, particularly for MoE models, where routing decisions made during training and inference can differ and undermine robustness. Existing remedies, such as importance sampling corrections and specialized kernels, do not address the fundamental source of this instability in MoE models: the routing distribution.
Rollout Routing Replay (R3)
The core contribution of the paper is the Rollout Routing Replay method, which addresses the root cause of routing-induced instability in MoE RL. The R3 technique records routing distributions at inference time and replays them during training, thereby aligning router behavior across both phases. This alignment significantly reduces the KL divergence between training and inference policies and minimizes token discrepancy.
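To make the mechanism concrete, the sketch below shows one way a top-k MoE layer could accept replayed expert indices recorded during rollout instead of recomputing its own selection at training time. The class, argument names, and gating details are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class ReplayableMoE(nn.Module):
    """Top-k MoE layer that can replay expert indices recorded at rollout time."""

    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor, replay_indices: Optional[torch.Tensor] = None):
        # x: [tokens, d_model]; replay_indices: [tokens, top_k], recorded during rollout.
        logits = self.router(x)
        if replay_indices is None:
            # Rollout/inference: pick experts normally and return the choice for recording.
            topk_logits, indices = logits.topk(self.top_k, dim=-1)
        else:
            # Training with routing replay: reuse the rollout's expert choices, but keep
            # the gate weights differentiable through the *current* router logits.
            indices = replay_indices
            topk_logits = logits.gather(-1, indices)
        gates = F.softmax(topk_logits, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out, indices  # rollout records `indices`; training passes them back in
```

In this sketch the gate weights are still computed from the current router logits, so the router continues to receive gradients; only the discrete expert selection is pinned to the rollout's choice.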


Figure 1: Illustration of the Rollout Routing Replay (R3) and its effect on training and inference discrepancies.
Because R3 reuses cached routing masks, it remains compatible with multi-turn dialogue and the prefix caching strategies used in agentic tasks, which is crucial for keeping computation efficient in large-scale models.
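One plausible way to organize such a cache, shown here purely as an assumption-laden sketch, is to key the recorded expert indices by the token prefix they were produced for, so a continuation of a cached prefix only needs fresh routing for its new tokens.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingCache:
    # Maps a token prefix (tuple of token ids) to the expert indices recorded for it,
    # laid out as layers x tokens x top_k (an assumed, simplified layout).
    store: dict = field(default_factory=dict)

    def save(self, prefix: list, routing: list) -> None:
        self.store[tuple(prefix)] = routing

    def lookup(self, prefix: list):
        """Return the routing cached for the longest stored prefix of `prefix`, and its length."""
        best_len, best = 0, None
        for key, routing in self.store.items():
            if len(key) > best_len and len(key) <= len(prefix) and tuple(prefix[: len(key)]) == key:
                best_len, best = len(key), routing
        return best, best_len  # tokens beyond best_len get fresh routing during rollout
```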
Training-Inference Discrepancies in MoE Models
The paper provides an in-depth analysis of discrepancies between training and inference in MoE frameworks. These discrepancies, largely attributable to dynamic expert selection by routers, result in output variability that exceeds that seen in dense models.
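The following toy example (with made-up numbers) illustrates the underlying mechanism: a small numerical difference in router logits between the training and inference stacks can change the top-k expert selection, so the same token is processed by different parameters in the two phases.

```python
import torch

# Router logits as seen by the inference engine (values are invented for illustration).
logits_infer = torch.tensor([2.30, 2.29, 2.27, -1.00])
# The training stack computes slightly different logits, e.g. from kernel/precision differences.
logits_train = logits_infer + torch.tensor([0.00, 0.00, 0.04, 0.00])

top2_infer = logits_infer.topk(2).indices.tolist()  # [0, 1]
top2_train = logits_train.topk(2).indices.tolist()  # [2, 0] -- a different expert set
print(top2_infer, top2_train)
```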



Figure 2: Analysis of training-inference collapse showing the estimated KL divergence and extreme token distribution.
Through empirical analysis using models such as Qwen3-30B-A3B, the researchers demonstrate that MoE models exhibit larger training-inference discrepancies than dense models due to routing inconsistencies. This phenomenon is further quantified using KL divergence and extreme token distribution metrics.
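As a rough illustration of this kind of diagnostic, the snippet below computes a per-token KL estimate (using the common k3-style estimator) and the fraction of tokens whose training/inference probability ratio is extreme; the estimator choice and threshold are assumptions rather than the paper's exact metrics.

```python
import math
import torch


def discrepancy_metrics(train_logprobs: torch.Tensor,
                        rollout_logprobs: torch.Tensor,
                        extreme_ratio: float = 2.0):
    """Both inputs hold log-probs of the sampled tokens under each stack, shape [seq_len]."""
    log_ratio = train_logprobs - rollout_logprobs            # log pi_train(t) / pi_rollout(t)
    # k3-style estimator of KL(pi_rollout || pi_train), averaged over the sequence.
    kl_estimate = (log_ratio.exp() - 1.0 - log_ratio).mean()
    # Fraction of tokens whose probability ratio deviates beyond the chosen threshold.
    extreme_frac = (log_ratio.abs() > math.log(extreme_ratio)).float().mean()
    return kl_estimate.item(), extreme_frac.item()
```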
Experimental Results
To gauge R3’s efficacy, the authors integrated the method into a range of RL training setups with MoE models. Key findings:
- R3 significantly stabilized training processes, preventing model collapse and improving overall performance when compared with existing methods such as GSPO and TIS.
- It proved robust in both on-policy and mini-batch style off-policy RL settings, surpassing the stability of dense-model training without added computational overhead (see the sketch after this list for where replay fits in such a loop).
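The sketch below indicates where routing replay would sit in a mini-batch style update loop; the rollout engine, policy interface (including the `replay_routing` argument), and loss function are injected placeholders, not interfaces defined by the paper.

```python
def r3_update(policy, prompts, rollout_fn, loss_fn, optimizer, mini_epochs: int = 2):
    """rollout_fn(policy, prompts) -> (responses, rollout_logprobs, routing_masks, rewards)."""
    # 1) Rollout with the inference engine, recording each MoE layer's expert choices.
    responses, rollout_logprobs, routing_masks, rewards = rollout_fn(policy, prompts)

    # 2) Reuse the same batch for several gradient steps (mini-batch / off-policy style).
    #    Every training forward pass replays the recorded masks, so the experts used in
    #    training match those used at inference even as the router weights drift.
    for _ in range(mini_epochs):
        train_logprobs = policy.logprobs(prompts, responses, replay_routing=routing_masks)
        loss = loss_fn(train_logprobs, rollout_logprobs, rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```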


Figure 3: Training dynamics of the Qwen3-30B-A3B-Base model showing response length, gradient norm, entropy, and average validation score.
The reduction in training-inference discrepancies under R3 was also confirmed visually via scatter plots and extreme-token distribution analyses, which show roughly an order-of-magnitude decrease in routing disparities.
Conclusion
The presented work identifies routing discrepancies as the primary instability source in MoE RL frameworks and introduces R3 as an effective countermeasure by reusing inference routing information during training. This technique mitigates instability issues and enhances training performance, offering a viable pathway for achieving robustness in MoE-based reinforcement learning systems.
The implications of this research extend beyond immediate performance gains, paving the way for more reliable deployment of LLMs in complex environments. Future work could explore further optimizations and applications of the R3 technique to other architectures and domains, potentially broadening its scope and utility.