- The paper introduces RSPO, a router-aware RL algorithm that mitigates router fluctuation and reward collapse in MoE models.
- It leverages a router shift ratio and geometric mean aggregation to stabilize token-level importance sampling across policy updates.
- Experimental results show that RSPO outperforms GRPO, GSPO, and GMPO with consistent gains on diverse mathematical reasoning benchmarks.
Stable and Effective Reinforcement Learning for Mixture-of-Experts
Introduction
This paper addresses the instability and inefficiency of reinforcement learning (RL) in Mixture-of-Experts (MoE) architectures, a critical challenge for scaling LLMs with sparse activation. While RL has driven significant advances in LLM reasoning and generation, most prior work focuses on dense models, leaving MoE-specific RL training underexplored. The central issue is router fluctuation: the set of experts selected for a given token can vary substantially across policy updates, leading to high variance in importance sampling (IS) weights and frequent reward collapse. Existing stabilization strategies, such as router freezing and routing replay, either restrict model adaptability or degrade exploration, failing to resolve the underlying instability. The paper introduces Router-Shift Policy Optimization (RSPO), a router-aware RL algorithm that quantifies routing deviation and adaptively reweights updates, achieving stable and efficient training for MoE models.
Figure 1: Router fluctuation in off-policy training, illustrating the instability in expert selection across policy updates.
GRPO, GSPO, and GMPO
Group Relative Policy Optimization (GRPO) extends PPO by removing the value model and using group-relative advantage estimation, but it suffers from high computational cost and from a variance mismatch between sequence-level rewards and token-level IS ratios, a problem that is amplified in MoE models. Group Sequence Policy Optimization (GSPO) and Geometric-Mean Policy Optimization (GMPO) address this misalignment by aggregating IS ratios with a geometric mean at the sequence level, which is more robust to extreme token-level ratios and better suited to MoE architectures.
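For a response y_i of length |y_i|, sequence-level aggregation replaces the per-token ratios with their geometric mean. A minimal sketch of this quantity, written in our own notation following the GSPO-style formulation (the exact clipping and normalization details are the respective papers'):

```latex
s_i(\theta)
= \left( \prod_{t=1}^{|y_i|}
    \frac{\pi_\theta\!\left(y_{i,t} \mid x, y_{i,<t}\right)}
         {\pi_{\theta_{\mathrm{old}}}\!\left(y_{i,t} \mid x, y_{i,<t}\right)} \right)^{1/|y_i|}
= \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
    \log \frac{\pi_\theta\!\left(y_{i,t} \mid x, y_{i,<t}\right)}
              {\pi_{\theta_{\mathrm{old}}}\!\left(y_{i,t} \mid x, y_{i,<t}\right)} \right)
```

Because the average is taken in log space, a single token whose expert routing changed cannot dominate the sequence weight: for token ratios (1, 1, 1, 50), the arithmetic mean is 13.25 while the geometric mean is 50^(1/4) ≈ 2.66.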
Methodology: Router-Shift Policy Optimization (RSPO)
RSPO introduces a router shift ratio, γi,t, which measures the deviation in router distributions between the current and previous policies for each token. This ratio is computed by averaging the log-probability differences of the top-K experts across all layers, then exponentiating and applying a floor γmin to prevent over-suppression. The router shift ratio is used to reweight token-level IS ratios after clipping, down-weighting tokens with excessive routing drift while preserving router flexibility. RSPO aggregates IS ratios using a geometric mean at the sequence level, further stabilizing training.
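A plausible formalization of the router shift ratio, written in our own notation (the per-layer and per-expert normalization, and which policy's top-K set is used, are assumptions inferred from the description above):

```latex
\gamma_{i,t}
= \max\!\left( \gamma_{\min},\;
  \exp\!\left( \frac{1}{L K} \sum_{l=1}^{L} \sum_{k \in \mathrm{TopK}_{i,t}^{(l)}}
    \left[ \log p_{\theta}^{(l)}\!\left(k \mid h_{i,t}^{(l)}\right)
         - \log p_{\theta_{\mathrm{old}}}^{(l)}\!\left(k \mid h_{i,t}^{(l)}\right) \right] \right) \right)
```

Here L is the number of MoE layers, K the number of selected experts per layer, p^(l) the router distribution at layer l, and h^(l) the token's hidden state entering that layer. Values near 1 indicate stable routing; values pushed toward γmin flag tokens whose expert assignment has drifted and whose IS weights are therefore down-weighted.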

Figure 2: Ablation of γmin, demonstrating the impact of the router shift ratio floor on training stability and performance.
The PyTorch implementation involves three steps: computing the router shift ratio, clipping extreme values, and applying the ratio to smooth the IS weights. Gradients are stopped through γi,t to prevent variance amplification.
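A minimal PyTorch sketch of these three steps, assuming the router log-probabilities of the selected experts have already been gathered into tensors of shape [batch, seq, layers, K]; all names, shapes, and the upper clip on γ are our assumptions rather than the paper's released code:

```python
import torch

def rspo_token_weights(logp_new, logp_old,                 # [B, T] token log-probs (current / old policy)
                       router_logp_new, router_logp_old,   # [B, T, L, K] router log-probs of selected experts
                       gamma_min=0.5, gamma_max=1.0, clip_eps=0.2):
    """Sketch of RSPO's router-shift-weighted importance sampling (assumed shapes and defaults)."""
    # Step 1: router shift ratio -- average the router log-prob differences over layers
    # and top-K experts, exponentiate, and stop gradients so gamma acts as a pure weight.
    shift = (router_logp_new - router_logp_old).mean(dim=(-1, -2))    # [B, T]
    gamma = shift.exp().detach()

    # Step 2: clip extreme values and apply the floor gamma_min so tokens are never
    # suppressed to zero (over-suppression would erase their learning signal).
    gamma = gamma.clamp(min=gamma_min, max=gamma_max)

    # Step 3: PPO-style clipped token-level IS ratio, then smoothed by gamma.
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    return gamma * clipped                                            # [B, T] reweighted IS weights
```

In a full objective these token weights would be combined with group-relative advantages and aggregated into a sequence-level weight via a geometric mean (mean of logs, then exp); the `.detach()` on gamma mirrors the ablation below, where backpropagating through the shift ratio leads to early collapse.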
Experimental Results
RSPO is evaluated on Qwen2.5 and Qwen3-30B-A3B models across five mathematical reasoning benchmarks (AIME24, AMC23, MATH500, Minerva, OlympiadBench) using the Pass@1 metric. RSPO consistently outperforms GRPO, GSPO, and GMPO, with particularly strong gains on Minerva and OlympiadBench. Notably, GRPO exhibits reward collapse and high volatility, while RSPO maintains stable and superior validation scores throughout training.

Figure 3: Validation score of GRPO, highlighting instability and reward collapse compared to RSPO and other baselines.
Ablation studies show that backpropagating through γ leads to early collapse, while stopping gradients yields smooth optimization. Setting γmin too low suppresses learning signals, while moderate values maintain stability. Alternative router stabilization strategies (freezing, copying logits, reusing expert indices) fail to improve stability, underscoring the necessity of router-aware reweighting.
RSPO is compatible with other RL algorithms; integrating router shift with GSPO or GRPO further improves stability and performance, demonstrating its generality.
Training Stability Analysis
RSPO maintains higher token entropy during training, preserving expert diversity and preventing premature collapse. The ratio of non-shifted routing increases as training progresses, indicating that the policy naturally aligns with prior routing decisions and reduces the need for shift corrections.
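Both quantities are straightforward to monitor during training. A short sketch with assumed tensor names (the paper may measure "non-shifted routing" by comparing expert index sets directly; here it is approximated by γ ≈ 1):

```python
import torch
import torch.nn.functional as F

def training_diagnostics(logits, gamma, shift_tol=1e-3):
    """Token entropy and non-shifted-routing ratio (assumed shapes: logits [B, T, V], gamma [B, T])."""
    # Mean per-token entropy of the policy distribution; higher values indicate that
    # the policy (and the expert mixture behind it) has not collapsed prematurely.
    logp = F.log_softmax(logits.float(), dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1).mean()

    # Fraction of tokens whose router shift ratio is numerically 1, i.e. routing matches
    # the previous policy and no down-weighting is applied.
    non_shifted_ratio = ((gamma - 1.0).abs() < shift_tol).float().mean()
    return token_entropy.item(), non_shifted_ratio.item()
```

A rising non-shifted ratio together with sustained entropy is the healthy-training signature described above.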

Figure 4: Entropy during training, showing that RSPO preserves higher token entropy and expert diversity compared to baselines.
Implications and Future Directions
RSPO provides a principled solution to the instability of RL in MoE models by explicitly accounting for router shift and leveraging geometric mean aggregation. This approach enables reliable deployment of large-scale expert models for complex reasoning tasks. The findings suggest that router-aware reweighting is essential for stable off-policy RL in sparse architectures, and that geometric mean aggregation should be preferred over arithmetic mean for IS ratios.
Future research may explore adaptive mechanisms for router shift ratio computation, integration with differentiable routing techniques, and extension to multimodal or hierarchical MoE architectures. Understanding the interplay between routing dynamics, gradient variance, and RL credit assignment remains an open direction, with potential impact on scaling LLMs to even larger capacities.
Conclusion
RSPO stabilizes RL training for MoE models by combining sequence-level importance sampling with router shift-aware reweighting, outperforming existing baselines in both stability and effectiveness. The method maintains high token entropy, mitigates reward collapse, and generalizes across tasks and algorithms. RSPO represents a significant advance in RL algorithmic design for sparse expert architectures, facilitating robust and efficient training of large-scale reasoning models.