IcePop: Stabilizing RL in MoE Models
- IcePop is an algorithmic mechanism that stabilizes reinforcement learning in large-scale mixture-of-experts models by masking tokens whose training and inference probabilities diverge.
- It employs token-wise masking based on acceptable probability ratios, effectively discarding unstable RL updates to prevent gradient divergence.
- IcePop’s integration in the Ring-1T model demonstrates significant empirical gains, yielding improved chain-of-thought reasoning and robust benchmark scores.
IcePop is an algorithmic mechanism for stabilizing reinforcement learning (RL) in large-scale Mixture-of-Experts (MoE) LLMs, specifically designed to address the compounding instability caused by discrepancies between training and inference probability distributions in token-level decision sequences. Introduced in the context of training the Ring-1T trillion-parameter model, IcePop operates by applying token-wise masking and selective gradient clipping, discarding RL updates where local probability ratios between engines are unacceptably distorted. By enforcing per-token calibration and pruning noisy signals, IcePop achieves scalable, high-stability RL training, thereby enabling unprecedented model performance metrics in chain-of-thought (CoT) reasoning at trillion-parameter scale (Team et al., 21 Oct 2025).
1. Motivation and Context: RL Instability at Scale
Large Mixture-of-Experts architectures such as Ring-1T feature a dynamically activated ensemble of expert subnetworks during inference and training, resulting in subtle but persistent mismatches between the probabilities computed during gradient updates ($\pi_\theta^{\text{train}}$) and those used for policy evaluation ($\pi_\theta^{\text{infer}}$). For multi-step chain-of-thought rollouts, small discrepancies compound, introducing high-variance gradients and destabilizing RL convergence. Policy-gradient and importance-sampling methods traditionally rely on well-matched sampling distributions, and their breakdown at scale, particularly under long rollouts, is precisely the challenge IcePop addresses.
2. Algorithmic Details: Token-Level Discrepancy Masking and Clipping
The key principle in IcePop is the per-token masking of gradients, predicated on an “acceptable” probability ratio interval for each output token. Defining the per-token train-inference ratio

$$k_t = \frac{\pi_\theta^{\text{train}}(y_t \mid y_{<t}, x)}{\pi_\theta^{\text{infer}}(y_t \mid y_{<t}, x)},$$

IcePop applies a masking function such that

$$\mathcal{M}(k_t) = \begin{cases} 1, & \alpha \le k_t \le \beta \\ 0, & \text{otherwise}, \end{cases}$$

with bounds $\alpha$ and $\beta$ encoded in the RL objective.

The algorithm’s objective function is

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \mathcal{M}(k_t)\,\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_t\Big)\right],$$

where $\operatorname{clip}(\cdot)$ is a gradient clipping function, $r_t(\theta) = \pi_\theta(y_t \mid y_{<t}, x)/\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)$ is the importance ratio as usual, and $\hat{A}_t$ is the advantage estimate. Masked tokens contribute no gradient, suppressing the noisy signal from extreme mismatches.

Gradient updates are thus calculated only over the surviving tokens:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \mathcal{M}(k_t)\,\nabla_\theta \min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_t\Big)\right].$$
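The following PyTorch sketch illustrates this token-wise masking. The tensor names, default bounds, and the clipped-surrogate form are illustrative assumptions for exposition, not the Ring-1T implementation.

```python
import torch

def icepop_masked_loss(train_logprobs, infer_logprobs, old_logprobs,
                       advantages, alpha=0.5, beta=2.0, clip_eps=0.2):
    """Token-wise masked policy-gradient loss (illustrative sketch).

    All inputs are 1-D tensors over the tokens of a rollout: log-probs of the
    sampled tokens under the training engine, the inference engine, and the
    behavior policy, plus per-token advantage estimates. The bounds alpha/beta
    and clip_eps are assumed values, not published ones.
    """
    # Per-token train/inference discrepancy ratio k_t.
    k = torch.exp(train_logprobs - infer_logprobs)
    # Binary mask M(k_t): keep only tokens whose ratio lies inside [alpha, beta].
    mask = ((k >= alpha) & (k <= beta)).float()

    # Standard clipped importance-sampling surrogate (PPO/GRPO-style).
    ratio = torch.exp(train_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)

    # Masked tokens contribute neither loss nor gradient.
    denom = mask.sum().clamp(min=1.0)
    return -(mask * surrogate).sum() / denom
```

Because the mask multiplies the surrogate before the backward pass, out-of-range tokens are removed from the gradient entirely rather than merely down-weighted.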
3. Theoretical Justification: Discrepancy Compounding and Gradient Stability
The effect of IcePop is formalized via an analysis of discrepancy compounding. In the absence of per-token calibration, the deviation between $\pi_\theta^{\text{train}}$ and $\pi_\theta^{\text{infer}}$ increases multiplicatively with token position, rapidly causing gradient explosions or divergence. IcePop’s mechanism ensures that the effective discrepancy remains upper-bounded, as only those tokens compatible with the constrained ratio contribute. This targeted suppression is proven in the paper to prevent train-inference misalignment from propagating, thereby maintaining the stability of long-horizon RL updates, a critical capability at the trillion-parameter regime.
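A short derivation sketch makes the compounding argument concrete; it assumes the simplified setting where the sequence-level discrepancy factorizes into the per-token ratios $k_t$ defined above.

```latex
% Sequence-level train/inference discrepancy factorizes into per-token ratios:
\[
\frac{\pi_\theta^{\text{train}}(y_{1:T} \mid x)}{\pi_\theta^{\text{infer}}(y_{1:T} \mid x)}
  = \prod_{t=1}^{T} k_t .
\]
% Without calibration, even a mild systematic bias k_t \approx 1 + \delta
% compounds roughly as (1 + \delta)^T, i.e. exponentially in rollout length.
% With IcePop's mask, every token that contributes a gradient satisfies
% \alpha \le k_t \le \beta, so over the unmasked set \mathcal{U}
\[
\alpha^{|\mathcal{U}|} \;\le\; \prod_{t \in \mathcal{U}} k_t \;\le\; \beta^{|\mathcal{U}|},
\qquad \mathcal{U} = \{\, t : \alpha \le k_t \le \beta \,\},
\]
% so no single extreme ratio can dominate the update, and the compounded
% factor is controlled by the tightness of [\alpha, \beta].
```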
4. Integration with RL Training Systems and Empirical Effects
IcePop is embedded directly in the on-policy RL pipeline of Ring-1T. During each rollout, the RL trainer computes gradients only on acceptably matched tokens across the trajectory. Ablation studies demonstrate that use of IcePop yields bounded gradient norms and KL divergence throughout training, even as baseline RL methods without IcePop experience gradient instability beyond several dozen steps. Experimentally, IcePop is shown to maintain exploration via more diffuse, less confident log-probabilities for tokens, supporting richer hypothesis space traversal.
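A minimal sketch of how such a pipeline might wire the mask into one on-policy update is given below; the `policy.logprobs` call and the rollout fields are hypothetical, and the diagnostics simply mirror the quantities the ablations track (gradient norm and a train-inference divergence proxy).

```python
import torch

def rl_step(policy, optimizer, rollout, alpha=0.5, beta=2.0):
    """One on-policy update with IcePop-style masking (illustrative only)."""
    # Recompute log-probs of the sampled tokens under the training engine.
    train_logprobs = policy.logprobs(rollout.tokens)      # hypothetical API

    loss = icepop_masked_loss(
        train_logprobs,
        rollout.infer_logprobs,   # recorded by the inference/serving engine
        rollout.old_logprobs,     # behavior policy at rollout time
        rollout.advantages,
        alpha=alpha, beta=beta,
    )

    optimizer.zero_grad()
    loss.backward()
    # Diagnostics emphasized in the ablations: gradient norm and a
    # train-inference divergence proxy (mean log-ratio over sampled tokens).
    grad_norm = torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    divergence = (train_logprobs - rollout.infer_logprobs).mean()
    optimizer.step()
    return {"loss": loss.item(),
            "grad_norm": float(grad_norm),
            "train_infer_logratio": divergence.item()}
```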
5. Performance Impact and Benchmark Results
The Ring-1T model equipped with IcePop secures high performance on a range of reasoning benchmarks—93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, 55.94 on ARC-AGI-v1, and a silver medal-level result on the IMO-2025. These scores are strongly associated with RL stability improvements contributed by IcePop. The paper reports that in the absence of IcePop-style masking and clipping, model scores plateau or degrade as RL divergence increases, substantiating the critical role of token-level discrepancy control. IcePop’s impact thus extends to both the scalability and reliability of open-source reasoning systems.
6. Implementation and Practical Considerations
Deploying IcePop within large MoE systems requires calibrated selection of the ratio bounds ($\alpha$, $\beta$), empirical monitoring of log-probabilities, and careful integration with standard RL objectives (e.g., GRPO). The masking is token-local and computationally tractable, involving only per-token ratio checks applied in the gradient accumulation loop. While IcePop reduces noisy updates, a plausible implication is that overly aggressive masking could, if misconfigured, suppress rare but informative signals, so trade-off analysis is vital in practical deployments.
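One plausible way to monitor that trade-off during bound calibration is to track the fraction of masked tokens and the tails of the ratio distribution; the names and defaults below are a sketch rather than a published recipe.

```python
import torch

def mask_statistics(train_logprobs, infer_logprobs, alpha=0.5, beta=2.0):
    """Diagnostics for calibrating (alpha, beta); names and defaults are assumed."""
    k = torch.exp(train_logprobs - infer_logprobs)
    inside = (k >= alpha) & (k <= beta)
    return {
        "masked_fraction": 1.0 - inside.float().mean().item(),  # share of signal discarded
        "ratio_p05": k.quantile(0.05).item(),                   # lower tail of k_t
        "ratio_p95": k.quantile(0.95).item(),                   # upper tail of k_t
    }
```

A rising masked fraction signals either growing train-inference drift or bounds set too tightly, which makes it a natural quantity to alert on during long runs.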
7. Significance for Scalable Reasoning Intelligence
IcePop constitutes a principal innovation in the RL algorithm design required for the trillion-scale era. By regularizing the RL signal through strict per-token calibration, IcePop ensures that model scaling, in both architecture size and rollout length, remains tractable and stable. The approach sets a new technical baseline for open-source RL training with extreme-scale MoE models and is directly credited with enabling milestone benchmark scores and broad accessibility of highly capable reasoning engines (Team et al., 21 Oct 2025).