IcePop: Stabilizing RL in MoE Models

Updated 23 October 2025
  • IcePop is an algorithmic mechanism that stabilizes reinforcement learning in large-scale mixture-of-experts models by masking tokens whose training and inference probabilities diverge.
  • It employs token-wise masking based on acceptable probability ratios, effectively discarding unstable RL updates to prevent gradient divergence.
  • IcePop’s integration in the Ring-1T model demonstrates significant empirical gains, yielding improved chain-of-thought reasoning and robust benchmark scores.

IcePop is an algorithmic mechanism for stabilizing reinforcement learning (RL) in large-scale Mixture-of-Experts (MoE) LLMs, specifically designed to address the compounding instability caused by discrepancies between training and inference probability distributions in token-level decision sequences. Introduced in the context of training the Ring-1T trillion-parameter model, IcePop operates by applying token-wise masking and selective gradient clipping, discarding RL updates where the local probability ratio between the two engines is unacceptably distorted. By enforcing per-token calibration and pruning noisy signals, IcePop achieves scalable, high-stability RL training, enabling strong chain-of-thought (CoT) reasoning performance at trillion-parameter scale (Team et al., 21 Oct 2025).

1. Motivation and Context: RL Instability at Scale

Large Mixture-of-Experts architectures such as Ring-1T dynamically activate an ensemble of expert subnetworks during training and inference, resulting in subtle but persistent mismatches between the probabilities computed during gradient updates ($\pi_\text{train}$) and those produced by the inference engine during rollout generation ($\pi_\text{infer}$). For multi-step chain-of-thought rollouts, small discrepancies compound, introducing high-variance gradients and destabilizing RL convergence. Policy-gradient and importance-sampling methods traditionally rely on well-matched sampling distributions, and their breakdown at scale, particularly under long rollouts, is the central challenge IcePop addresses.

2. Algorithmic Details: Token-Level Discrepancy Masking and Clipping

The key principle in IcePop is the per-token masking of gradients, predicated on an “acceptable” probability ratio interval for each output token. Defining

$$k_{i,t} = \frac{\pi_\text{train}(y_{i,t} \mid \ldots)}{\pi_\text{infer}(y_{i,t} \mid \ldots)},$$

IcePop applies a masking function $\mathcal{M}(k)$ such that

$$\mathcal{M}(k) = \begin{cases} k, & \text{if } k \in [\alpha, \beta] \\ 0, & \text{otherwise} \end{cases}$$

with typical bounds $\alpha = 0.5$ and $\beta = 5.0$ encoded in the RL objective.
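
To make the gating concrete, the following is a minimal PyTorch sketch of the token-wise mask; the function name, tensor shapes, and default bounds simply mirror the values quoted above and are not drawn from the Ring-1T codebase.

```python
import torch

def icepop_mask(k: torch.Tensor, alpha: float = 0.5, beta: float = 5.0) -> torch.Tensor:
    """Token-wise mask M(k): keep the ratio where it lies in [alpha, beta], zero it elsewhere.

    k holds per-token train/inference probability ratios, e.g. with shape (batch, seq_len).
    """
    inside = (k >= alpha) & (k <= beta)
    return torch.where(inside, k, torch.zeros_like(k))
```

In practice the ratio would typically be formed from log-probabilities, e.g. k = torch.exp(logp_train - logp_infer), so the gate reduces to a cheap element-wise comparison.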

The algorithm's objective function is:

$$\mathcal{J}_\text{IcePop}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\} \sim \pi_\text{infer}} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \left\{ \mathcal{M}(k_{i,t}, \alpha, \beta) \cdot f_\text{clip}(r_{i,t} \hat{A}_{i,t}) - \gamma\, D_\text{KL}(\pi_\theta \,\|\, \pi_\text{ref}) \right\} \right]$$

where $f_\text{clip}(\cdot)$ is a gradient clipping function, $r_{i,t}$ is the usual importance ratio, and $\hat{A}_{i,t}$ is the advantage estimate. Masked tokens contribute no gradient, suppressing the noisy signal from extreme mismatches.

Gradient updates are thus calculated as

$$\nabla_\theta \mathcal{J}_\text{IcePop}(\theta) \sim \mathbb{E}_{a \sim \pi_\text{infer}} \left[ \mathcal{M}\!\left( \frac{\pi_\text{train}(a;\theta_\text{old})}{\pi_\text{infer}(a;\theta_\text{old})}, \alpha, \beta \right) \cdot \nabla_\theta \log \pi_\text{train}(a;\theta) \cdot \hat{A} \cdot r(a) \right]$$
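
The PyTorch sketch below assembles the masked per-token objective along the lines of the formula above. It is a simplified illustration rather than the paper's implementation: the clipping range eps, the KL coefficient gamma, the detach placements, and the choice of the current-versus-old training policy for the importance ratio $r_{i,t}$ are all assumptions made for clarity.

```python
import torch

def icepop_loss(logp_train, logp_old, logp_infer, logp_ref, advantages,
                alpha=0.5, beta=5.0, eps=0.2, gamma=0.01):
    """Masked token-level surrogate loss (simplified sketch, not the reference code).

    All inputs are per-token tensors of shape (batch, seq_len): log-probabilities under
    the current training policy, the old training policy, the inference engine, and a
    frozen reference policy, plus the advantage estimates.
    """
    # k_{i,t}: train/inference discrepancy, detached and used only as a gate/weight.
    k = torch.exp(logp_train - logp_infer).detach()
    weight = torch.where((k >= alpha) & (k <= beta), k, torch.zeros_like(k))

    # Clipped surrogate on the importance ratio r_{i,t} (assumed PPO/GRPO-style:
    # current training policy versus the old training policy).
    r = torch.exp(logp_train - logp_old.detach())
    surrogate = torch.minimum(r * advantages, torch.clamp(r, 1 - eps, 1 + eps) * advantages)

    # Simple per-token estimate of the KL penalty against the reference policy.
    kl = logp_train - logp_ref.detach()

    per_token = weight * surrogate - gamma * kl
    # Tokens outside [alpha, beta] have weight 0 and contribute no policy gradient.
    return -per_token.mean()
```

Note that the flat mean collapses the per-sequence $1/|y_i|$ normalization and the group averaging over $G$ in the formula; a faithful implementation would also respect padding masks and rollout group sizes.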

3. Theoretical Justification: Discrepancy Compounding and Gradient Stability

The effect of IcePop is formalized via an analysis of discrepancy compounding. In the absence of per-token calibration, the deviation $\delta_t$ between $\pi_\text{infer}$ and $\pi_\text{train}$ increases multiplicatively with token position, rapidly causing gradient explosions or divergence. IcePop's mechanism ensures that the effective discrepancy remains upper-bounded, as only tokens whose ratio falls within $[\alpha, \beta]$ contribute. This targeted suppression is shown in the paper to prevent train-inference misalignment from propagating, thereby maintaining the stability of long-horizon RL updates, a critical capability in the trillion-parameter regime.
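
A toy calculation (plain Python, with illustrative numbers only) makes the compounding argument concrete: an uncorrected per-token mismatch accumulates multiplicatively over a long rollout, whereas dropping tokens whose ratio leaves $[\alpha, \beta]$ keeps every surviving contribution bounded.

```python
# Illustrative numbers only; nothing here is measured from Ring-1T training.
per_token_ratio = 1.02        # a persistent 2% train/inference mismatch per token
rollout_length = 1000         # tokens in a long chain-of-thought rollout

cumulative = per_token_ratio ** rollout_length
print(f"Uncorrected cumulative ratio after {rollout_length} tokens: {cumulative:.2e}")
# ~4e+08: a sequence-level correction factor explodes, yielding high-variance gradients.

# Token-level masking instead inspects each ratio independently and discards outliers,
# so no surviving token contributes a weight outside [alpha, beta].
alpha, beta = 0.5, 5.0
ratios = [1.02, 7.3, 0.98, 0.1, 1.5]
kept = [k for k in ratios if alpha <= k <= beta]
print("Kept per-token ratios:", kept)   # the 7.3 and 0.1 outliers are dropped
```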

4. Integration with RL Training Systems and Empirical Effects

IcePop is embedded directly in the on-policy RL pipeline of Ring-1T. During each rollout, the RL trainer computes gradients only on acceptably matched tokens across the trajectory. Ablation studies demonstrate that IcePop yields bounded gradient norms and KL divergence throughout training, whereas baseline RL methods without it exhibit gradient instability after several dozen steps. Experimentally, IcePop also maintains exploration via more diffuse, less over-confident token log-probabilities, supporting richer traversal of the hypothesis space.
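
As a rough sketch of where the mask sits in an update step, the snippet below wires the loss sketched in Section 2 into a generic PyTorch optimization step and logs the gradient norm; policy.log_probs, the batch fields, and the clipping threshold are hypothetical placeholders rather than components of the Ring-1T pipeline.

```python
import torch

def training_step(policy, optimizer, batch, max_grad_norm=1.0):
    """One on-policy update in which only acceptably matched tokens carry gradient.

    policy.log_probs is a hypothetical API returning per-token log-probabilities;
    batch is assumed to hold rollout tensors gathered from the inference engine.
    """
    logp_train = policy.log_probs(batch["tokens"])   # hypothetical call
    loss = icepop_loss(                              # the masked objective sketched earlier
        logp_train,
        batch["logp_old"],
        batch["logp_infer"],
        batch["logp_ref"],
        batch["advantages"],
    )
    optimizer.zero_grad()
    loss.backward()
    # Gradient-norm clipping plus logging: the norm staying bounded over training is
    # the stability signal described by the ablations above.
    grad_norm = torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item(), float(grad_norm)
```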

5. Performance Impact and Benchmark Results

The Ring-1T model equipped with IcePop secures high performance on a range of reasoning benchmarks—93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, 55.94 on ARC-AGI-v1, and a silver medal-level result on the IMO-2025. These scores are strongly associated with RL stability improvements contributed by IcePop. The paper reports that in the absence of IcePop-style masking and clipping, model scores plateau or degrade as RL divergence increases, substantiating the critical role of token-level discrepancy control. IcePop’s impact thus extends to both the scalability and reliability of open-source reasoning systems.

6. Implementation and Practical Considerations

Deploying IcePop within large MoE systems requires calibrated selection of the ratio bounds ($\alpha$, $\beta$), empirical monitoring of log-probabilities, and careful integration with standard RL objectives (e.g., GRPO). The masking is token-local and computationally tractable, involving only per-token ratio checks applied in the gradient accumulation loop. While IcePop reduces noisy updates, a plausible implication is that overly aggressive masking could, if misconfigured, suppress rare but informative signal; trade-off analysis is therefore vital in practical deployments.
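
One way to approach bound selection and log-probability monitoring in practice is to inspect the empirical distribution of the per-token ratio; the quantile-based helper below is an illustrative heuristic, not a procedure specified in the paper.

```python
import torch

def ratio_diagnostics(logp_train, logp_infer, alpha=0.5, beta=5.0, tail=0.01):
    """Summarize the train/inference ratio distribution for bound calibration.

    Reports the fraction of tokens the current bounds would mask, together with tail
    quantiles of the observed ratios, which can guide adjustments to (alpha, beta)
    so that only genuinely extreme tokens are discarded.
    """
    k = torch.exp(logp_train - logp_infer).flatten()
    masked_fraction = ((k < alpha) | (k > beta)).float().mean().item()
    return {
        "masked_fraction": masked_fraction,
        "ratio_p_low": torch.quantile(k, tail).item(),
        "ratio_p_high": torch.quantile(k, 1.0 - tail).item(),
        "ratio_max": k.max().item(),
    }
```

Watching the masked fraction over training, for example keeping it in the low single-digit percent range, is one plausible guard against the over-masking trade-off noted above.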

7. Significance for Scalable Reasoning Intelligence

IcePop constitutes a principal algorithmic innovation for RL at trillion-parameter scale. By regularizing the RL signal through strict per-token calibration, IcePop keeps model scaling, in both architecture size and rollout length, tractable and stable. The approach sets a new technical baseline for open-source RL training with extreme-scale MoE models and is directly credited with enabling milestone benchmark scores and broad accessibility of highly capable reasoning engines (Team et al., 21 Oct 2025).

References (1)