CISPO: Clipped Importance Sampling RL

Updated 27 January 2026
  • CISPO is a reinforcement learning algorithm that clips token-level importance sampling weights to bound variance and retain learning signals for every token.
  • It modifies the off-policy REINFORCE objective to prevent token dropout, thereby enhancing stability and sample efficiency compared to methods like PPO.
  • Empirical results demonstrate that CISPO converges faster and achieves higher accuracy in large-scale autoregressive models by preserving critical token updates.

CISPO (Clipped Importance Sampling Policy Optimization) is a reinforcement learning (RL) algorithm developed to improve stability and computational efficiency in large-scale, off-policy policy-gradient training, particularly for autoregressive LLMs such as MiniMax-M1. CISPO modifies the standard off-policy REINFORCE objective by clipping the importance sampling (IS) weight applied to each token’s likelihood gradient, so that every token contributes to the policy update while update variance stays bounded. CISPO is reported to achieve smoother training trajectories, improved sample efficiency, and better empirical performance than contemporary approaches such as PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) (MiniMax et al., 16 Jun 2025).

1. Mathematical Formulation and Learning Framework

The CISPO algorithm operates in a batched RL setting, using prompts $q \in \mathcal{D}$ and an autoregressive policy $\pi_\theta(\cdot|q)$ to generate $G$ responses per prompt. Each response $o_i = (o_{i,1}, \ldots, o_{i,T})$ receives a scalar reward $R_i$. Group-relative token-level advantage estimates are constructed as

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}_{j=1\ldots G}(R_j)}{\operatorname{std}_{j=1\ldots G}(R_j)}\,,$$

so that every token of a response shares that response's normalized reward, and tokens from higher-reward responses receive higher advantage.
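The normalization above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the use of the population standard deviation, and the small `eps` guard against a zero denominator are all assumptions introduced here.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize the G scalar rewards of one prompt into group-relative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps is an assumption, not from the source: it avoids division by zero
    # when every response in the group receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(rewards, lengths):
    """Broadcast each response-level advantage to every token of that response."""
    return [[a] * n for a, n in zip(group_advantages(rewards), lengths)]

rewards = [1.0, 0.0, 0.0, 1.0]           # hypothetical rewards for G = 4 responses
adv = group_advantages(rewards)          # approximately [1, -1, -1, 1]
per_token = token_advantages(rewards, [2, 1, 3, 1])
```

When all responses in a group receive the same reward, the advantages are all zero and the group contributes no gradient.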

During off-policy optimization (training policy $\pi_\theta$ on samples from $\pi_{\theta_{\text{old}}}$), the per-token IS weight is defined as

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q,o_{i,<t})}\,.$$

CISPO applies a clipped IS weight for every token:

$$\hat{r}_{i,t}(\theta) = \operatorname{clip}\big(r_{i,t}(\theta),\ 1-\epsilon_{\text{low}}^{IS},\ 1+\epsilon_{\text{high}}^{IS}\big)\,,$$

where $\epsilon_{\text{low}}^{IS}$ is typically set to a large value (practically disabling lower clipping), and $\epsilon_{\text{high}}^{IS}$ is the bound that is tuned for the best trade-off.
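As a hedged illustration of the ratio and its clipping, the sketch below computes the IS weight from log-probabilities and clips it. The constants `EPS_LOW` and `EPS_HIGH` are arbitrary demonstration values chosen here, not the settings used in the source, and the helper names are hypothetical.

```python
import math

def is_ratio(logp_new, logp_old):
    """Per-token IS weight r = pi_theta / pi_theta_old, computed from log-probs."""
    return math.exp(logp_new - logp_old)

def clip_is_weight(ratio, eps_low, eps_high):
    """Clip the IS weight to [1 - eps_low, 1 + eps_high]."""
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)

# Demonstration values only (assumptions, NOT taken from the source).
EPS_LOW = 1e6    # large value: the lower bound is negative, so it never binds
EPS_HIGH = 0.5   # arbitrary upper clipping range for this example

# Token whose probability tripled (0.1 -> 0.3): ratio ~3.0 is clipped to 1.5.
w_up = clip_is_weight(is_ratio(math.log(0.3), math.log(0.1)), EPS_LOW, EPS_HIGH)
# Token whose probability collapsed (0.1 -> 0.001): ratio ~0.01 is left unchanged.
w_down = clip_is_weight(is_ratio(math.log(0.001), math.log(0.1)), EPS_LOW, EPS_HIGH)
```

Because the ratio is always positive, any `eps_low` greater than 1 makes the lower bound inactive, which matches the "practically disabled" lower clipping described above.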

The CISPO objective for policy update is

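$$J_{\text{CISPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot|q)} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t} \log \pi_\theta(o_{i,t}|q,o_{i,<t}) \right],$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient and $|o_i|$ is the number of tokens in response $o_i$.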
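The role of the stop-gradient can be made concrete without an autodiff library: because the clipped weight is treated as a constant, the derivative of the loss with respect to each token's log-probability is simply the negated weight times the advantage, divided by the token count. The following is a sketch under that reading; the function name and the flat token layout are assumptions made for illustration.

```python
import math

def cispo_loss_and_grad(logp_new, logp_old, adv, eps_low, eps_high):
    """Return (loss, dloss/dlogp_new) over a flat list of tokens.

    loss = -(1/N) * sum_t sg(clip(r_t)) * A_t * logp_new_t
    The clipped weight is held constant (stop-gradient), so the gradient
    with respect to logp_new_t is -clip(r_t) * A_t / N.
    """
    n = len(logp_new)
    loss, grad = 0.0, []
    for lp, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(lp - lp_old)
        w = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)  # constant weight
        loss += -w * a * lp / n
        grad.append(-w * a / n)
    return loss, grad

# Two hypothetical tokens sampled at probability 0.1 under the old policy.
loss, grad = cispo_loss_and_grad(
    logp_new=[math.log(0.3), math.log(0.05)],
    logp_old=[math.log(0.1), math.log(0.1)],
    adv=[1.0, -1.0],
    eps_low=1e6,    # illustrative: lower clipping disabled
    eps_high=0.5,   # illustrative value, not from the source
)
```

In an autodiff framework the same effect is usually obtained by detaching the clipped weight before multiplying it with the log-probability.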

CISPO differs conceptually from PPO and related RL variants in how it handles update stability. PPO clips the surrogate objective:

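$$J_{\text{PPO}}(\theta) = \mathbb{E}\left[ \min\Big( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\,\hat{A}_{i,t} \Big) \right],$$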

which may zero out a token's update when $r_{i,t}(\theta)$ leaves the trust region: once the clipped term is selected, it is constant in $\theta$ and contributes no gradient. This erosion of token contributions is detrimental when training on tasks requiring extensive chain-of-thought modeling, where key tokens may initially have low probability.

In contrast, CISPO clips only the IS weight $r_{i,t}(\theta)$ and retains each token's gradient update. No tokens are masked or dropped, preventing the loss of learning signal for rare yet critical tokens.
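The difference can be seen by comparing the coefficient that multiplies each token's log-probability gradient under the two schemes. The sketch below uses the standard PPO clipped surrogate and hypothetical helper names; the clipping ranges are illustrative values, not settings from the source.

```python
def ppo_grad_coeff(ratio, adv, eps):
    """Coefficient on grad(log pi) under the PPO clipped surrogate.

    When the clipped branch of the min is active, the term is constant
    in theta, so the token receives no gradient at all.
    """
    clipped_out = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
    return 0.0 if clipped_out else ratio * adv

def cispo_grad_coeff(ratio, adv, eps_low, eps_high):
    """Coefficient on grad(log pi) under CISPO: bounded weight, never zeroed."""
    w = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return w * adv

# A rare token whose probability tripled and that earned positive advantage.
ppo_rare = ppo_grad_coeff(3.0, 1.0, eps=0.2)                          # dropped
cispo_rare = cispo_grad_coeff(3.0, 1.0, eps_low=1e6, eps_high=0.5)    # kept, bounded

# Inside the trust region both schemes give the same coefficient.
ppo_near = ppo_grad_coeff(1.1, 1.0, eps=0.2)
cispo_near = cispo_grad_coeff(1.1, 1.0, eps_low=1e6, eps_high=0.5)
```

Here the PPO coefficient for the rare token is zero, while CISPO keeps a contribution of bounded size (1.5 times the advantage in this example).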

3. Algorithmic Workflow and Pseudocode

The CISPO routine proceeds as follows:

  • Sample a batch of prompts, each with $G$ autoregressive responses under $\pi_{\theta_{\text{old}}}$.
  • Compute scalar rewards, together with the group mean and the group standard deviation.
  • For each token $o_{i,t}$:
    • Calculate $\hat{A}_{i,t}$.
    • Compute IS weight $r_{i,t}(\theta)$, clip to obtain $\hat{r}_{i,t}(\theta)$.
    • Calculate loss: $-\operatorname{sg}\big(\hat{r}_{i,t}(\theta)\big)\,\hat{A}_{i,t} \log \pi_\theta(o_{i,t}|q,o_{i,<t})$.
  • Update $\theta$ via AdamW optimizer; refresh $\pi_{\theta_{\text{old}}}$ after a fixed number of steps.

The pseudocode amounts to an element-wise computation of ratios and clipping, so its cost is linear in the number of tokens per batch. The procedure requires no additional large-matrix computations or secondary networks.
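The workflow above can be exercised end to end on a toy problem. The sketch below is an assumption-laden miniature, not the MiniMax-M1 training loop: the "policy" is a softmax over three tokens, each response is a single token, the reward is 1 for a hypothetical target token, gradients are derived by hand, and plain gradient ascent stands in for AdamW. All hyperparameter values are illustrative.

```python
import math
import random
from statistics import mean, pstdev

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cispo_train(target=2, vocab=3, group=8, iters=50, inner_steps=2,
                lr=0.5, eps_low=1e6, eps_high=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * vocab                        # policy logits
    for _ in range(iters):
        p_old = softmax(theta)                   # freeze pi_theta_old
        # 1. Sample G one-token responses under the old policy.
        actions = rng.choices(range(vocab), weights=p_old, k=group)
        # 2. Rewards and group-relative advantages.
        rewards = [1.0 if a == target else 0.0 for a in actions]
        mu, sigma = mean(rewards), pstdev(rewards)
        adv = [(r - mu) / (sigma + 1e-8) for r in rewards]
        for _ in range(inner_steps):             # reuse the batch off-policy
            p = softmax(theta)
            grad = [0.0] * vocab
            for a, adv_a in zip(actions, adv):
                # 3. Clipped IS weight, treated as a constant (stop-gradient).
                w = min(max(p[a] / p_old[a], 1 - eps_low), 1 + eps_high)
                # d log pi(a) / d theta_k = 1[k == a] - p[k]
                for k in range(vocab):
                    onehot = 1.0 if k == a else 0.0
                    grad[k] += w * adv_a * (onehot - p[k]) / group
            # 4. Gradient ascent on the objective (plain SGD here, not AdamW).
            theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

probs = cispo_train()   # probability mass should shift toward the target token
```

The old policy is refreshed once per outer iteration, so the second inner step is genuinely off-policy and is the one where the clipped weight differs from 1.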

4. Variance Reduction and Signal Preservation

Unclipped IS weights can produce outlier magnitudes in gradient estimates, especially when the target policy $\pi_\theta$ assigns sharply increased probability to rare tokens in the sampled trajectories. Such variance destabilizes training. Clipping at the upper bound $1+\epsilon_{\text{high}}^{IS}$ mitigates these spikes, leading to smoother reward and accuracy trajectories as observed empirically in MiniMax-M1 experiments.
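A small numerical illustration of this effect, using made-up ratios and advantages (none of these numbers come from the source): a single outlier ratio dominates the spread of the weighted terms until the upper clip is applied.

```python
from statistics import pvariance

def weighted_terms(ratios, advs, eps_high=None):
    """Per-token weight * advantage, optionally with upper clipping of the weight."""
    out = []
    for r, a in zip(ratios, advs):
        w = r if eps_high is None else min(r, 1.0 + eps_high)
        out.append(w * a)
    return out

# Hypothetical batch: four near-on-policy tokens and one outlier ratio of 40.
ratios = [0.9, 1.0, 1.1, 1.05, 40.0]
advs = [1.0, -1.0, 1.0, -1.0, 1.0]

raw = weighted_terms(ratios, advs)
clipped = weighted_terms(ratios, advs, eps_high=0.5)   # illustrative eps_high

var_raw = pvariance(raw)          # dominated by the single outlier
var_clipped = pvariance(clipped)  # every |term| is now at most 1.5 * |A|
```

Clipping trades a small amount of bias on the outlier token for a much tighter spread, while the token still contributes with the correct sign.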

Crucially, CISPO preserves the learning signal for all tokens, including those with low base probability but high subsequent reward, such as the reflective tokens “Aha” and “Wait”. Empirical analysis indicates that these tokens are essential for chain-of-thought stability, in contrast to PPO's tendency to drop them.

5. Computational Implementation and Complexity

The overhead of CISPO is negligible relative to standard RL objectives. Clipping IS weights is a simple element-wise operation and, in practice, both ratio computation and clipping are fused into a single GPU kernel. CISPO is compatible with mixed-precision training and with an FP32 output head, which helps keep training-time and inference-time token probabilities consistent.
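One way to combine the ratio and the clip in a single pass, shown here in plain Python rather than as a GPU kernel, is to clamp in log space before exponentiating. This is a sketch of a common numerical trick and an assumption about how such fusion might look, not a description of the source's kernel; it also assumes lower clipping is disabled.

```python
import math

def clipped_ratio_logspace(logp_new, logp_old, eps_high):
    """Upper-clipped IS weight computed in one step.

    Uses exp(min(delta, log(1 + eps_high))) == min(exp(delta), 1 + eps_high),
    which avoids overflow when the log-probability gap is very large.
    """
    delta = logp_new - logp_old
    return math.exp(min(delta, math.log1p(eps_high)))

# A gap of 1000 nats would overflow math.exp; the clamped form returns ~1.5.
w_extreme = clipped_ratio_logspace(0.0, -1000.0, eps_high=0.5)
# Inside the bound the result equals the plain ratio.
w_plain = clipped_ratio_logspace(-2.0, -1.0, eps_high=0.5)
```

Computing the weight from log-probabilities also matches how language-model training code typically stores per-token likelihoods.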

No value function estimators or large auxiliary matrices are required, reducing memory consumption and implementation complexity compared to certain actor–critic architectures.

6. Hyperparameter Choices and Ablations

Key CISPO hyperparameters:

  • $\epsilon_{\text{high}}^{IS}$: the upper IS clipping range, selected by ablation.
  • Group size $G$: the number of responses per prompt.
  • Off-policy update refresh frequency: how often $\pi_{\theta_{\text{old}}}$ is replaced by the current policy.
  • AdamW optimizer settings, including the learning rate.
  • Batch size, set to fix the total number of tokens per step.

Ablation studies (Fig. 3 in (MiniMax et al., 16 Jun 2025)) show that a suitably tuned $\epsilon_{\text{high}}^{IS}$ yields strong variance reduction without a marked increase in update bias.

7. Empirical Results and Comparative Performance

On the AIME 2024 mathematical reasoning benchmark with Qwen2.5-32B-base, CISPO is reported to reach higher accuracy than GRPO and DAPO at the same number of RL steps. GRPO and DAPO each require more training to reach comparable final performance, while CISPO converges fastest.

CISPO’s sample efficiency is also noteworthy: it reaches DAPO’s performance peak in a fraction of the corresponding wall-clock time. Training trajectories are markedly smoother with fewer oscillations, consistent with the variance-reduction and signal-preservation design.

In summary, CISPO is a principled modification of the off-policy policy-gradient family: it bounds IS weights rather than dropping surrogate objective terms, enabling faster and more stable RL training in large-scale reasoning models (MiniMax et al., 16 Jun 2025).
