
CISPO: Clipped Importance Sampling RL

Updated 27 January 2026
  • CISPO is a reinforcement learning algorithm that clips token-level importance sampling weights to bound variance and retain learning signals for every token.
  • It modifies the off-policy REINFORCE objective to prevent token dropout, thereby enhancing stability and sample efficiency compared to methods like PPO.
  • Empirical results demonstrate that CISPO converges faster and achieves higher accuracy in large-scale autoregressive models by preserving critical token updates.

CISPO (Clipped Importance Sampling Policy Optimization) is a reinforcement learning (RL) algorithm developed to enhance stability and computational efficiency in large-scale, off-policy policy-gradient training, particularly for autoregressive LLMs such as MiniMax-M1. CISPO modifies the standard off-policy REINFORCE objective by clipping the importance sampling (IS) weight applied to each token's likelihood gradient, ensuring that all tokens contribute to the policy update while tightly bounding update variance. CISPO achieves smoother training trajectories, improved sample efficiency, and superior empirical performance relative to contemporary approaches such as PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) (MiniMax et al., 16 Jun 2025).

1. Mathematical Formulation and Learning Framework

The CISPO algorithm operates in a batched RL setting, using prompts $q \in \mathcal{D}$ and an autoregressive policy $\pi_\theta(\cdot \mid q)$ to generate $G$ responses per prompt. Each response $o_i = (o_{i,1}, \ldots, o_{i,T})$ receives a scalar reward $R_i$. Group-relative token-level advantage estimates are constructed as

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}_{j=1\ldots G}(R_j)}{\operatorname{std}_{j=1\ldots G}(R_j)},$$

assigning higher advantage to tokens from highly rewarded responses.
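As a concrete illustration, the group-relative normalization above takes only a few lines. This is a pure-Python sketch; the function name `group_relative_advantages` is ours, not from the paper:

```python
import math

def group_relative_advantages(rewards):
    """Normalize each response's scalar reward against its group:
    A_i = (R_i - mean(R)) / std(R).  Every token of response i
    then shares the same advantage A_i."""
    G = len(rewards)
    mean = sum(rewards) / G
    # population standard deviation over the G responses in the group
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / std for r in rewards]

# Example: 4 responses to one prompt with 0/1 rewards (e.g. wrong/correct)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```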

During off-policy optimization (training policy $\pi_\theta$ on samples from $\pi_{\theta_{\text{old}}}$), the per-token IS weight is defined as

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}.$$

CISPO applies a clipped IS weight for every token:

$$\hat{r}_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\; 1-\epsilon_{\text{low}}^{IS},\; 1+\epsilon_{\text{high}}^{IS}\right),$$

where $\epsilon_{\text{low}}^{IS}$ is typically set much larger than $1$ (practically disabling lower clipping, since token ratios are non-negative), and $\epsilon_{\text{high}}^{IS}$ is tuned within $[0.1, 0.5]$ (e.g., $0.3$ for the best trade-off).
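A minimal sketch of this asymmetric clipping, assuming per-token probabilities under the current and behavior policies are available (function and parameter names are ours):

```python
def clipped_is_weight(p_new, p_old, eps_low=1e6, eps_high=0.3):
    """Per-token clipped importance-sampling weight.
    eps_low is set very large so the lower bound 1 - eps_low falls
    far below 0; since probability ratios are non-negative, lower
    clipping is effectively disabled.  eps_high bounds upward spikes."""
    r = p_new / p_old
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    return max(lo, min(r, hi))

print(clipped_is_weight(0.9, 0.3))   # ratio ~3 is clipped down to 1.3
print(clipped_is_weight(0.1, 0.4))   # ratio 0.25 passes through unclipped
```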

The CISPO objective for policy update is

$$J_{\text{CISPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t}\, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],$$

where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator.
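Numerically, the stop-gradient means the clipped weight acts as a fixed scalar coefficient on the log-likelihood term. A sketch of the per-token (negated) objective under that reading, with names of our own choosing:

```python
import math

def cispo_token_loss(p_new, p_old, advantage, eps_high=0.3):
    """-sg(r_hat) * A * log pi_theta for a single token.
    Because of the stop-gradient, r_hat is just a constant
    multiplier on the log-prob; only log(p_new) carries gradient."""
    r = p_new / p_old
    r_hat = min(r, 1.0 + eps_high)   # upper clip only; lower clip disabled
    return -r_hat * advantage * math.log(p_new)

# A low-probability token whose likelihood rose sharply (ratio 5) with
# positive advantage still receives a full, bounded update.
loss = cispo_token_loss(p_new=0.05, p_old=0.01, advantage=1.0)
```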

2. Comparison with PPO and Related Methods

CISPO introduces a conceptual divergence from PPO and related RL variants in its handling of update stability. PPO clips the surrogate objective:

$$L^{\text{PPO}}_{t} = \min\!\left(r_{i,t}\, \hat{A}_{i,t},\; \operatorname{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_{i,t}\right),$$

which may zero out a token's update entirely whenever $r_{i,t}$ breaches the trust region in the unfavorable direction. This erosion of token contributions is detrimental when training on tasks requiring extensive chain-of-thought modeling, where key tokens may initially have low probability.

In contrast, CISPO clips only the IS weight $\hat{r}_{i,t}$ while retaining each token's gradient update. No tokens are masked or dropped, preventing the loss of learning signal for rare yet critical tokens.
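The difference can be made concrete by comparing the multiplier each method places on a token's $\nabla_\theta \log \pi_\theta$. This is an illustrative sketch (function names ours), not the paper's code:

```python
def ppo_token_coeff(r, adv, eps=0.2):
    """Multiplier on grad log pi for one token under PPO's surrogate
    min(r*A, clip(r, 1-eps, 1+eps)*A).  When the flat, clipped branch
    is selected by the min, the token's gradient vanishes."""
    lo, hi = 1.0 - eps, 1.0 + eps
    unclipped = r * adv
    clipped = max(lo, min(r, hi)) * adv
    if lo <= r <= hi or unclipped <= clipped:
        return unclipped          # gradient flows through r
    return 0.0                    # clipped branch active: token dropped

def cispo_token_coeff(r, adv, eps_high=0.3):
    """Multiplier on grad log pi under CISPO: the clipped IS weight
    times the advantage -- bounded, but never zeroed out."""
    return min(r, 1.0 + eps_high) * adv

# Token whose probability rose sharply (r = 3.0) with positive advantage:
print(ppo_token_coeff(3.0, 1.0))    # 0.0: PPO drops this token's update
print(cispo_token_coeff(3.0, 1.0))  # 1.3: CISPO keeps a bounded update
```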

3. Algorithmic Workflow and Pseudocode

The CISPO routine proceeds as follows:

  • Sample batches of $B$ prompts, each with $G$ autoregressive responses drawn from $\pi_{\theta_{\text{old}}}$.
  • Compute scalar rewards, group mean ($\mu_b$), and standard deviation ($\sigma_b$).
  • For each token $(i,t)$:
    • Calculate the advantage $\hat{A}_{b,i,t}$.
    • Compute the IS weight $r_{b,i,t}$ and clip it to obtain $\hat{r}_{b,i,t}$.
    • Calculate the loss $\ell_{b,i,t} = -\operatorname{sg}(\hat{r}_{b,i,t})\,\hat{A}_{b,i,t}\,\log \pi_\theta(o_{b,i,t} \mid q_b, o_{b,i,<t})$.
  • Update $\theta$ via the AdamW optimizer; refresh $\theta_{\text{old}}$ every $K$ steps.

The pseudocode reduces to element-wise computation of ratios and clipping (complexity $O(N)$, where $N$ is the token count per batch). The procedure requires no additional large-matrix computations or secondary networks.
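Assuming per-token probabilities under both policies are already gathered, the loss aggregation in the loop above can be sketched in pure Python (data layout and all names are ours, for illustration only):

```python
import math

def cispo_batch_loss(groups, eps_high=0.3):
    """groups: one list per prompt; each entry is a (reward, tokens)
    response, where tokens is a list of (p_new, p_old) per-token
    probabilities under pi_theta and pi_theta_old.
    Returns the loss averaged over all tokens in the batch."""
    total, n_tokens = 0.0, 0
    for group in groups:
        rewards = [reward for reward, _ in group]
        mu = sum(rewards) / len(rewards)
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
        for reward, tokens in group:
            adv = (reward - mu) / sigma     # shared by every token of this response
            for p_new, p_old in tokens:
                r_hat = min(p_new / p_old, 1.0 + eps_high)  # lower clip disabled
                total += -r_hat * adv * math.log(p_new)
                n_tokens += 1
    return total / n_tokens

# One prompt, two single-token responses with rewards 1 and 0 and
# identical token probabilities: the two loss terms cancel exactly.
batch = [[(1.0, [(0.5, 0.5)]), (0.0, [(0.5, 0.5)])]]
print(cispo_batch_loss(batch))  # → 0.0
```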

4. Variance Reduction and Signal Preservation

Unclipped IS weights can yield outlier magnitudes in gradient estimates, especially when the target policy $\pi_\theta$ assigns sharply increased probability to rare tokens in the sampled trajectories. Such variance destabilizes training. Clipping at $1+\epsilon_{\text{high}}$ (e.g., $\epsilon_{\text{high}} = 0.3$) mitigates these spikes, leading to smoother reward and accuracy trajectories, as observed empirically in MiniMax-M1 experiments.
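A toy numeric check of this variance effect, using hypothetical ratio values of our own:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Four ordinary ratios plus one outlier from a sharply up-weighted token.
ratios = [0.9, 1.0, 1.1, 1.2, 8.0]
clipped = [min(r, 1.3) for r in ratios]   # clip at 1 + eps_high with eps_high = 0.3

print(variance(ratios) > variance(clipped))  # True: clipping removes the spike
```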

Crucially, CISPO preserves the learning signal for all tokens, including those with low base probability but high subsequent reward (e.g., reflective tokens such as "Aha" and "Wait"). Empirical analysis demonstrates these are essential for chain-of-thought stability, in contrast to PPO's token-dropping tendency.

5. Computational Implementation and Complexity

The overhead of CISPO is negligible relative to standard RL objectives. Clipping IS weights is a simple $O(N)$ operation and, in practice, both ratio computation and clipping are fused into a single GPU kernel. CISPO is compatible with mixed-precision or FP32 heads, crucially maintaining congruence between training-time and inference-time token probabilities.

No value function estimators or large auxiliary matrices are required, reducing memory consumption and implementation complexity compared to certain actor–critic architectures.

6. Hyperparameter Choices and Ablations

Key CISPO hyperparameters:

  • $\epsilon_{\text{high}}^{IS}$: typical range $0.2$–$0.4$ ($0.3$ offers the best performance per ablation).
  • Group size $G = 16$ responses per prompt.
  • Off-policy update refresh frequency $K = 16$.
  • AdamW optimizer with learning rate $\approx 2 \times 10^{-6}$, $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-15}$.
  • Batch size $B$ set such that total tokens per step $\approx 10^6$.
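These settings can be gathered into a single configuration object. The values come from the list above; the dataclass itself is our illustration, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class CISPOConfig:
    """Hyperparameters for a CISPO training run (illustrative names)."""
    eps_high_is: float = 0.3          # upper IS clip; ablated best in 0.2-0.4
    group_size: int = 16              # G: responses sampled per prompt
    refresh_every: int = 16           # K: steps between pi_theta_old refreshes
    lr: float = 2e-6                  # AdamW learning rate
    beta1: float = 0.9
    beta2: float = 0.95
    adam_eps: float = 1e-15
    tokens_per_step: int = 1_000_000  # batch size B chosen to hit this budget

cfg = CISPOConfig()
```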

Ablation studies (Fig. 3 in (MiniMax et al., 16 Jun 2025)) show that $\epsilon_{\text{high}} = 0.3$ yields the best variance reduction without introducing excessive update bias.

7. Empirical Results and Comparative Performance

On the AIME 2024 mathematical reasoning benchmark using Qwen2.5-32B-base, CISPO achieves $76\%$ accuracy after $10{,}000$ RL steps, compared to roughly $68\%$ for GRPO and $72\%$ for DAPO. To reach comparable final performance, GRPO requires approximately $2\times$ and DAPO approximately $1.5\times$ the training, while CISPO converges fastest.

CISPO's sample efficiency is also noteworthy, reaching DAPO's peak performance in approximately $50\%$ of the corresponding wall-clock time. Training trajectories are markedly smoother, with fewer oscillations, validating the variance-reduction and signal-preservation design. In summary, CISPO is a principled modification of the off-policy policy-gradient family: it bounds IS weights rather than dropping surrogate-objective terms, enabling faster, more stable RL training in large-scale reasoning models (MiniMax et al., 16 Jun 2025).
