
CISPO: Clipped Importance Sampling RL

Updated 27 January 2026
  • CISPO is a reinforcement learning algorithm that clips token-level importance sampling weights to bound variance and retain learning signals for every token.
  • It modifies the off-policy REINFORCE objective to prevent token dropout, thereby enhancing stability and sample efficiency compared to methods like PPO.
  • Empirical results demonstrate that CISPO converges faster and achieves higher accuracy in large-scale autoregressive models by preserving critical token updates.

CISPO (Clipped Importance Sampling Policy Optimization) is a reinforcement learning (RL) algorithm developed to enhance stability and computational efficiency in large-scale, off-policy policy-gradient training, particularly for autoregressive LLMs such as MiniMax-M1. CISPO modifies the standard off-policy REINFORCE objective by clipping the importance sampling (IS) weight applied to each token's likelihood gradient, ensuring that all tokens contribute to the policy update while tightly bounding update variance. CISPO achieves smoother training trajectories, improved sample efficiency, and superior empirical performance relative to contemporary approaches such as PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) (MiniMax et al., 16 Jun 2025).

1. Mathematical Formulation and Learning Framework

The CISPO algorithm operates in a batched RL setting, using prompts $q \in \mathcal{D}$ and an autoregressive policy $\pi_\theta(\cdot \mid q)$ to generate $G$ responses per prompt. Each response $o_i = (o_{i,1}, \ldots, o_{i,T})$ receives a scalar reward $R_i$. Group-relative token-level advantage estimates are constructed as

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}_{j=1\ldots G}(R_j)}{\operatorname{std}_{j=1\ldots G}(R_j)},$$

assigning higher advantage to tokens from highly rewarded responses.
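As a concrete illustration, the group-relative normalization above takes only a few lines. This is a pure-Python sketch; the function name `group_relative_advantages` is ours, not from the paper:

```python
import math

def group_relative_advantages(rewards):
    """Normalize each response's scalar reward against its group:
    A_i = (R_i - mean(R)) / std(R).  Every token of response i
    then shares the same advantage A_i."""
    G = len(rewards)
    mean = sum(rewards) / G
    # population standard deviation over the G responses in the group
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / std for r in rewards]

# Example: 4 responses to one prompt with 0/1 rewards (e.g. wrong/correct)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```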

During off-policy optimization (training policy $\pi_\theta$ on samples from $\pi_{\theta_{\text{old}}}$), the per-token IS weight is defined as

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}.$$

CISPO applies a clipped IS weight for every token:

$$\hat{r}_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\; 1-\epsilon_{\text{low}}^{IS},\; 1+\epsilon_{\text{high}}^{IS}\right),$$

where $\epsilon_{\text{low}}^{IS}$ is typically set much larger than $1$ (practically disabling lower clipping, since token ratios are non-negative), and $\epsilon_{\text{high}}^{IS}$ is tuned within $[0.1, 0.5]$ (e.g., $0.3$ for the best trade-off).
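A minimal sketch of this asymmetric clipping, assuming per-token probabilities under the current and behavior policies are available (function and parameter names are ours):

```python
def clipped_is_weight(p_new, p_old, eps_low=1e6, eps_high=0.3):
    """Per-token clipped importance-sampling weight.
    eps_low is set very large so the lower bound 1 - eps_low falls
    far below 0; since probability ratios are non-negative, lower
    clipping is effectively disabled.  eps_high bounds upward spikes."""
    r = p_new / p_old
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    return max(lo, min(r, hi))

print(clipped_is_weight(0.9, 0.3))   # ratio ~3 is clipped down to 1.3
print(clipped_is_weight(0.1, 0.4))   # ratio 0.25 passes through unclipped
```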

The CISPO objective for policy update is

$$J_{\text{CISPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t}\, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],$$

where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator.
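Numerically, the stop-gradient means the clipped weight acts as a fixed scalar coefficient on the log-likelihood term. A sketch of the per-token (negated) objective under that reading, with names of our own choosing:

```python
import math

def cispo_token_loss(p_new, p_old, advantage, eps_high=0.3):
    """-sg(r_hat) * A * log pi_theta for a single token.
    Because of the stop-gradient, r_hat is just a constant
    multiplier on the log-prob; only log(p_new) carries gradient."""
    r = p_new / p_old
    r_hat = min(r, 1.0 + eps_high)   # upper clip only; lower clip disabled
    return -r_hat * advantage * math.log(p_new)

# A low-probability token whose likelihood rose sharply (ratio 5) with
# positive advantage still receives a full, bounded update.
loss = cispo_token_loss(p_new=0.05, p_old=0.01, advantage=1.0)
```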

2. Comparison with PPO and Related Methods

CISPO introduces a conceptual divergence from PPO and related RL variants in its handling of update stability. PPO clips the surrogate objective:

$$L^{\text{PPO}}_{t} = \min\!\left(r_{i,t}\, \hat{A}_{i,t},\; \operatorname{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_{i,t}\right),$$

which may zero out a token's update entirely whenever $r_{i,t}$ breaches the trust region in the unfavorable direction. This erosion of token contributions is detrimental when training on tasks requiring extensive chain-of-thought modeling, where key tokens may initially have low probability.

In contrast, CISPO clips only the IS weight $\hat{r}_{i,t}$ while retaining each token's gradient update. No tokens are masked or dropped, preventing the loss of learning signal for rare yet critical tokens.
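The difference can be made concrete by comparing the multiplier each method places on a token's $\nabla_\theta \log \pi_\theta$. This is an illustrative sketch (function names ours), not the paper's code:

```python
def ppo_token_coeff(r, adv, eps=0.2):
    """Multiplier on grad log pi for one token under PPO's surrogate
    min(r*A, clip(r, 1-eps, 1+eps)*A).  When the flat, clipped branch
    is selected by the min, the token's gradient vanishes."""
    lo, hi = 1.0 - eps, 1.0 + eps
    unclipped = r * adv
    clipped = max(lo, min(r, hi)) * adv
    if lo <= r <= hi or unclipped <= clipped:
        return unclipped          # gradient flows through r
    return 0.0                    # clipped branch active: token dropped

def cispo_token_coeff(r, adv, eps_high=0.3):
    """Multiplier on grad log pi under CISPO: the clipped IS weight
    times the advantage -- bounded, but never zeroed out."""
    return min(r, 1.0 + eps_high) * adv

# Token whose probability rose sharply (r = 3.0) with positive advantage:
print(ppo_token_coeff(3.0, 1.0))    # 0.0: PPO drops this token's update
print(cispo_token_coeff(3.0, 1.0))  # 1.3: CISPO keeps a bounded update
```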

3. Algorithmic Workflow and Pseudocode

The CISPO routine proceeds as follows:

  • Sample batches of $B$ prompts, each with $G$ autoregressive responses drawn from $\pi_{\theta_{\text{old}}}$.
  • Compute scalar rewards, group mean ($\mu_b$), and standard deviation ($\sigma_b$).
  • For each token $(i,t)$:
    • Calculate the advantage $\hat{A}_{b,i,t}$.
    • Compute the IS weight $r_{b,i,t}$ and clip it to obtain $\hat{r}_{b,i,t}$.
    • Calculate the loss $\ell_{b,i,t} = -\operatorname{sg}(\hat{r}_{b,i,t})\,\hat{A}_{b,i,t}\,\log \pi_\theta(o_{b,i,t} \mid q_b, o_{b,i,<t})$.
  • Update $\theta$ via the AdamW optimizer; refresh $\theta_{\text{old}}$ every $K$ steps.

The pseudocode reduces to element-wise computation of ratios and clipping (complexity $O(N)$, where $N$ is the token count per batch). The procedure requires no additional large-matrix computations or secondary networks.
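Assuming per-token probabilities under both policies are already gathered, the loss aggregation in the loop above can be sketched in pure Python (data layout and all names are ours, for illustration only):

```python
import math

def cispo_batch_loss(groups, eps_high=0.3):
    """groups: one list per prompt; each entry is a (reward, tokens)
    response, where tokens is a list of (p_new, p_old) per-token
    probabilities under pi_theta and pi_theta_old.
    Returns the loss averaged over all tokens in the batch."""
    total, n_tokens = 0.0, 0
    for group in groups:
        rewards = [reward for reward, _ in group]
        mu = sum(rewards) / len(rewards)
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
        for reward, tokens in group:
            adv = (reward - mu) / sigma     # shared by every token of this response
            for p_new, p_old in tokens:
                r_hat = min(p_new / p_old, 1.0 + eps_high)  # lower clip disabled
                total += -r_hat * adv * math.log(p_new)
                n_tokens += 1
    return total / n_tokens

# One prompt, two single-token responses with rewards 1 and 0 and
# identical token probabilities: the two loss terms cancel exactly.
batch = [[(1.0, [(0.5, 0.5)]), (0.0, [(0.5, 0.5)])]]
print(cispo_batch_loss(batch))  # → 0.0
```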

4. Variance Reduction and Signal Preservation

Unclipped IS weights can yield outlier magnitudes in gradient estimates, especially when the target policy $\pi_\theta$ assigns sharply increased probability to rare tokens in the sampled trajectories. Such variance destabilizes training. Clipping at $1+\epsilon_{\text{high}}$ (e.g., $\epsilon_{\text{high}} = 0.3$) mitigates these spikes, leading to smoother reward and accuracy trajectories, as observed empirically in MiniMax-M1 experiments.
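A toy numeric check of this variance effect, using hypothetical ratio values of our own:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Four ordinary ratios plus one outlier from a sharply up-weighted token.
ratios = [0.9, 1.0, 1.1, 1.2, 8.0]
clipped = [min(r, 1.3) for r in ratios]   # clip at 1 + eps_high with eps_high = 0.3

print(variance(ratios) > variance(clipped))  # True: clipping removes the spike
```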

Crucially, CISPO preserves the learning signal for all tokens, including those with low base probability but high subsequent reward (e.g., reflective tokens such as "Aha" and "Wait"). Empirical analysis demonstrates these are essential for chain-of-thought stability, in contrast to PPO's token-dropping tendency.

5. Computational Implementation and Complexity

The overhead of CISPO is negligible relative to standard RL objectives. Clipping IS weights is a simple $O(N)$ operation and, in practice, both ratio computation and clipping are fused into a single GPU kernel. CISPO is compatible with mixed-precision or FP32 heads, crucially maintaining congruence between training-time and inference-time token probabilities.

No value function estimators or large auxiliary matrices are required, reducing memory consumption and implementation complexity compared to certain actor–critic architectures.

6. Hyperparameter Choices and Ablations

Key CISPO hyperparameters:

  • $\epsilon_{\text{high}}^{IS}$: typical range $0.2$–$0.4$ ($0.3$ offers the best performance per ablation).
  • Group size $G = 16$ responses per prompt.
  • Off-policy update refresh frequency $K = 16$.
  • AdamW optimizer with learning rate $\approx 2 \times 10^{-6}$, $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-15}$.
  • Batch size $B$ set such that total tokens per step $\approx 10^6$.
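These settings can be gathered into a single configuration object. The values come from the list above; the dataclass itself is our illustration, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class CISPOConfig:
    """Hyperparameters for a CISPO training run (illustrative names)."""
    eps_high_is: float = 0.3          # upper IS clip; ablated best in 0.2-0.4
    group_size: int = 16              # G: responses sampled per prompt
    refresh_every: int = 16           # K: steps between pi_theta_old refreshes
    lr: float = 2e-6                  # AdamW learning rate
    beta1: float = 0.9
    beta2: float = 0.95
    adam_eps: float = 1e-15
    tokens_per_step: int = 1_000_000  # batch size B chosen to hit this budget

cfg = CISPOConfig()
```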

Ablation studies (Fig. 3 in (MiniMax et al., 16 Jun 2025)) show that $\epsilon_{\text{high}} = 0.3$ yields the best variance reduction without introducing excessive update bias.

7. Empirical Results and Comparative Performance

On the AIME 2024 mathematical reasoning benchmark using Qwen2.5-32B-base, CISPO achieves $76\%$ accuracy after $10{,}000$ RL steps, compared to roughly $68\%$ for GRPO and $72\%$ for DAPO. To reach comparable final performance, GRPO requires approximately $2\times$ and DAPO approximately $1.5\times$ the training, while CISPO converges fastest.

CISPO's sample efficiency is also noteworthy, reaching DAPO's peak performance in approximately $50\%$ of the corresponding wall-clock time. Training trajectories are markedly smoother, with fewer oscillations, validating the variance-reduction and signal-preservation design. In summary, CISPO is a principled modification of the off-policy policy-gradient family: it bounds IS weights rather than dropping surrogate-objective terms, enabling faster, more stable RL training in large-scale reasoning models (MiniMax et al., 16 Jun 2025).
