CISPO: Clipped Importance Sampling RL
- CISPO is a reinforcement learning algorithm that clips token-level importance sampling weights to bound variance and retain learning signals for every token.
- It modifies the off-policy REINFORCE objective to prevent token dropout, thereby enhancing stability and sample efficiency compared to methods like PPO.
- Empirical results demonstrate that CISPO converges faster and achieves higher accuracy in large-scale autoregressive models by preserving critical token updates.
CISPO (Clipped Importance Sampling Policy Optimization) is a reinforcement learning (RL) algorithm developed to enhance stability and computational efficiency in large-scale, off-policy policy-gradient training, particularly for autoregressive LLMs such as MiniMax-M1. CISPO modifies the standard off-policy REINFORCE objective by clipping the importance sampling (IS) weight applied to each token’s likelihood gradient, ensuring that all tokens contribute to the policy update while tightly bounding update variance. CISPO achieves smoother training trajectories, improved sample efficiency, and superior empirical performance relative to contemporary approaches such as PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) (MiniMax et al., 16 Jun 2025).
1. Mathematical Formulation and Learning Framework
The CISPO algorithm operates in a batched RL setting: for each prompt $q$, the autoregressive policy $\pi_{\theta_{\mathrm{old}}}$ generates $G$ responses $\{o_1, \dots, o_G\}$, each receiving a scalar reward $R_i$. Group-relative token-level advantage estimates are constructed as

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},$$

assigning higher advantage to tokens from high-reward responses.
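A minimal NumPy sketch of this group-relative advantage computation (the `1e-8` epsilon is an added assumption to guard against zero-variance groups, not part of the stated formula):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standardize each response's reward against its group's mean/std;
    every token in response i then shares the scalar advantage A_i."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards a zero-variance group

# Responses rewarded above the group mean receive positive advantage.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

High-reward responses here get advantage near $+1$ and low-reward responses near $-1$, and the advantages sum to zero over the group.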
During off-policy optimization (training policy $\pi_\theta$ on samples from $\pi_{\theta_{\mathrm{old}}}$), the per-token IS weight is defined as

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}.$$
CISPO applies a clipped IS weight for every token:

$$\hat{r}_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\; 1 - \varepsilon^{\mathrm{IS}}_{\mathrm{low}},\; 1 + \varepsilon^{\mathrm{IS}}_{\mathrm{high}}\right),$$

where $\varepsilon^{\mathrm{IS}}_{\mathrm{low}}$ is typically much larger than $1$ (practically disabling lower clipping), and $\varepsilon^{\mathrm{IS}}_{\mathrm{high}}$ is tuned within $[0.2, 0.4]$ (e.g., $0.3$ for the best trade-off).
The CISPO objective for the policy update is

$$J_{\mathrm{CISPO}}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \operatorname{sg}\!\left(\hat{r}_{i,t}(\theta)\right)\hat{A}_{i,t}\,\log \pi_\theta(o_{i,t} \mid q, o_{i,<t})\right],$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient.
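The clipped weight and per-token loss term can be sketched as follows. This is plain NumPy, so the stop-gradient on the weight is implicit (it is an ordinary constant here); in an autograd framework the `r_hat` factor would be detached. The default `eps_low`/`eps_high` values follow the hyperparameters quoted later in this article:

```python
import numpy as np

def cispo_token_loss(logp_new, logp_old, advantage, eps_low=10.0, eps_high=0.3):
    """Per-token CISPO loss (negative objective term). Only log pi_theta
    would carry gradient; the clipped IS weight acts as a constant."""
    ratio = np.exp(logp_new - logp_old)                      # r_{i,t}(theta)
    r_hat = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)    # sg(r_hat)
    return -r_hat * advantage * logp_new                     # minimized by the optimizer
```

For example, a token whose probability tripled (ratio $3$) contributes with weight $1.3$ rather than $3$, but is never dropped.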
2. Operational Difference from PPO and Related Methods
CISPO introduces a conceptual divergence from PPO and related RL variants in handling update stability. PPO clips the surrogate objective:

$$J_{\mathrm{PPO}}(\theta) = \mathbb{E}\left[\min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\; \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_{i,t}\right)\right],$$

which zeroes a token's update whenever $r_{i,t}(\theta)$ breaches the trust region in the unfavorable direction. This erosion of token contributions is detrimental when training on tasks requiring extensive chain-of-thought modeling, where key tokens may initially possess low probability.
In contrast, CISPO exclusively clips the IS weight but retains each token's gradient update. No tokens are masked or dropped, preventing the loss of learning signal for rare yet critical tokens.
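The distinction can be made concrete by comparing the coefficient that multiplies each token's score gradient ($\nabla_\theta \log \pi_\theta$) under the two schemes. This is an illustrative sketch; the `eps` values are the defaults discussed elsewhere in this article:

```python
import numpy as np

def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Coefficient on d(log pi)/d(theta) under PPO's clipped surrogate.
    When min() selects the clipped (constant) branch, the token's
    gradient is zero -- the token is dropped from the update."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return unclipped if unclipped <= clipped else 0.0

def cispo_grad_coeff(ratio, adv, eps_low=10.0, eps_high=0.3):
    """CISPO coefficient: the IS weight is bounded but never zeroed."""
    return np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv

# A rare token whose probability rose sharply (ratio = 3, positive advantage):
print(ppo_grad_coeff(3.0, 1.0))    # 0.0 -> PPO drops the token entirely
print(cispo_grad_coeff(3.0, 1.0))  # 1.3 -> CISPO keeps a bounded update
```

The contrast is exactly the failure mode described above: under PPO the clipped branch is constant in $\theta$, so the token contributes nothing; under CISPO the same token still receives a variance-bounded update.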
3. Algorithmic Workflow and Pseudocode
The CISPO routine proceeds as follows:
- Sample a batch of $B$ prompts, each with $G$ autoregressive responses under $\pi_{\theta_{\mathrm{old}}}$.
- Compute scalar rewards, the group mean ($\operatorname{mean}(\{R_j\})$), and standard deviation ($\operatorname{std}(\{R_j\})$).
- For each token $o_{i,t}$:
  - Calculate $\hat{A}_{i,t}$.
  - Compute the IS weight $r_{i,t}(\theta)$ and clip it to obtain $\hat{r}_{i,t}(\theta)$.
  - Calculate the loss: $-\operatorname{sg}(\hat{r}_{i,t}(\theta))\,\hat{A}_{i,t}\,\log \pi_\theta(o_{i,t} \mid q, o_{i,<t})$.
- Update $\theta$ via the AdamW optimizer; refresh $\pi_{\theta_{\mathrm{old}}}$ at a fixed step interval.
The routine reduces to element-wise computation of ratios and clipping (complexity $O(T)$, where $T$ is the token count per batch). This procedure requires no additional large-matrix computations or secondary networks.
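The steps above can be sketched end-to-end for a single prompt's group (pure NumPy; the AdamW update and the periodic refresh of $\pi_{\theta_{\mathrm{old}}}$ are elided, and the `1e-8` epsilon is an added safeguard):

```python
import numpy as np

def cispo_batch_loss(logp_new, logp_old, rewards, eps_low=10.0, eps_high=0.3):
    """One CISPO loss evaluation for a group of G responses to a prompt.
    logp_new / logp_old: lists of per-token log-prob arrays under pi_theta
    and pi_theta_old; rewards: length-G array of scalar rewards. Returns
    the scalar loss (negative objective, token-level normalized)."""
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + 1e-8)        # group-relative advantages
    total, n_tokens = 0.0, 0
    for i, (ln, lo) in enumerate(zip(logp_new, logp_old)):
        ln, lo = np.asarray(ln), np.asarray(lo)
        ratio = np.exp(ln - lo)                    # per-token IS weights
        r_hat = np.clip(ratio, 1 - eps_low, 1 + eps_high)  # stop-gradient weight
        total += np.sum(-r_hat * adv[i] * ln)
        n_tokens += len(ln)
    return total / n_tokens                        # normalize by total token count
```

An AdamW step on $\theta$ would then minimize this loss, with $\pi_{\theta_{\mathrm{old}}}$ re-synchronized at the fixed refresh interval.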
4. Variance Reduction and Signal Preservation
Unclipped IS weights can yield outlier magnitudes in gradient estimates, especially when the target policy assigns sharply increased probability to rare tokens in the sampled trajectories. Such variance destabilizes training. Clipping at $1 + \varepsilon^{\mathrm{IS}}_{\mathrm{high}}$ (e.g., $\varepsilon^{\mathrm{IS}}_{\mathrm{high}} = 0.3$) mitigates these spikes, leading to smoother reward and accuracy trajectories, as observed empirically in MiniMax-M1 experiments.
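A quick simulation illustrates the effect. The lognormal spread of the simulated ratios is an arbitrary assumption for illustration, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-token IS ratios: most near 1, a few spiking sharply.
ratios = np.exp(rng.normal(0.0, 0.8, size=10_000))
clipped = np.clip(ratios, 1 - 10.0, 1 + 0.3)  # lower clip effectively inert

print(ratios.var(), clipped.var())  # clipping collapses the heavy right tail
```

The upper clip bounds every weight by $1.3$, so the heavy right tail that dominates the gradient variance disappears while every token retains a nonzero weight.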
Crucially, CISPO preserves the learning signal for all tokens, including those with low base probability but high subsequent reward (e.g., reflective tokens such as “Aha” and “Wait”). Empirical analysis demonstrates these are essential for chain-of-thought stability, in contrast to PPO's token-dropping tendency.
5. Computational Implementation and Complexity
The overhead of CISPO is negligible relative to standard RL objectives. Clipping IS weights is a simple operation and, in practice, both ratio computation and clipping are fused into a single GPU kernel. CISPO is compatible with mixed or FP32 precision heads, crucially maintaining congruence between training and inference token probabilities.
No value function estimators or large auxiliary matrices are required, reducing memory consumption and implementation complexity compared to certain actor–critic architectures.
6. Hyperparameter Choices and Ablations
Key CISPO hyperparameters:
- $\varepsilon^{\mathrm{IS}}_{\mathrm{high}}$: typical range $0.2$–$0.4$ ($0.3$ offers optimal performance per ablation).
- Group size $G$ (number of responses per prompt).
- Off-policy refresh frequency (how often $\pi_{\theta_{\mathrm{old}}}$ is synchronized with $\pi_\theta$).
- AdamW optimizer (learning rate, $\beta_1$, $\beta_2$, and weight decay per the MiniMax-M1 setup).
- Batch size set by a target total token count per step.
Ablation studies (Fig. 3 in (MiniMax et al., 16 Jun 2025)) show that $\varepsilon^{\mathrm{IS}}_{\mathrm{high}} = 0.3$ yields superior variance reduction without introducing significant bias into the updates.
7. Empirical Results and Comparative Performance
On the AIME 2024 mathematical reasoning benchmark using Qwen2.5-32B-base, CISPO attains higher accuracy than GRPO and DAPO at a matched number of RL steps, and matches DAPO's final performance with roughly half the training steps; both GRPO and DAPO require substantially more training to reach the same point, while CISPO converges fastest.
CISPO’s sample efficiency is also noteworthy, reaching DAPO’s performance peak in roughly half of the corresponding wall-clock time. Training trajectories are markedly smoother, with fewer oscillations, validating the variance-reduction and signal-preservation design. In summary, CISPO is a principled modification of the off-policy policy-gradient family: it bounds IS weights rather than dropping surrogate-objective terms, enabling faster, more stable RL training in large-scale reasoning models (MiniMax et al., 16 Jun 2025).