Group Sequence Policy Optimization (2507.18071v2)

Published 24 Jul 2025 in cs.LG, cs.AI, and cs.CL

Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training LLMs. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

Summary

  • The paper introduces a sequence-level off-policy correction method that replaces high-variance token-level importance weighting in RL.
  • It leverages a normalized importance ratio computed over entire sequences to enhance stability and efficiency, particularly in MoE architectures.
  • Empirical results show improved training accuracy and benchmark performance compared to token-based methods, reducing the need for complex stabilization strategies.

Group Sequence Policy Optimization: A Sequence-Level Approach to Stable RL for LLMs

Introduction

Group Sequence Policy Optimization (GSPO) addresses critical stability and efficiency challenges in reinforcement learning (RL) for LLMs, particularly in the context of Mixture-of-Experts (MoE) architectures. The paper identifies fundamental flaws in token-level importance weighting, as used in Group Relative Policy Optimization (GRPO), and introduces a theoretically grounded, sequence-level alternative. GSPO aligns the unit of off-policy correction with the unit of reward, resulting in improved training stability, efficiency, and performance, and obviates the need for complex stabilization strategies such as Routing Replay in MoE RL.

Motivation and Theoretical Foundations

The instability of GRPO is traced to its misapplication of importance sampling at the token level. In GRPO, the importance ratio is computed for each token as w_{i,t}(\theta) = \frac{ \pi_{\theta}(y_{i,t} \mid x, y_{i,<t}) }{ \pi_{\theta_\text{old}}(y_{i,t} \mid x, y_{i,<t}) }, but this approach introduces high-variance noise, especially for long sequences, because importance sampling theory requires averaging over multiple samples and a single token provides no such averaging. This noise accumulates over the sequence and is exacerbated by the clipping mechanism, leading to catastrophic and often irreversible model collapse during RL training of large models.

GSPO resolves this by defining the importance ratio at the sequence level:

s_i(\theta) = \left( \frac{ \pi_{\theta}(y_i \mid x) }{ \pi_{\theta_\text{old}}(y_i \mid x) } \right)^{1/|y_i|}

This formulation aligns the granularity of the off-policy correction with that of the reward, which is also assigned at the sequence level. The 1/|y_i| exponent normalizes by sequence length, which controls variance and keeps the ratio within a consistent numerical range across responses of different lengths.
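
For concreteness, here is a minimal sketch (PyTorch-style; the tensor names and shapes are our assumptions, not the paper's) of how the token-level GRPO ratio and the length-normalized sequence-level GSPO ratio can be computed from per-token log-probabilities:

```python
import torch

def importance_ratios(logp_new, logp_old, mask):
    """Token-level (GRPO-style) and sequence-level (GSPO-style) ratios.

    logp_new, logp_old: [B, T] per-token log-probs under pi_theta and pi_theta_old.
    mask:               [B, T] with 1 for response tokens, 0 for padding.
    """
    # GRPO: one ratio per token, pi_theta(y_t | x, y_<t) / pi_theta_old(y_t | x, y_<t).
    # Padding positions come out as exp(0) = 1 and carry no signal.
    token_ratio = torch.exp((logp_new - logp_old) * mask)              # [B, T]

    # GSPO: one ratio per sequence,
    # (pi_theta(y | x) / pi_theta_old(y | x))^(1 / |y|)
    # = exp( (1/|y|) * sum_t (logp_new_t - logp_old_t) ).
    lengths = mask.sum(dim=-1).clamp(min=1)                            # |y_i|
    log_seq_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    seq_ratio = torch.exp(log_seq_ratio)                               # [B]

    return token_ratio, seq_ratio
```

Working in log space and exponentiating only once keeps the sequence-level ratio numerically stable even for long responses.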

Algorithmic Formulation

The GSPO objective is:

\mathcal{J}_\text{GSPO}(\theta) = \mathbb{E}_{ x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}(\cdot \mid x) } \left[ \frac{1}{G} \sum_{i=1}^{G} \min \left( s_i(\theta)\, \widehat{A}_i,\ \mathrm{clip}\left( s_i(\theta), 1 - \varepsilon, 1 + \varepsilon \right) \widehat{A}_i \right) \right]

where \widehat{A}_i is the normalized group-based advantage for response y_i.
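
A minimal sketch of the corresponding clipped surrogate, written as a loss for gradient descent (the clipping range below is a placeholder value, not the paper's setting):

```python
import torch

def gspo_loss(seq_ratio, advantages, eps=0.2):
    """Clipped sequence-level surrogate, negated so that minimizing it
    ascends the GSPO objective.

    seq_ratio:  [G] length-normalized sequence ratios s_i(theta) (with grad).
    advantages: [G] group-normalized advantages A_hat_i (treated as constants).
    eps:        clipping range; a placeholder, not the paper's setting.
    """
    advantages = advantages.detach()
    unclipped = seq_ratio * advantages
    clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()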

The gradient of the GSPO objective is:

\nabla_{\theta} \mathcal{J}_\text{GSPO}(\theta) = \mathbb{E}_{ x,\, \{y_i\} } \left[ \frac{1}{G} \sum_{i=1}^{G} s_i(\theta)\, \widehat{A}_i \cdot \nabla_{\theta} \log s_i(\theta) \right]

This contrasts with GRPO, where token-level importance ratios introduce instability due to their high variance and sensitivity to expert routing in MoE models.
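
Expanding \nabla_{\theta} \log s_i(\theta) makes the contrast explicit. Since s_i(\theta) depends on \theta only through \log \pi_{\theta}(y_i \mid x) = \sum_t \log \pi_{\theta}(y_{i,t} \mid x, y_{i,<t}), we have

\nabla_{\theta} \log s_i(\theta) = \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_{\theta} \log \pi_{\theta}(y_{i,t} \mid x, y_{i,<t})

so every token of a response is weighted by the same factor s_i(\theta)\, \widehat{A}_i / |y_i|, whereas in GRPO each token's gradient is scaled by its own noisy ratio w_{i,t}(\theta).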

A token-level variant, GSPO-token, is also introduced for scenarios requiring finer-grained advantage adjustment, such as multi-turn RL. However, when all token advantages are set equal, GSPO-token is numerically equivalent to GSPO.
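
One way such a variant can be realized (a sketch under our reading; the exact construction should be checked against the paper) is to give each token a ratio whose value equals s_i(\theta) but whose gradient flows through that token's own log-probability, e.g. via a stop-gradient trick:

```python
import torch

def gspo_token_ratios(logp_new, seq_ratio, mask):
    """Per-token ratios that are numerically equal to the sequence ratio
    s_i(theta) but route gradients through individual token log-probs,
    allowing per-token advantages. A sketch, not the paper's exact form.

    logp_new:  [B, T] per-token log-probs under pi_theta (requires grad).
    seq_ratio: [B]    length-normalized sequence ratios s_i(theta).
    mask:      [B, T] 1 for response tokens, 0 for padding.
    """
    token_prob = torch.exp(logp_new)
    # Value: sg[s_i] * p_t / sg[p_t] == s_i.  Gradient: s_i * grad log p_t.
    ratios = seq_ratio.detach().unsqueeze(-1) * token_prob / token_prob.detach()
    return ratios * mask
```

With all per-token advantages set equal to \widehat{A}_i, this recovers the equivalence with GSPO noted above.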

Empirical Results

GSPO demonstrates superior training stability and efficiency compared to GRPO across multiple benchmarks, including AIME'24, LiveCodeBench, and CodeForces. Notably, GSPO achieves higher training accuracy and benchmark performance under equivalent compute and data budgets.

Figure 1: Training curves of a cold-start model fine-tuned from Qwen3-30B-A3B-Base, showing GSPO's higher training efficiency compared to GRPO.

A key empirical observation is that GSPO clips a much larger fraction of tokens than GRPO (by two orders of magnitude), yet still achieves better training efficiency. This counter-intuitive result highlights the inefficiency and noisiness of GRPO's token-level gradient estimates, whereas GSPO's sequence-level approach provides a more reliable learning signal.

Figure 2: Average fractions of clipped tokens over the RL training of GSPO and GRPO, illustrating GSPO's higher clipping rate but superior efficiency.
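
As a rough illustration of why sequence-level clipping removes so many tokens at once, here is a sketch (our construction, with hypothetical names; not necessarily the paper's exact metric) that counts every token of a gradient-zeroed sequence as clipped:

```python
import torch

def clipped_token_fraction_gspo(seq_ratio, advantages, mask, eps=0.2):
    """Fraction of response tokens whose sequence was clipped out of the
    GSPO gradient. A rough illustration, not necessarily the paper's metric.
    """
    # A sequence contributes no gradient when the clipped branch is selected
    # by the min, i.e. when the ratio leaves [1 - eps, 1 + eps] on the
    # disadvantageous side of the advantage sign.
    too_high = (seq_ratio > 1.0 + eps) & (advantages > 0)
    too_low = (seq_ratio < 1.0 - eps) & (advantages < 0)
    clipped_seq = (too_high | too_low).float()                    # [B]
    lengths = mask.sum(dim=-1).float()                            # [B]
    return (clipped_seq * lengths).sum() / lengths.sum()
```

Because a single clipped sequence accounts for all of its tokens, the token-level clipped fraction is naturally much larger than in GRPO, where tokens are clipped individually.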

MoE Training and Routing Replay

MoE models present unique challenges due to expert-activation volatility. In GRPO, the set of activated experts for a given response can change significantly after each gradient update, causing token-level importance ratios to fluctuate and destabilize training. The Routing Replay strategy, which caches and reuses expert routing decisions, is required for GRPO to converge in MoE RL but introduces additional memory and communication overhead.

GSPO eliminates the need for Routing Replay by focusing on sequence-level likelihoods, which are robust to changes in token-level expert activation. This simplification not only stabilizes training but also allows the MoE model to utilize its full capacity without artificial constraints.

Figure 3: The Routing Replay strategy is essential for GRPO's convergence in MoE RL, but GSPO obviates this requirement.

Implications for RL Infrastructure

GSPO's reliance on sequence-level likelihoods, rather than token-level, increases tolerance to precision discrepancies between training and inference engines. This enables direct use of inference engine likelihoods for optimization, reducing the need for recomputation and facilitating more efficient RL infrastructure, especially in disaggregated training-inference frameworks and multi-turn RL scenarios.

Conclusion

GSPO provides a theoretically sound and empirically validated solution to the instability and inefficiency of token-level importance weighting in RL for LLMs. By aligning the unit of off-policy correction with the unit of reward, GSPO achieves superior stability, efficiency, and performance, particularly in large-scale and MoE settings. The elimination of complex stabilization strategies and improved infrastructure compatibility position GSPO as a robust foundation for future RL scaling in LLMs. Future work may explore further extensions to multi-turn and partially observable RL, as well as integration with advanced reward modeling and credit assignment techniques.
