Group Sequence Policy Optimization (GSPO)
- Group Sequence Policy Optimization (GSPO) is a reinforcement learning approach that replaces token-level importance weighting with sequence-level objectives to reduce gradient variance.
- It incorporates optimistic lookahead and adaptive meta-gradient mechanisms to accelerate convergence and stabilize updates in large model training.
- GSPO streamlines infrastructure by aligning reward granularity with policy updates, effectively supporting both dense and Mixture-of-Experts architectures.
Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm developed to address stability, efficiency, and scalability in policy optimization—particularly in LLM training. GSPO departs from conventional token-level importance weighting by introducing a sequence-level objective, with the aim of aligning optimization units with the granularity at which rewards are assigned. When equipped with optimistic lookahead and adaptive meta-gradient mechanisms, GSPO accelerates convergence and stabilizes updates, supporting the training of both dense and Mixture-of-Experts (MoE) models.
1. Conceptual Framework and Motivation
GSPO originates from the observation that standard policy optimization—such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO)—relies on token-level importance weighting. This approach, while theoretically sound for per-token rewards, can introduce high-variance gradient estimates when rewards are assigned at the end of sequences, as is typical in LLM RL training.
In GSPO, the critical innovation is the use of sequence-level importance weighting:
- The importance ratio for each group sample $y_i$ is defined as
$$ s_i(\theta) = \left( \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{\frac{1}{|y_i|}}, $$
where $\pi_{\theta}(y_i \mid x)$ and $\pi_{\theta_{\text{old}}}(y_i \mid x)$ denote the current and previous policy sequence likelihoods, respectively, and $|y_i|$ denotes the sequence length of candidate $y_i$.
- This length-normalized ratio is, by construction, less sensitive to local variability and high-variance token probabilities. It keeps the optimization unit commensurate with the granularity of reward assignment, which in RLHF and related LLM fine-tuning setups is typically the complete response (a minimal sketch of the computation follows this list).
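As an illustration of the formula above, here is a minimal PyTorch-style sketch of the length-normalized sequence ratio; the function name and tensor layout (`logp_new`, `logp_old`, a padding `mask`) are assumptions for exposition, not part of any reference implementation.

```python
import torch

def sequence_importance_ratio(logp_new, logp_old, mask):
    """Length-normalized sequence-level importance ratio s_i(theta).

    logp_new, logp_old: [batch, seq_len] per-token log-probs under the
        current and old policies (hypothetical tensor layout).
    mask: [batch, seq_len] with 1.0 for response tokens, 0.0 for padding.
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)        # |y_i|
    # Sum token log-probs over the response -> sequence log-likelihoods.
    seq_logp_new = (logp_new * mask).sum(dim=-1)
    seq_logp_old = (logp_old * mask).sum(dim=-1)
    # (pi_new / pi_old)^(1/|y_i|), computed in log space for stability.
    log_ratio = (seq_logp_new - seq_logp_old) / lengths
    return torch.exp(log_ratio)                      # [batch]
```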
Empirical studies confirm that sequence-level ratios avoid the pathological gradient explosion and instability that can emerge when multiplying many high-variance token-level weights, especially in long-form or multi-turn sequence tasks.
2. Core Algorithmic Formulation
The GSPO objective is
$$ J_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_i \Big) \right], $$
where each sample's normalized group advantage is
$$ \hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}, $$
with $r(x, y_i)$ denoting the reward (e.g., from a reward model or task-specific metric).
The key technical choices include:
- Clipping at the sequence level: Rather than applying PPO-style clipping to each token, importance sampling and clipping operate on whole sequences. Responses whose sequence-level ratio falls outside the clipping range are excluded from the update, which directly limits the influence of overly off-policy samples.
- Group normalization: As with GRPO, rewards are normalized within each group, which has a variance-reducing effect and enables critic-free advantage estimation.
The GSPO policy gradient, with clipping ignored, is
$$ \nabla_{\theta} J_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} s_i(\theta)\, \hat{A}_i \cdot \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_{\theta} \log \pi_{\theta}\big(y_{i,t} \mid x, y_{i,<t}\big) \right]. $$
This shows that every token within a sequence receives an equally weighted policy gradient, in contrast to GRPO where token weights can differ substantially due to local variability.
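To make the objective concrete, the following hedged PyTorch sketch combines group-normalized advantages with sequence-level clipping; the function name `gspo_loss`, the flat batch layout (consecutive rows sharing a prompt group), and the default `clip_eps` value are illustrative assumptions rather than prescribed settings.

```python
import torch

def gspo_loss(logp_new, logp_old, mask, rewards, group_size, clip_eps=0.2):
    """Clipped sequence-level GSPO objective (negated so minimizing it maximizes J).

    logp_new, logp_old: [batch, seq_len] per-token log-probs (current / old policy).
    mask:               [batch, seq_len] with 1.0 for response tokens, 0.0 for padding.
    rewards:            [batch] scalar rewards; consecutive group_size rows share a prompt.
    clip_eps:           clipping range (placeholder value; tune per setup).
    """
    # Group-normalized advantages: (r - mean) / std within each prompt group.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=-1, keepdim=True)) / (r.std(dim=-1, keepdim=True) + 1e-8)
    adv = adv.view(-1)

    # Length-normalized sequence-level importance ratio s_i(theta).
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = torch.exp(log_ratio)

    # Clipping is applied once per sequence, so an entire off-policy response
    # is either kept or excluded from the gradient.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio and the clip are computed once per response, every token in a retained response contributes an equally weighted gradient, matching the expression above.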
3. Optimistic Foresight and Adaptive Updates
GSPO accelerates convergence by incorporating "foresight" (optimism) and adaptivity:
- Optimism: The update direction is informed not only by the latest gradient but also by predicted future gradients, implemented as an auto-regressive update:
$$ m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t, \qquad \tilde{g}_{t+1} = m_t, $$
where $\tilde{g}_{t+1}$ is a gradient forecast, $\beta$ is a learning parameter, and $m_t$ is a momentum term. The policy update proceeds via
$$ \theta_{t+1} = \Pi_{\Theta}\big( \theta_t + \eta\, (g_t + \tilde{g}_{t+1}) \big), $$
with $\Pi_{\Theta}$ as the projection onto the feasible set (e.g., the probability simplex).
- Adaptivity/meta-gradient: The update rule itself is parameterized (by meta-parameters $\phi$, such as the step size and momentum coefficient) and adapted by meta-gradient learning:
$$ \phi_{t+1} = \phi_t - \eta_{\phi}\, \nabla_{\phi}\, \mathcal{L}_{\text{meta}}\big(\theta_{t+1}(\phi_t)\big), $$
where $\mathcal{L}_{\text{meta}}$ is a meta-objective measuring how close the updated policy is to the "optimistic target" after lookahead.
This combination lets GSPO avoid both overshooting (when gradient predictions are poor) and slow convergence (when updates are too cautious), providing adaptation at both the per-update and meta-update levels.
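As a toy illustration only (the routine names `project_simplex` and `optimistic_step`, the NumPy setting, and the tabular policy are assumptions, not part of GSPO's published formulation), the sketch below shows an optimistic projected-ascent step in which a momentum term serves as the gradient forecast:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def optimistic_step(theta, grad, state, eta=0.1, beta=0.9):
    """One optimistic projected-ascent step with a momentum-based gradient forecast.

    theta: current policy parameters (a point on the simplex in this toy setting).
    grad:  latest policy gradient g_t.
    state: dict holding the momentum term m_{t-1}.
    """
    # Auto-regressive forecast of the next gradient via momentum.
    m = beta * state.get("m", np.zeros_like(grad)) + (1.0 - beta) * grad
    state["m"] = m
    # Optimistic update: current gradient plus the forecast, then project back.
    theta_next = project_simplex(theta + eta * (grad + m))
    return theta_next, state
```

In a full GSPO loop, the step size and momentum coefficient would themselves be adjusted by the meta-gradient update described above rather than held fixed.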
4. Empirical Performance and MoE Stabilization
GSPO demonstrates superior empirical performance relative to group-based baselines such as GRPO:
- Training stability: Reward curves show continuously improving training reward, even under challenging conditions. GSPO avoids instability and collapse phenomena observed in token-weighted methods.
- Downstream benchmarks: On AIME’24 (Pass@1 over 32 samples), LiveCodeBench (Pass@1 over 8 samples), and CodeForces (Elo Rating), GSPO matches or outperforms prior algorithms.
- Clipping fraction: GSPO clips a larger fraction of tokens per update, reflecting tighter control over sequence-level importance weights and a preference for on-policy updates.
MoE RL training is notably stabilized by GSPO. Token-level methods suffer from expert-activation volatility and have required workarounds such as Routing Replay (caching and replaying expert routing choices). Because GSPO's gradients depend only on the overall sequence likelihood, rather than on the dynamically shifting subset of expert-routed tokens, it is robust to this variance and eliminates the need for such memory- and communication-intensive heuristics.
5. Practical Advantages and Deployment Considerations
GSPO improves the deployability and scalability of RL training for large models:
- Infrastructure simplicity: The sequence-likelihood-based design allows the same likelihoods computed by inference engines (e.g., SGLang, vLLM) to be used directly in optimization, eliminating the need to synchronize or recompute per-token probabilities in the training engine (see the sketch after this list).
- Gradient robustness: The equally weighted token gradients within each response reduce exposure to numerical noise arising from mixed-precision or low-rank approximations in distributed/federated settings.
- Multi-stage RL and partial-generation: The framework handles partial rollouts and multi-turn tasks without additional design complexity, as the clipping and likelihood calculation remain sequence-coherent.
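A minimal sketch of this simplification, assuming the rollout engine reports a cumulative log-probability per generated response (the argument names below are hypothetical and do not correspond to actual SGLang or vLLM APIs):

```python
import torch

def ratio_from_rollout_logps(rollout_seq_logp, train_seq_logp, lengths):
    """Sequence-level ratio computed directly from engine-reported likelihoods.

    rollout_seq_logp: [batch] total log-prob of each response as reported by the
                      inference engine that generated it (old policy).
    train_seq_logp:   [batch] total log-prob of the same responses under the
                      current training policy.
    lengths:          [batch] response lengths |y_i|.
    """
    # Because GSPO only needs the whole-sequence likelihood, the rollout
    # engine's own log-probs can stand in for pi_old, with no per-token
    # re-alignment between inference and training engines.
    log_ratio = (train_seq_logp - rollout_seq_logp) / lengths.float().clamp(min=1.0)
    return torch.exp(log_ratio)
```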
Practitioners should tune the clipping range $\varepsilon$ to balance exploitation and exploration. The lookahead horizon and the optimizer choice for the meta-learner also affect convergence speed and cumulative regret; empirical studies find that a modest lookahead horizon reliably improves efficiency.
6. Relationship to and Progress Beyond GRPO
GSPO's principal advantages over GRPO (Group Relative Policy Optimization) and related methods are summarized in the following table:
| Aspect | GRPO | GSPO |
| --- | --- | --- |
| Importance ratio granularity | Token-level | Sequence-level |
| Clipping | Per-token | Per-sequence |
| Gradient variance | High; accumulates across tokens | Low; evenly distributed within a response |
| MoE expert handling | Unstable; requires Routing Replay | Robust without replay |
| Reward alignment | Mismatched with sequence-level rewards | Consistent with sequence-level rewards |
| Infrastructure complexity | Higher (per-token probability alignment) | Lower (sequence-level likelihoods) |
By aligning the optimization unit with the reward unit, GSPO mitigates the inherently high variance of per-token reweighting, stabilizes updates across long sequences and expert switches, and markedly simplifies reinforcement learning infrastructure for both dense and MoE LLMs. This foundation has enabled advancements in models such as Qwen3.
7. Theoretical Guarantees and Design Flexibility
GSPO provides theoretical and practical motivation for merging foresight with adaptation in policy optimization:
- Where optimistic gradient prediction is accurate, GSPO can achieve near-linear convergence rates—superior to the typical sublinear rates of standard policy gradients.
- Adaptive meta-updates regularize step sizes dynamically based on prediction error, reducing both over-correction and lag in response to environmental change.
- Group or batched updates within GSPO enable efficient statistical reuse, further reducing sample complexity.
- The flexibility in lookahead design—local (refining the gradient within the current episode) or global/greedy (applying extra improvement steps)—allows tailored acceleration.
Overall, GSPO constitutes a shift toward robust, foresight-guided, and adaptively corrected policy optimization suitable for the demands of modern large-scale RL and LLM training.