
GRPO: Group-normalized PPO

Updated 12 January 2026
  • Group-normalized Proximal Policy Optimization (GRPO) is a reinforcement learning algorithm that replaces learned value functions with group-wise, reward-normalized advantage estimation.
  • It reduces gradient estimator variance and enhances policy update stability in high-dimensional settings like MoE transformers, sequence modeling, and combinatorial control.
  • GRPO leverages PPO’s clipped importance-ratio surrogate objective, achieving improved sample efficiency, reliable convergence, and robust performance across diverse applications.

Group-normalized Proximal Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm that generalizes and modifies Proximal Policy Optimization (PPO) by replacing value-function–based advantage estimation with group-wise, reward-normalized advantages derived from empirical sampling. The core motivation is to reduce gradient estimator variance and stabilize policy updates, particularly in settings where the value function is unreliable or hard to train, such as in Mixture-of-Experts (MoE) transformers, sequence modeling, LLM fine-tuning, combinatorial control, and real-world applications. GRPO achieves this by sampling groups of actions per state, computing a group-normalized relative advantage for each, and utilizing PPO’s clipped importance-ratio surrogate objective. Extensive empirical and theoretical work demonstrates that GRPO combines the convergence properties of PPO with improved variance properties and practical training stability (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025, Guo et al., 21 Sep 2025, Zhang et al., 18 Sep 2025).

1. Foundations and Motivation

In PPO, policy updates rely on advantage estimates, typically computed as $A_t = Q^{\pi_\text{old}}(s_t, a_t) - V^{\pi_\text{old}}(s_t)$ using a learned critic (value function). PPO further stabilizes optimization with an importance-ratio-clipped surrogate objective and, optionally, a KL-divergence penalty. However, high-dimensional or architecturally complex settings (e.g., MoE transformers), unreliable value-function estimation, and multi-objective tasks can compromise PPO's stability and efficiency.

GRPO addresses these issues by:

  • Sampling a group (size $G$) of actions per data point/state from the old policy,
  • Computing rewards for each action,
  • Calculating normalized, group-relative advantages, and
  • Applying PPO’s clipped surrogate objective and trust-region enforcement.

Group normalization lowers the variance of the gradient estimator and absorbs stochasticity from system properties, such as MoE expert routing (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025, Guo et al., 21 Sep 2025, Zhang et al., 18 Sep 2025). GRPO also eliminates the need for a separate value critic—a key simplification that improves robustness and sample efficiency in complex domains.
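
As a minimal illustration of the group-normalization step described above (a sketch, not code from the cited papers; the reward values and the stabilizing constant `eps` are arbitrary choices):

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center and scale a group of rewards sampled for the same state/prompt.

    rewards: shape (G,), one scalar reward per sampled action in the group.
    Returns group-relative advantages with zero mean and roughly unit scale.
    """
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)

# Example: G = 4 completions for one prompt, each scored by a reward model.
advs = group_normalized_advantages(np.array([0.1, 0.8, 0.4, 0.6]))
```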

2. Mathematical Formulation and Algorithmic Structure

The defining workflow of GRPO is as follows (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025, Guo et al., 21 Sep 2025):

  1. Group Action Sampling:
    • For each state or data point (e.g., context, input prompt, state $s$), sample a group $\mathcal{A} = \{a_1, \dots, a_G\}$ of actions independently using the old policy $\pi_\text{old}$.
  2. Group-Relative Advantage Computation:
    • For each $a_i$, obtain a reward $r_i$.
    • Define the group mean $\mu$ and (optionally stabilized) standard deviation $\sigma$ over the group's rewards, and set

    $$\hat{A}_i = \frac{r_i - \mu}{\sigma + \delta}$$

    with a small $\delta > 0$ for numerical stability.

  3. Probability Ratio Calculation (Trust Region):
    • Compute $\rho_i = \dfrac{\pi_\theta(a_i \mid s)}{\pi_\text{old}(a_i \mid s)}$.
  4. Clipped Surrogate Loss:
    • For each $i$,

    $$\mathcal{L}_{\text{unclipped},i} = \rho_i \cdot \hat{A}_i$$

    $$\mathcal{L}_{\text{clipped},i} = \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon) \cdot \hat{A}_i$$

    $$\mathcal{L}_{\text{policy}} = -\mathbb{E}_i\left[\min\left(\mathcal{L}_{\text{unclipped},i},\, \mathcal{L}_{\text{clipped},i}\right)\right]$$

  5. KL Regularization (optional):
    • A penalty $\mathrm{KL}(\pi_\text{old} \,\|\, \pi_\theta)$ is added with a small coefficient $\lambda_{\mathrm{KL}}$.
  6. Total Loss:
    • $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{policy}} + \lambda_{\mathrm{KL}} \cdot \mathrm{KL}$
  7. Parameter Update:
    • Gradient descent/ascent is performed on $\mathcal{L}_{\text{total}}$.

Key hyperparameters include the group size $G$, the clipping parameter $\epsilon$, the KL penalty weight $\lambda_{\mathrm{KL}}$, and the optimizer learning rate $\alpha$. Empirical studies suggest $G$ between 4 and 8 suffices in transformer/MoE applications; larger $G$ improves variance reduction (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025, Guo et al., 21 Sep 2025).
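
The workflow above can be condensed into a single loss computation per group. The sketch below is an illustrative PyTorch-style implementation under the assumption that per-action log-probabilities (e.g., summed token log-probs of each sampled completion) are already available; the function name `grpo_loss`, the tensor shapes, and the sample-based KL estimator are assumptions made for this sketch, not details fixed by the cited papers.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) log pi_theta(a_i | s) for the sampled actions
              logp_old: torch.Tensor,   # (G,) log pi_old(a_i | s), detached / no grad
              rewards: torch.Tensor,    # (G,) scalar reward per sampled action
              clip_eps: float = 0.2,
              kl_coef: float = 0.01,
              std_eps: float = 1e-8) -> torch.Tensor:
    # Step 2: group-relative advantages; no value network is involved.
    adv = (rewards - rewards.mean()) / (rewards.std() + std_eps)

    # Step 3: importance ratios rho_i = pi_theta(a_i|s) / pi_old(a_i|s).
    ratio = torch.exp(logp_new - logp_old)

    # Step 4: clipped surrogate objective, elementwise minimum as in PPO.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Step 5: sample-based estimate of KL(pi_old || pi_theta) over the group
    # (one of several KL estimators used in practice).
    kl = (logp_old - logp_new).mean()

    # Step 6: total loss to minimize (Step 7, the optimizer update, happens outside).
    return policy_loss + kl_coef * kl
```

In an LLM fine-tuning setting, this per-group loss would be averaged over all prompts in a batch, with one group of $G$ sampled completions per prompt.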

3. Theoretical Properties and Variance Reduction

GRPO inherits PPO’s trust-region properties through the clipped objective and KL penalty, ensuring that large policy updates are penalized and monotonic improvement is approximately preserved for sufficiently small steps (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025).

Importantly, group-wise advantage normalization directly reduces the variance of policy gradient estimates:

$$\operatorname{Var}_{\text{GRPO}}(\nabla_\theta) < \operatorname{Var}_{\text{PPO}}(\nabla_\theta)$$

for fixed policy and sufficiently large group size, due to centering and scaling of within-group advantages (Togootogtokh et al., 5 Mar 2025). In the context of multi-objective or multi-expert systems, group normalization further shields the optimization from high-variance components introduced by architectural routing or reward noise.
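
The inequality above can be probed with a toy score-function (REINFORCE-style) estimator on a softmax bandit. The setup below is purely illustrative (arm count, reward offset, and noise level are arbitrary choices); it compares the empirical variance of gradient estimates weighted by raw rewards versus group-normalized advantages under a fixed policy.

```python
import numpy as np

rng = np.random.default_rng(0)
K, G, trials = 5, 8, 2000                      # arms, group size, Monte Carlo repeats
theta = rng.normal(size=K)                     # fixed softmax-policy logits
probs = np.exp(theta) / np.exp(theta).sum()
true_means = np.linspace(5.0, 6.0, K)          # rewards share a large common offset

def grad_logp(a: int) -> np.ndarray:
    onehot = np.zeros(K)
    onehot[a] = 1.0
    return onehot - probs                      # gradient of log softmax(theta)[a] w.r.t. theta

def grad_variance(normalize: bool) -> float:
    grads = []
    for _ in range(trials):
        acts = rng.choice(K, size=G, p=probs)
        rews = true_means[acts] + rng.normal(scale=0.5, size=G)
        w = (rews - rews.mean()) / (rews.std() + 1e-8) if normalize else rews
        grads.append(np.mean([wi * grad_logp(a) for wi, a in zip(w, acts)], axis=0))
    return float(np.var(np.stack(grads), axis=0).sum())

print("gradient variance, raw rewards:      ", grad_variance(normalize=False))
print("gradient variance, group-normalized: ", grad_variance(normalize=True))
```

Because the rewards share a large common offset, centering within each group removes most of the estimator's variance, mirroring the role a baseline plays in PPO.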

A key distinction from PPO is that GRPO's advantage estimator is unbiased but can exhibit higher variance when the group size is too small or the reward distribution is highly stochastic. Hybrid approaches (e.g., Hybrid GRPO) combine group-based normalization with value-function bootstrapping for further variance control (Sane, 30 Jan 2025).

4. Relationship and Comparison to PPO and Other Baselines

Standard PPO uses a single action per state and relies on a learned value function $V(s)$ (baseline) for variance reduction. Advantages are typically computed with Generalized Advantage Estimation (GAE), which also depends on accurate critic learning.

In contrast, GRPO:

  • Does not require a learned critic or value network.
  • Uses empirical, group-based baselines for advantage estimation.
  • Enables easier extension to domains where value functions are hard to fit (e.g., intricate architectures, continuous control).
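
For contrast, a minimal sketch of the GAE recursion that standard PPO depends on is shown below; the function and parameter names are illustrative, and the `values` array stands for the learned critic output that GRPO dispenses with.

```python
import numpy as np

def gae_advantages(rewards: np.ndarray,   # (T,) per-step rewards
                   values: np.ndarray,    # (T+1,) critic estimates V(s_0), ..., V(s_T)
                   gamma: float = 0.99,
                   lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation: exponentially weighted TD residuals."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Any bias in `values` propagates into every advantage estimate; GRPO's group baseline avoids this dependence at the cost of requiring $G$ samples per state.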

Empirically, GRPO delivers higher sample efficiency and reduced gradient variance in many RL tasks, including MoE voice pathology detection, simulated control, and bandit problems. For example, VoiceGRPO (MoE transformer + GRPO) achieved test accuracy $0.9860$ (vs. $0.9762$ PPO), F1 $0.9845$ (vs. $0.9794$), and ROC-AUC $0.9988$ (vs. $0.9984$) on synthetic voice pathology data (Togootogtokh et al., 5 Mar 2025). In hyperparameter optimization, GRPOformer (Transformer + GRPO) consistently outperformed Bayesian optimization and PPO-based methods across multiple OpenML tasks (Guo et al., 21 Sep 2025).

Comparison Table:

| Method      | Value Function | Advantage Type     | Sampled Actions | Empirical Variance | Trust Region |
|-------------|----------------|--------------------|-----------------|--------------------|--------------|
| PPO         | Required       | GAE (critic-based) | Single          | Lower (biased)     | Clipped + KL |
| GRPO        | None           | Group-normalized   | Group ($G$)     | Lower (unbiased)   | Clipped + KL |
| Hybrid GRPO | Optional       | Group + value      | Group           | Tunable            | Clipped + KL |

5. Architectural and Practical Considerations

GRPO is particularly suited to settings with latent routing or activation, high-dimensional control, and RL fine-tuning from sparse or difficult reward signals. Notable architectural applications include:

  • MoE Transformers: GRPO controls expert-routing stochasticity and accelerates convergence (Togootogtokh et al., 5 Mar 2025).
  • Hyperparameter Optimization: In GRPOformer, group actions are candidate configurations, with regularization via Policy Churn Regularization (PCR) to control KL divergence between models (Guo et al., 21 Sep 2025).
  • Hardware/Communication: In fluid antenna systems, GRPO leverages only the policy network, reducing required model size and FLOPs by approximately 49.2% compared to PPO with an actor-critic architecture (Zhang et al., 18 Sep 2025).
  • Off-policy settings: Extension to off-policy GRPO enables reuse of prior samples for advantage normalization, with theoretical guarantees provided for both regimes (Mroueh et al., 28 May 2025).

Implementation recommendations include careful selection of group size GG, avoidance of degenerate (zero-variance) reward groups, and judicious tuning of the KL penalty for stability.
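
A simple way to guard against degenerate (zero-variance) reward groups is sketched below; the `min_std` threshold and the choice to zero out such a group's advantages are implementation assumptions, not prescriptions from the cited papers.

```python
import numpy as np

def safe_group_advantages(rewards: np.ndarray, eps: float = 1e-8,
                          min_std: float = 1e-6) -> np.ndarray:
    """Group-normalize rewards, neutralizing degenerate (near-constant) groups.

    If all G rewards in the group are (nearly) identical, there is no relative
    signal; returning zeros effectively skips the policy update for that group
    instead of amplifying numerical noise.
    """
    sigma = rewards.std()
    if sigma < min_std:
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / (sigma + eps)
```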

6. Empirical Performance and Domain-Specific Results

GRPO consistently improves policy stability and sample efficiency in diverse application settings:

  • Voice Pathology Detection: On synthetic datasets, VoiceGRPO outperforms MoE-PPO in accuracy, F1, ROC-AUC, and convergence speed (Togootogtokh et al., 5 Mar 2025).
  • Indoor Fluid Antenna Systems: GRPO yields 5–15% higher sum-rate on joint optimization tasks, using approximately half the compute of PPO (Zhang et al., 18 Sep 2025).
  • Hyperparameter Optimization: GRPOformer achieves state-of-the-art best-to-rank (BtR), median, and mean normalized performance across 36 tasks; ablations indicate that both the GRPO and PCR components are critical (Guo et al., 21 Sep 2025).
  • Controlled RL Benchmarks: GRPO equals or surpasses PPO in convergence speed when adjusted for sample count, with hybrid variants delivering further improvements (Sane, 30 Jan 2025).

7. Extensions and Variants

Several variants of GRPO have been proposed to further enhance its properties:

  • Hybrid GRPO: Blends group-based empirical normalization with value-function bootstrapping for stability (Sane, 30 Jan 2025).
  • Entropy Regularization: Adds an entropy bonus for exploration in the surrogate objective.
  • Hierarchical Multi-Step Sampling: Generalizes advantage computation to multiple environment steps.
  • Adaptive Reward Normalization: Employs batch or rolling normalization to stabilize reward scaling over time.
  • Value-Guided Sampling: Samples actions using a learned action-value critic for improved action-selection bias.

GRPO’s design has proven to be adaptable across RL, control, combinatorial optimization, fine-tuning large-scale models, and hyperparameter search (Togootogtokh et al., 5 Mar 2025, Sane, 30 Jan 2025, Guo et al., 21 Sep 2025, Zhang et al., 18 Sep 2025, Mroueh et al., 28 May 2025).
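
As one concrete illustration of the entropy-regularization variant listed above, the sketch below adds a categorical-entropy bonus to the GRPO objective; the coefficient value and the categorical action distribution are assumptions made for the example.

```python
import torch

def grpo_loss_with_entropy(policy_loss: torch.Tensor,   # clipped surrogate term
                           kl: torch.Tensor,            # KL(pi_old || pi_theta) estimate
                           logits: torch.Tensor,        # (G, num_actions) new-policy logits
                           kl_coef: float = 0.01,
                           entropy_coef: float = 0.01) -> torch.Tensor:
    # Reward higher policy entropy (subtract it from the loss) to encourage exploration.
    entropy = torch.distributions.Categorical(logits=logits).entropy().mean()
    return policy_loss + kl_coef * kl - entropy_coef * entropy
```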


References:

  • Togootogtokh et al., 5 Mar 2025
  • Sane, 30 Jan 2025
  • Guo et al., 21 Sep 2025
  • Zhang et al., 18 Sep 2025
  • Mroueh et al., 28 May 2025
