
Group Relative Policy Optimization (GRPO)

Updated 5 April 2026
  • GRPO is a reinforcement learning paradigm that normalizes rewards within groups to reduce variance and enhance stable policy updates.
  • It eliminates the need for a learned value function by centering and scaling finite sample rewards, thereby promoting effective exploration.
  • The framework's extensions and theoretical grounding make it applicable to language models, robotics, molecular design, and multi-agent systems.

Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) paradigm that employs group-normalized advantage estimation and policy-gradient maximization based solely on finite samples within each prompt or input group. Originating in the context of LLM post-training with verifiable rewards, GRPO eliminates the need for a learned value function or critic by centering and scaling rewards intra-group, thereby stabilizing policy updates, reducing variance, and improving exploration. The approach applies to a broad array of domains, spanning language generation, representation learning, molecular optimization, robotics, and multi-agent systems. Its design, theoretical properties, and empirically observed benefits have led to widespread adoption in contemporary RL-for-LLM pipelines and extensions in multi-objective, modular, and process-reward oriented settings.

1. Mathematical Formalism and Surrogate Objective

GRPO operates as follows: for each context or prompt $q$, it samples a group $\{o_i\}_{i=1}^G$ of completions (e.g., sequences, class labels) from a fixed or slowly evolving old policy $\pi_{\theta_{\rm old}}$. For each output $o_i$, the model receives a scalar reward $r_i$ (e.g., accuracy, binary correctness, task score). The core innovation is the intra-group normalization of these rewards, producing a per-sample group-relative advantage:

$$A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G) + \epsilon}$$

where $\epsilon>0$ is a small constant added for numerical stability. The policy is then updated to maximize, across the group,

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}\sim \pi_{\theta_{\rm old}}(\cdot|q)}\left[\sum_{i=1}^G \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\rm old}}(o_i|q)} A_i - \beta\, \mathrm{KL}\bigl(\pi_{\theta}\|\pi_{\rm ref}\bigr) \right]$$

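where $\pi_\theta$ is the current policy, $\pi_{\rm ref}$ is an optional reference policy (e.g., the initial LM), and $\beta$ controls the strength of the KL regularization. Most practical implementations employ a PPO-style clipped surrogate objective to approximate trust-region updates and prevent catastrophic policy drift:

$$\mathcal{J}^{\rm clip}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}\sim \pi_{\theta_{\rm old}}(\cdot|q)}\left[\sum_{i=1}^G \min\bigl(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\varepsilon_{\rm clip},\, 1+\varepsilon_{\rm clip})\, A_i\bigr) - \beta\, \mathrm{KL}\bigl(\pi_{\theta}\|\pi_{\rm ref}\bigr) \right]$$

with $\rho_i = \pi_\theta(o_i|q)/\pi_{\theta_{\rm old}}(o_i|q)$ and the clipping range $\varepsilon_{\rm clip}$ (distinct from the stability constant $\epsilon$ in the advantage) typically set to 0.2.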
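The advantage normalization and the clipped surrogate can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the sequence-level log-probabilities, and the toy numbers are hypothetical, the population standard deviation is used, and the KL term is omitted for brevity.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of one group (one prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Assumption: population std (ddof=0); eps only guards against division by zero.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate summed over the group (to be maximized).

    Assumption: logp_* are sequence-level log-probabilities; the KL penalty is left out.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).sum()

# Toy group of G = 4 completions with binary correctness rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

logp_old = np.log(np.array([0.10, 0.20, 0.30, 0.40]))
logp_new = np.log(np.array([0.20, 0.20, 0.15, 0.40]))
objective = clipped_surrogate(logp_new, logp_old, adv)

print("advantages:", adv)
print("clipped objective:", objective)
```

In this toy group the first completion's ratio of 2.0 is clipped to 1.2 because its advantage is positive, while the third completion's ratio of 0.5 is clipped to 0.8, so neither sample can move the objective beyond the clipping range.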

2. Variance Reduction, Policy-Gradient Structure, and U-Statistic Theory

Central to GRPO's efficacy is its per-group baseline, which reduces variance relative to global or value-function baselines, especially in tasks whose contexts or prompts vary widely in difficulty. Each group provides a local, action-agnostic measure of "typical" reward, centering the advantage estimator and achieving zero-mean gradient updates within the group ($\sum_{i=1}^G A_i = 0$), as formalized in molecular optimization and mathematical reasoning (Javaid et al., 12 Feb 2026, Zhou et al., 1 Mar 2026). The policy gradient induced by GRPO is shown to be a U-statistic whose symmetric kernel is built from centered paired gradients (Zhou et al., 1 Mar 2026). The mean squared error (MSE) of this gradient estimator can be bounded explicitly and approaches that of the oracle (value-function) baseline as $G \to \infty$, making GRPO asymptotically optimal among a broad class of baseline-only estimators.
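The effect of a per-group baseline can be illustrated with a small simulation. This sketch rests on assumptions not taken from the cited papers: rewards are Bernoulli with a hypothetical per-prompt success probability, and the variance of the centered reward signal is used as a simple proxy for gradient variance.

```python
import numpy as np

rng = np.random.default_rng(0)
num_prompts, group_size = 2000, 8

# Hypothetical setup: prompts differ widely in difficulty (success probability).
p_success = rng.uniform(0.05, 0.95, size=num_prompts)
rewards = (rng.random((num_prompts, group_size)) < p_success[:, None]).astype(float)

# Per-group baseline: subtract each prompt's own mean reward.
group_centered = rewards - rewards.mean(axis=1, keepdims=True)
# Global baseline: subtract one mean shared by all prompts.
global_centered = rewards - rewards.mean()

# Centered rewards sum to zero inside every group.
max_group_sum = np.abs(group_centered.sum(axis=1)).max()

var_group = group_centered.var()
var_global = global_centered.var()

print("largest |within-group sum|:", max_group_sum)
print("variance with per-group baseline:", var_group)
print("variance with global baseline:  ", var_global)
```

With heterogeneous prompt difficulty, the global baseline leaves the between-prompt spread in the signal, whereas the per-group baseline removes it, so the first variance comes out smaller than the second.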

A universal scaling law governs the optimal group size, balancing group size against batch size under a fixed compute budget (Zhou et al., 1 Mar 2026).

3. Theoretical Properties and Alignment Perspective

GRPO's objective departs from RLHF-style log-pooling aggregation. After reward normalization, the stationary policy update takes a rational-pooling form controlled by a group-relative preference function and an effective reverse-KL penalty. For $G = 2$ this reduces to pairwise preference comparison; for large $G$, the normalization recovers mean/variance-normalized reward preference aggregation (Vojnovic et al., 25 Feb 2025). The framework explicitly distinguishes GRPO's aggregation from the exponential log-pooling of RLHF/NLHF, and shows that GRPO's KL term, as implemented, converges to the reverse KL at stationarity.
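The pairwise case can be checked directly from the advantage definition in Section 1. The following worked step is a sketch that assumes the population standard deviation and lets the stability constant $\epsilon$ go to zero; neither choice is stated in the source.

```latex
\bar r = \tfrac{1}{2}(r_1 + r_2), \qquad
\mathrm{std}(r_1, r_2) = \sqrt{\tfrac{1}{2}\bigl[(r_1 - \bar r)^2 + (r_2 - \bar r)^2\bigr]}
                       = \tfrac{1}{2}\lvert r_1 - r_2 \rvert,
\\[4pt]
A_1 = \frac{r_1 - \bar r}{\mathrm{std}(r_1, r_2)}
    = \frac{\tfrac{1}{2}(r_1 - r_2)}{\tfrac{1}{2}\lvert r_1 - r_2 \rvert}
    = \mathrm{sign}(r_1 - r_2), \qquad A_2 = -A_1 \qquad (r_1 \neq r_2).
```

Under these assumptions the two advantages carry only the sign of the reward difference, which is why a group of two behaves like a pairwise preference comparison.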

GRPO is also shown to implicitly induce a process reward model (PRM) by propagating group-normalized, prefix-level advantages across tree-structured process sets. This implicit structure can introduce a cardinality-weighted bias on repeated prefixes, which can be neutralized by dividing each prefix's contribution by the size of its process set (λ-GRPO) (Sullivan, 25 Sep 2025).

4. Extensions: Multi-Objective, Modular, and Robust Variants

GRPO has been adapted to several complex RL domains beyond standard LLM post-training:

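  • Multi-objective reward normalization (MO-GRPO): GRPO is vulnerable to reward hacking when optimizing multiple objectives whose rewards have different variances, because high-variance objectives dominate the summed reward. MO-GRPO standardizes each objective within the group before aggregation:

$$A_i = \sum_{k=1}^{K} \frac{r_i^{(k)} - \mathrm{mean}\bigl(r_1^{(k)},\dots,r_G^{(k)}\bigr)}{\mathrm{std}\bigl(r_1^{(k)},\dots,r_G^{(k)}\bigr)}$$

where $r_i^{(k)}$ is the reward of output $o_i$ under objective $k$ and $K$ is the number of objectives, ensuring each objective contributes equally and that the advantage is invariant under positive affine transformations of any single objective's reward (Ichihara et al., 26 Sep 2025).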

  • Multi-module grouping (mmGRPO): For modular programs with multiple distinct prompting modules, mmGRPO aligns and groups outputs per module and invocation order, applying GRPO-style updates at the subcomponent level. This enables joint training of complex language systems under global, final-output reward signals (Ziems et al., 6 Aug 2025).
  • Robust clipping and adaptive boundaries: Vanilla GRPO's symmetric clipping can leak unbounded updates in certain quadrants of ratio-advantage space, leading to premature convergence (entropy collapse) and over-suppression. Adaptive-boundary extensions (ABC-GRPO) introduce independent clipping thresholds per sign quadrant (Liu et al., 7 Jan 2026). KL3-based asymmetric clipping further refines update control by enforcing a low-variance, analytically-known per-sample constraint (Wu et al., 5 Feb 2026).
  • Difficulty-aware scaling (F-GRPO): To guard against missing rare-correct trajectories at feasible group sizes, F-GRPO applies a focal-loss-inspired scaling factor that downweights well-mastered prompts, improving diversity and pass@k without sacrificing single-shot performance (Plyusov et al., 6 Feb 2026).
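The per-objective standardization described in the MO-GRPO item above can be sketched as follows. The helper names, the small stabilizing `eps`, and the toy reward matrix are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Zero-mean, unit-variance scaling within one group (eps is an added safeguard)."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + eps)

def mo_advantages(reward_matrix):
    """Sum of per-objective standardized rewards; reward_matrix has shape (G, K)."""
    reward_matrix = np.asarray(reward_matrix, dtype=np.float64)
    columns = [standardize(reward_matrix[:, k]) for k in range(reward_matrix.shape[1])]
    return np.sum(columns, axis=0)

# Toy group: G = 4 outputs, K = 2 objectives on very different scales (made-up values).
rewards = np.array([[1.0, 10.0],
                    [0.0, 50.0],
                    [1.0, 30.0],
                    [0.0, 90.0]])

# Baseline for comparison: sum the raw rewards first, then normalize once.
naive = standardize(rewards.sum(axis=1))
balanced = mo_advantages(rewards)

# Rescale and shift the second objective only (a positive affine transformation).
rescaled = rewards * np.array([1.0, 100.0]) + np.array([0.0, 5.0])
naive_rescaled = standardize(rescaled.sum(axis=1))
balanced_rescaled = mo_advantages(rescaled)

print("sum-then-normalize changes:", not np.allclose(naive, naive_rescaled, atol=1e-3))
print("per-objective is invariant:", np.allclose(balanced, balanced_rescaled, atol=1e-6))
```

Rescaling one objective changes the sum-then-normalize advantages, because that objective's share of the total grows, while the per-objective version is unchanged up to the effect of `eps`.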

5. Domain-Generalization: Beyond Language to Representation Learning, Molecular Design, Control, and Social Games

GRPO's group-relative normalization generalizes beyond text generation. In vision and representation learning, Group Relative Policy Optimization for Representation Models (GRPO-RM) fixes the output group as the set of class labels and defines rewards that combine correctness (class accuracy) with a uniformity regularizer to balance alignment and spread (Xu et al., 19 Nov 2025). This enables the application of reinforcement learning post-training to vision backbones, with empirical gains in both classification (up to +4.26% SR) and segmentation (up to +0.6 mIoU).

In molecular design, GRPO enables fast amortized optimization of molecular graphs via variance-reducing group normalization with respect to heterogeneous input scaffolds (Javaid et al., 12 Feb 2026).

In continuous control, GRPO is extended via trajectory-based policy clustering and state-aware advantage normalization, providing a unified, critic-free, and regularized policy gradient framework for robotics (Khanda et al., 25 Jul 2025).

In multi-agent systems, the introduction of global cooperation constraints (GRPO-GCC) on top of group-normalized advantages promotes robust, stable, and sustainable collective behavior in spatial public goods games, outperforming Q-learning and baseline reinforcement strategies in both onset and resilience of cooperation (Yang et al., 7 Oct 2025).

6. Empirical Properties, Implementational Insights, and Limitations

Comprehensive evaluations across language, vision, molecular, control, and multi-agent domains reveal consistent gains in sample efficiency, accuracy, and stability for GRPO variants. Empirical scaling laws predict optimal group size; group normalization shows robust empirical variance reduction; and modular, multi-objective, and process-level refinements yield further gains in alignment and performance.

Limitations and design biases have been thoroughly investigated (Fontana et al., 8 Jan 2026). Notably, non-uniform group weighting can induce structural gradient biases (e.g., over short or shared prefixes). AdamW optimizer dynamics can render training insensitive to global reward scaling and allow trust-region overshoot via momentum. Uniform weighting and momentum-aware adjustments are practical remedies. Additionally, vanilla GRPO surrogate loss is not always a reliable proxy for true reward improvement, and large group sizes are not a fundamental necessity for stable optimization in the contrastive-learning regime (Wu et al., 1 Oct 2025, Zhou et al., 1 Mar 2026).
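The insensitivity to global reward scaling can be seen in a toy setting. The sketch below hand-rolls the standard Adam update for a scalar parameter (weight decay is omitted, since the decoupled decay in AdamW does not depend on the gradient) and feeds it a hypothetical gradient sequence at two scales; multiplying all rewards by a constant multiplies the policy gradient by the same constant, which is the case simulated here.

```python
import numpy as np

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-step parameter updates produced by Adam for a scalar parameter."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rng = np.random.default_rng(0)
grads = rng.normal(size=50)           # hypothetical gradient sequence

base = adam_updates(grads)
scaled = adam_updates(10.0 * grads)   # same gradients under a 10x reward scale

print("max |difference| in updates:", np.abs(base - scaled).max())
print("first-step update size:", abs(base[0]))
```

Because the update divides the first-moment estimate by the square root of the second-moment estimate, a common scale factor cancels whenever `eps` is negligible, so the two runs take nearly identical steps. The first step also has magnitude close to `lr` regardless of the gradient size, which hints at how step sizes are set by the optimizer rather than by the clipped objective.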

Collectively, the GRPO family establishes group-relative normalization combined with policy-gradient updates as a foundation for stable, high-performance RL for LLMs and beyond, supported by theoretical analysis and a growing set of targeted enhancements.
