
GRPO: Grouped Reward Policy Optimization

Updated 8 May 2026
  • Grouped Reward Policy Optimization is a reinforcement learning approach that samples a group of rollouts per prompt and assigns each rollout a group-normalized advantage, applied at the token level.
  • It updates the policy with PPO-style clipped importance ratios, which bound each step but leave training exposed to high-variance gradients and instability.
  • Extensions of GRPO enhance credit assignment, stability, and personalization across applications like reasoning, ASR, and multi-objective tasks.

Grouped Reward Policy Optimization (GRPO) is a critic-free, on-policy policy-gradient method used in reinforcement learning with verifiable rewards (RLVR), particularly for aligning LLMs on tasks demanding precise reasoning or structured output. GRPO proceeds by sampling groups of model rollouts per prompt, assigning standardized advantages relative to their within-group peer set, and then updating the policy with PPO-style clipped importance-ratio weighting at the token level. Its fine-grained credit assignment confers strong signal locality but also exposes the algorithm to high-variance gradients, frequent clipping, and potential training instabilities. GRPO has become foundational for post-training reinforcement learning in LLMs, and its limitations and extensions have motivated a series of methods with improved credit assignment, stability, and personalization.

1. GRPO Objective: Formal Definition and Implementation

Let $q$ denote a sampled prompt and $G$ the group size ($G$ responses per prompt). For each sampled response $o_i=(o_{i,1},\dots,o_{i,|o_i|})$, a scalar reward $r_i$ (from a rule-based or automatic verifier) is assigned. The key steps are:

  • Group-normalized advantage:

$$A_i = \frac{r_i - \frac{1}{G}\sum_{j=1}^G r_j}{\sqrt{\frac{1}{G}\sum_{j=1}^G \left(r_j - \frac{1}{G}\sum_{k=1}^G r_k\right)^2} + \epsilon}$$

where $\epsilon$ is a small constant for numerical stability.

  • Token-level importance ratio:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\rm old}}(o_{i,t}\mid q, o_{i,<t})}$$

  • PPO-style clipped surrogate loss:

$$\mathcal{L}_{\rm GRPO}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[ \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta) A_i,\, \operatorname{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon) A_i \right) \right]$$

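The clipping threshold $\epsilon$ in the surrogate is a separate hyperparameter from the stability constant in the advantage, despite the shared symbol. Optionally, a KL penalty to a reference policy can be included.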

The update is implemented by batch sampling prompts, generating GG rollouts per prompt with the current behavior policy, computing group-normalized advantages, calculating per-token importance ratios and their clipped versions, and aggregating the surrogate loss for gradient-based parameter update (Min et al., 9 Jan 2026).
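
The steps above can be sketched for a single prompt in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `group_advantages` and `grpo_surrogate`, the nested-list layout of the log-probabilities, and the defaults `clip_eps=0.2` and `adv_eps=1e-6` are all hypothetical choices made for the example. The KL penalty and the outer expectation over prompts are omitted, and every rollout is assumed to be non-empty.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, adv_eps=1e-6):
    """Standardize rewards within one group (population std, matching the 1/G formula)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + adv_eps) for r in rewards]


def grpo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2, adv_eps=1e-6):
    """Clipped surrogate for one prompt (the quantity inside the expectation).

    logp_new, logp_old: one list of per-token log-probabilities per rollout,
    under the current and the old policy respectively.
    rewards: one scalar reward per rollout.
    """
    advantages = group_advantages(rewards, adv_eps)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        token_terms = []
        for new, old in zip(lp_new, lp_old):
            ratio = math.exp(new - old)  # token-level importance ratio
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            token_terms.append(min(ratio * adv, clipped * adv))
        total += fmean(token_terms)  # 1/|o_i| average over tokens
    return total / len(rewards)  # 1/G average over the group


# Two single-token rollouts: the rewarded one has its ratio capped at 1 + clip_eps.
value = grpo_surrogate(
    logp_new=[[1.0], [0.0]],
    logp_old=[[0.0], [0.0]],
    rewards=[1.0, 0.0],
)
```

In an actual training loop the same quantity would be computed on tensors with automatic differentiation and negated before being handed to a minimizing optimizer, since the surrogate as written is maximized.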

2. Theoretical Properties, Objective Geometry, and Preference Aggregation

The theoretical structure of GRPO is shaped by group normalization and reverse-KL regularization. Key aspects (Vojnovic et al., 25 Feb 2025, Mroueh, 9 Mar 2025):

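  • Shift- and scale-invariance: Advantages are invariant under positive affine transformations $r \mapsto a r + b$ ($a > 0$) of the rewards within each group. The invariance is exact when the stabilizer $\epsilon$ is zero and approximate otherwise.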
  • Contrastive preference model: For binary rewards, the GRPO update can be rewritten as a KL-regularized contrastive loss with explicit weighting between positive and negative outcomes sampled from the old policy.
  • Stationary solution nonlinearity: The stationary policy induced by (clipping-free) GRPO is not given by exponential weights (logarithmic pooling, as in standard RLHF), but by a rational function in the group-relative advantage and the regularization parameter, yielding different aggregation behavior and preference amplification (Vojnovic et al., 25 Feb 2025, Mroueh, 9 Mar 2025).
  • Success amplification: Under minimal regularity (binary verifiable tasks), iterative application of GRPO provably increases the probability of success above that of the reference policy, converging to a fixed point (Mroueh, 9 Mar 2025).
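
The invariance property and the binary-reward case can be checked numerically. The sketch below is illustrative only: the helper `group_advantages` and the example reward vector are assumptions introduced here, and the stabilizer is set to zero so that the invariance holds exactly. For binary rewards with within-group success fraction $p$, the standardized advantages reduce to $\sqrt{(1-p)/p}$ for successes and $-\sqrt{p/(1-p)}$ for failures, which the last lines verify.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, eps=0.0):
    """Group-standardized advantages; eps=0 keeps the invariance exact."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of G = 8 binary rewards with 3 successes.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
base = group_advantages(rewards)

# Positive affine map r -> 3 r + 7 leaves the advantages unchanged.
transformed = group_advantages([3.0 * r + 7.0 for r in rewards])
invariant = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(base, transformed))

# Closed form for binary rewards with success fraction p.
p = sum(rewards) / len(rewards)
adv_success = math.sqrt((1.0 - p) / p)
adv_failure = -math.sqrt(p / (1.0 - p))
```

With a nonzero stabilizer, shifting the rewards still leaves the advantages unchanged, while rescaling changes them slightly because the constant no longer scales with the standard deviation.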

3. Practical Strengths and Limitations

Strengths

Limitations

  • High variance and instability: Token-level ratios may fluctuate wildly (especially for long sequences), leading to gradient noise, frequent clipping, and truncated updates (Min et al., 9 Jan 2026).
  • Uniform advantage within sequence: All tokens in a rollout share the same group-relative advantage; in chain-of-thought, this can cause sparse, noisy signal and poor token-level credit assignment (Lin et al., 14 Apr 2026, Lin et al., 10 Oct 2025).
  • Entropy collapse and mode degeneracy: Rapid reduction in policy entropy can produce excessively short and low-diversity outputs (Min et al., 9 Jan 2026).
  • Scaling and optimizer invariance issues: AdamW-based GRPO systems are approximately invariant to global reward scaling, diminishing the effect of tuning reward weights unless KL terms are used; length-based normalization and nonuniform group weighting can induce prefix bias (Fontana et al., 8 Jan 2026).
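
The optimizer-invariance point can be illustrated with a scalar Adam update. This is a toy sketch, not the analysis from the cited work: the function `adam_updates`, the gradient sequence, and the hyperparameter defaults are assumptions for illustration, and decoupled weight decay (the "W" in AdamW) is left out. It shows that multiplying every gradient by a constant, as a global rescaling of the training signal would, leaves the Adam step almost unchanged.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter updates Adam applies for a sequence of scalar gradients."""
    m = 0.0
    v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.5, -0.2, 0.1]                           # hypothetical gradient sequence
plain = adam_updates(grads)
scaled = adam_updates([100.0 * g for g in grads])  # same signal, 100x larger

max_rel_diff = max(abs(a - b) / abs(a) for a, b in zip(plain, scaled))
```

The residual difference comes only from the `eps` term in the denominator, which is why the invariance is approximate rather than exact.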

4. Algorithmic Extensions and Improvements

Multiple variants of the GRPO framework have been developed to address its fundamental weaknesses:

| Variant | Core Principle | Addressed Limitation | Source |
|---|---|---|---|
| GSPO | Sequence-level importance ratio, shared for all tokens | Reduces variance, aligns with outcome | (Min et al., 9 Jan 2026) |
| TEPO | Geometric mean of token-level IS, entropy-masked KL | Stabilizes token updates, prevents collapse | (Lin et al., 14 Apr 2026, Lin et al., 10 Oct 2025) |
| CW-GRPO | LLM-judged per-round process weighting | Fine-grained credit for process steps | (Wang et al., 15 Apr 2026) |
| GRPO-VPS | Process supervision via belief progression | Attenuates indiscriminate step credit | (Wang et al., 22 Apr 2026) |
| MO-GRPO | Per-objective variance normalization | Multi-objective “reward hacking” | (Ichihara et al., 26 Sep 2025) |
| MC-GRPO | Median baseline + MAD for small G | Stabilizes sign flips at low rollout | (Kim, 30 Jan 2026) |
| λ-GRPO | Process-set size normalization | Corrects process-step over/under-penalization | (Sullivan, 25 Sep 2025) |
| EP-GRPO | Entropy-gated, progress-aligned advantages | Resolves token granularity, polarity, variance collapse | (Yu et al., 6 May 2026) |
| Personalized GRPO | Per-group statistics for non-exchangeable preferences | Heterogeneous preference alignment | (Wang et al., 17 Feb 2026) |
| RC-GRPO | Reward-token conditioning to induce within-group variance | Restores update signal under flat rewards | (Zhong et al., 3 Feb 2026) |
| F-GRPO | Focal-loss difficulty scaling | Recovers diversity, avoids rare-mode amnesia | (Plyusov et al., 6 Feb 2026) |
| Pro-GRPO | Online expand-and-prune group selection | Maximizes reward spread, compute efficiency | (Ge et al., 17 Dec 2025) |

Algorithmic advances commonly focus on enriching the granularity of credit assignment (segment-wise, token-wise, process-wise), increasing training stability at small group sizes, or enhancing the expressivity of the optimization objective in multi-reward and heterogeneous-preference contexts. Empirical evaluations demonstrate consistent gains for these improvements over standard GRPO on mathematical, generative, search, translation, and tool-calling benchmarks, as well as in ASR (Lin et al., 14 Apr 2026, Wang et al., 15 Apr 2026, Ichihara et al., 26 Sep 2025, Yu et al., 6 May 2026, Kim, 30 Jan 2026, Wang et al., 17 Feb 2026, Ge et al., 17 Dec 2025, Plyusov et al., 6 Feb 2026, Shivakumar et al., 2 Sep 2025).
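
Two of the ideas listed above, a median baseline with MAD scaling and per-objective normalization, can be sketched generically. These are simplified illustrations of the general techniques and are not the formulations from the MC-GRPO or MO-GRPO papers: the function names, the example rewards, and the choice to sum the normalized objectives with equal weight are all assumptions.

```python
from statistics import fmean, median, pstdev


def standardize(values, eps=1e-6):
    """Mean/std standardization, as in the vanilla group-normalized advantage."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def median_mad_advantages(rewards, eps=1e-6):
    """Robust variant: center on the median, scale by the median absolute deviation."""
    med = median(rewards)
    mad = median([abs(r - med) for r in rewards])
    return [(r - med) / (mad + eps) for r in rewards]


def per_objective_advantages(reward_vectors, eps=1e-6):
    """Standardize each objective within the group, then sum across objectives."""
    columns = [list(col) for col in zip(*reward_vectors)]
    normalized = [standardize(col, eps) for col in columns]
    return [sum(vals) for vals in zip(*normalized)]


# Odd-sized group: the median rollout receives an advantage of exactly zero.
robust = median_mad_advantages([0.1, 0.4, 0.5, 0.9, 0.3])

# Two objectives on very different scales (hypothetical values).
reward_vectors = [(1.0, 10.0), (0.0, 30.0), (1.0, 20.0), (0.0, 40.0)]
combined = per_objective_advantages(reward_vectors)
naive = standardize([sum(vec) for vec in reward_vectors])
```

In this example the third rollout succeeds on the first objective but scores below average on the second. Summing raw rewards lets the large-scale objective dominate and gives it a negative advantage, whereas normalizing each objective first gives it a positive one.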

5. Implementation Details and Practical Considerations

Best practices and typical hyperparameter settings highlighted in the literature (Min et al., 9 Jan 2026, Kim, 30 Jan 2026, Pang et al., 4 Aug 2025) include:

  • Group size $G$: For stability, GG2–GG3 is common; MC-GRPO or careful regularization is recommended for GG4.
  • Clipping threshold: GG5 (symmetric) standard; adjust if excessive clipping or entropy collapse occurs.
  • Normalization: Always add a small $\epsilon$ to the standard deviation to avoid division by zero with homogeneous rewards.
  • Mini-batches: Mini-batch token splits with multiple epochs enhance stability.
  • Entropy or KL regularization: Useful to prevent premature collapse when not in vanilla RLVR.
  • Monitoring: Policy entropy and generated output length are sensitive collapse markers.
  • Learning rate: GG7 (Adam) often used, with warmup schedules in large-scale models.
  • For MC-GRPO, sample GG8 but only backpropagate through GG9 (excluding the median).
  • For multi-objective GRPO, normalize each objective separately (MO-GRPO, GDPO) to prevent reward hacking or collapse (Ichihara et al., 26 Sep 2025, Liu et al., 8 Jan 2026).
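
The monitoring advice above can be made concrete with a few per-batch diagnostics. The following is a minimal sketch with hypothetical helper names (`token_entropy`, `mean_length`, `clip_fraction`) and a hypothetical default threshold of 0.2. Note that the clip fraction here counts tokens whose ratio lies outside the clipping interval, a common proxy that ignores the sign of the advantage.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_length(rollouts):
    """Average number of tokens per rollout."""
    return sum(len(tokens) for tokens in rollouts) / len(rollouts)


def clip_fraction(logp_new, logp_old, clip_eps=0.2):
    """Fraction of tokens whose importance ratio falls outside [1 - eps, 1 + eps]."""
    outside = 0
    count = 0
    for lp_new, lp_old in zip(logp_new, logp_old):
        for new, old in zip(lp_new, lp_old):
            ratio = math.exp(new - old)
            outside += abs(ratio - 1.0) > clip_eps
            count += 1
    return outside / count


# Hypothetical batch: one rollout with two tokens, one of which has ratio 1.5.
frac = clip_fraction([[0.0, math.log(1.5)]], [[0.0, 0.0]])
uniform_entropy = token_entropy([0.5, 0.5])
peaked_entropy = token_entropy([1.0, 0.0])
```

Logged over training, a falling average entropy together with shrinking mean length or a rising clip fraction would be the kind of signal that prompts adjusting the clipping threshold or adding regularization.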

6. Applications and Empirical Performance

GRPO and its variants have been applied in multiple LLM-based domains:

7. Open Problems, Pitfalls, and Ongoing Directions

Despite its empirical impact, GRPO remains subject to:

  • Gradient and weighting pathologies: Non-uniform group weighting, optimizer invariance to reward scaling, and momentum-induced escape from the clipping region collectively introduce hidden bias into the surrogate update (Fontana et al., 8 Jan 2026).
  • Process-step imbalance: The latent process reward model of GRPO over- or under-weights shared prefixes depending on group overlap size, leading to inefficient exploration and exploitation. Normalizing the advantage by process-set size (as in λ-GRPO) corrects this bias with negligible cost (Sullivan, 25 Sep 2025).
  • Zero-variance and advantage-vanishing regimes: Discrete reward settings and peaked policies can result in high proportions of flat, zero-gradient updates. Techniques that induce artificial variance or reward diversity (RC-GRPO, F-GRPO, reward-variance increase at initialization (Yang et al., 29 May 2025)) mitigate this degenerate signal regime.
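
The zero-variance regime is easy to quantify. The sketch below uses hypothetical helpers (`flat_group_fraction`, `expected_flat_probability`, `group_advantages`) and assumes binary rewards drawn independently with success probability $p$, in which case a group of size $G$ is flat with probability $p^G + (1-p)^G$. A flat group yields all-zero advantages and therefore contributes no gradient.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-standardized advantages, as in the vanilla GRPO definition."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def flat_group_fraction(groups, tol=1e-12):
    """Fraction of groups whose rewards are all (numerically) identical."""
    flat = sum(1 for rewards in groups if pstdev(rewards) <= tol)
    return flat / len(groups)


def expected_flat_probability(p, group_size):
    """P(all rollouts agree) for i.i.d. binary rewards with success probability p."""
    return p ** group_size + (1.0 - p) ** group_size


# Hypothetical batch of three prompts with G = 4 rollouts each.
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1]]
observed = flat_group_fraction(groups)
flat_advantages = group_advantages(groups[0])

# A peaked policy (p = 0.9) with G = 8 wastes a large share of its groups.
wasted = expected_flat_probability(0.9, 8)
```

Under these assumptions, more than 40% of groups carry no learning signal at $p = 0.9$ and $G = 8$, which is the situation that variance-inducing techniques aim to repair.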

Current research seeks robust adaptive normalization strategies, per-token or per-step credit assignment, scalable process supervision, and optimal design of group splitting and pruning policies. Extensions for process feedback without final answer supervision (as in GRPO-VPS, CW-GRPO), and lightweight, learned process evaluators, are actively explored (Wang et al., 22 Apr 2026, Wang et al., 15 Apr 2026).


For a more detailed technical analysis, proofs of invariance and convergence rates, and comprehensive empirical baselines, see (Min et al., 9 Jan 2026, Vojnovic et al., 25 Feb 2025, Lin et al., 14 Apr 2026, Kim, 30 Jan 2026), and references therein.
