Group-Relative REINFORCE: Enhancing Policy Gradients

Updated 4 July 2026
  • The paper introduces a group-relative advantage method that replaces learned value critics with intra-group baseline comparisons.
  • It details a theoretical formulation with PPO-style clipping and robust variants like MC-GRPO and F-GRPO that address bias and failure modes.
  • Empirical applications in mathematical reasoning, multimodal generation, and credit assignment demonstrate improved efficiency and training stability.

Group-Relative REINFORCE denotes a family of policy-gradient methods in which a policy samples multiple rollouts for the same prompt or state, computes a within-group relative reward signal, and updates the policy without relying on a learned value critic. In contemporary LLM post-training, this family is most commonly instantiated as Group Relative Policy Optimization (GRPO): a PPO-style clipped surrogate whose core gradient is a REINFORCE update with a group-relative baseline, usually the group mean or a normalized variant. Recent work has recast the method as a broader framework encompassing on-policy and off-policy training, robust advantage estimators, divergence-based objectives, ranking-based reward shaping, and application-specific adaptations in reasoning, instruction following, multimodal generation, and long-video context selection (Fang et al., 23 May 2025, Yao et al., 29 Sep 2025).

1. Canonical formulation

Classical REINFORCE optimizes the expected return

$$J(\theta)=\mathbb E_{\tau\sim \pi_\theta}[R(\tau)],$$

with gradient

$$\nabla_\theta J(\theta)=\mathbb E_{\tau}\Bigl[R(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\Bigr].$$

Group-relative methods modify only the baseline construction. For a prompt $x$, the policy samples a group of $G$ or $K$ rollouts $y_1,\dots,y_G$, obtains scalar rewards $r_i=r(x,y_i)$, and replaces a learned critic by an intra-group baseline such as the group mean $\bar r=\frac1G\sum_i r_i$. The raw group-relative advantage is $A_i=r_i-\bar r$; many implementations further standardize it as

$$A_i=\frac{r_i-\bar r}{\sigma+\epsilon},\qquad \sigma=\sqrt{\frac1G\sum_j (r_j-\bar r)^2}.$$
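The two advantage definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's reference implementation; the function name `group_advantages` and the default `eps` value are assumptions made here.

```python
import numpy as np

def group_advantages(rewards, standardize=True, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts.

    rewards: scalar rewards r_1..r_G for rollouts of the same prompt.
    eps: illustrative stabilizer for the denominator (assumed value).
    """
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()          # raw advantage A_i = r_i - mean(r)
    if not standardize:
        return centered
    sigma = r.std()                  # population std (1/G), as in the formula
    return centered / (sigma + eps)  # standardized advantage

adv = group_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_advantages([1.0, 1.0, 1.0, 1.0])  # identical rewards: all zeros
```

Note that a group with identical rewards produces all-zero advantages, so such a group contributes no gradient.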

The resulting REINFORCE-style update is

$$\nabla_\theta J(\theta)\approx\frac1G\sum_{i=1}^{G} A_i\,\nabla_\theta \log \pi_\theta(y_i\mid x),$$

or, in tokenized autoregressive form, the same scalar $A_i$ is attached to every token of rollout $y_i$ (Fang et al., 23 May 2025, Yu et al., 6 May 2026).

In deployed GRPO systems this gradient is usually embedded in a PPO-style clipped surrogate. Writing

$$\rho_{i,t}(\theta)=\frac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,y_{i,<t})},$$

the objective takes the form

$$J_{\text{clip}}(\theta)=\mathbb E\Bigl[\frac1G\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\bigl(\rho_{i,t}A_i,\ \operatorname{clip}(\rho_{i,t},1-\epsilon_{\text{clip}},1+\epsilon_{\text{clip}})\,A_i\bigr)\Bigr],$$

often with an added KL penalty to a reference policy. This construction preserves the critic-free character of REINFORCE while borrowing the trust-region heuristic of PPO.
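As a rough illustration of the PPO-style clipped surrogate described above, the following NumPy sketch evaluates the objective for one group from per-token log-probabilities. It assumes a per-token probability ratio, length-normalized averaging within each rollout, and no KL term; the function name and the default `clip_eps=0.2` are illustrative choices, not values taken from the source.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate value for one group (to be maximized).

    logp_new, logp_old: lists of 1-D arrays of per-token log-probs,
        one array per rollout, under the current and the old policy.
    advantages: one scalar group-relative advantage per rollout; the
        same scalar is applied to every token of that rollout.
    """
    per_rollout = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        # Pessimistic bound: take the smaller of the two terms per token.
        per_rollout.append(np.minimum(unclipped, clipped).mean())
    return float(np.mean(per_rollout))
```

With a ratio of 2 and a positive advantage the term is capped at `(1 + clip_eps) * adv`, whereas with a negative advantage the unclipped (more negative) term is kept, which is the usual asymmetry of the `min` construction.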

2. Surrogate interpretation and off-policy reinterpretation

A central theoretical development is the observation that group-relative REINFORCE can be derived without assuming that sampled rollouts come from the current policy. For an arbitrary behavior distribution […], one may consider the KL-regularized surrogate

[…]

whose exact optimum satisfies

[…]

Enforcing this condition on a finite group through a pairwise consistency loss and taking one gradient step at […] yields

[…]

which is the familiar group-relative REINFORCE gradient. Because the derivation does not require $y_i\sim\pi_\theta$, the method admits a native off-policy interpretation (Yao et al., 29 Sep 2025).

From this view, two design principles follow. First, policy updates should be regularized, for example through clipping, KL penalties, or squared log-ratio penalties. Second, the data distribution can be actively shaped through weighting, sample dropping, replay reuse, stale-policy rollouts, or expert data. This framework also addresses a common misconception about importance sampling: ablations on GSM8K and ToolACE reported that removing importance sampling entirely while retaining clipping achieves near-identical performance to full GRPO, and that widening the clipping range accelerates convergence without the collapse seen in vanilla REINFORCE. The same analysis reinterprets Online Policy Mirror Descent and Asymmetric REINFORCE as regularized forms of the REINFORCE loss rather than unrelated heuristics (Yao et al., 29 Sep 2025).

3. Statistical structure and failure modes

The apparent simplicity of the estimator conceals several nontrivial pathologies. Let $p$ denote the expected success rate for a prompt and $\hat p$ the empirical group mean. The true advantage is $r_i-p$, whereas the group-relative estimate is $r_i-\hat p$. Conditioned on non-degenerate groups, the estimator is biased: it underestimates advantages for hard prompts with $p<\tfrac12$, overestimates them for easy prompts with $p>\tfrac12$, and is unbiased only at $p=\tfrac12$. For typical small group sizes […], the probability of underestimation for hard prompts or overestimation for easy prompts exceeds […], and it exceeds […] in more extreme regimes (Yang et al., 13 Jan 2026).
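The direction of this bias can be checked exactly for binary rewards. The sketch below assumes independent Bernoulli rewards with success probability `p` and treats a group as non-degenerate when it contains at least one success and one failure; the function name is hypothetical. It computes the expected group-mean baseline conditioned on non-degeneracy, which can then be compared with the true success rate.

```python
from math import comb

def conditional_baseline(p, group_size):
    """E[group mean | group has both successes and failures].

    Assumes i.i.d. Bernoulli(p) rewards; sums over success counts
    k = 1 .. G-1, i.e. only the non-degenerate groups.
    """
    num = 0.0
    den = 0.0
    for k in range(1, group_size):
        weight = comb(group_size, k) * p**k * (1 - p) ** (group_size - k)
        num += weight * k / group_size
        den += weight
    return num / den

hard = conditional_baseline(0.2, 4)   # baseline above 0.2: advantage underestimated
even = conditional_baseline(0.5, 4)   # baseline equals 0.5: no bias
easy = conditional_baseline(0.8, 4)   # baseline below 0.8: advantage overestimated
```

Because the advantage is the reward minus the baseline, a baseline that is too high on hard prompts translates into underestimated advantages, and the reverse holds on easy prompts.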

A second issue is rare-mode omission. If $\tau$ denotes the policy mass on a rare-correct subtype and $p$ the overall success probability, then the probability that a group update is active (the group contains both correct and incorrect rollouts) while sampling no rare-correct trajectory is

$$P_{\text{miss}}(G)=(1-\tau)^G-(1-p)^G-(p-\tau)^G.$$

This quantity is non-monotonic in group size. Under a one-step TRPO-style surrogate analysis, unsampled-but-correct mass […] can shrink even while total correct mass grows, implying that learning may concentrate probability on already common solutions while forgetting rare-correct ones (Plyusov et al., 6 Feb 2026).
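The non-monotonic dependence on group size can be illustrated numerically. The sketch below assumes that an "active" group is one containing both correct and incorrect rollouts, that rollouts are independent, and it uses the hypothetical names `tau` (rare-correct mass) and `p` (overall success probability). Under those assumptions the event "no rare-correct sample" has probability `(1 - tau)**G`, from which the all-incorrect case `(1 - p)**G` and the all-common-correct case `(p - tau)**G` are subtracted.

```python
def miss_probability(tau, p, group_size):
    """P(group is mixed AND contains no rare-correct rollout).

    Assumption: rollouts are i.i.d.; each is rare-correct with
    probability tau, common-correct with p - tau, incorrect with 1 - p.
    """
    no_rare = (1 - tau) ** group_size
    all_incorrect = (1 - p) ** group_size
    all_common_correct = (p - tau) ** group_size
    return no_rare - all_incorrect - all_common_correct

# Illustrative values only: tau = 0.05, p = 0.3.
curve = {g: miss_probability(0.05, 0.3, g) for g in (1, 2, 4, 8, 16, 64)}
```

For these illustrative values the probability is zero at a group size of 1, rises through intermediate group sizes, and decays again for large groups.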

A third line of analysis concerns what one paper terms Group Relative Advantage Estimation. Standardized group advantages satisfy

$$\sum_{i=1}^{G} A_i=0,$$

and, for binary rewards, the total positive and negative absolute weights are exactly matched. If $\hat p$ is the fraction of correct rollouts in the group (and $\epsilon=0$), then

$$\sum_{i=1}^{G}|A_i|=2G\sqrt{\hat p\,(1-\hat p)},$$

which is symmetric under $\hat p\to 1-\hat p$ and maximized at $\hat p=\tfrac12$. This induces two limitations: unsampled logits are unchanged, so the update cannot actively push probability mass toward unseen correct paths, and the method implicitly prioritizes medium-difficulty groups rather than adapting difficulty focus over training (Yu et al., 5 Feb 2026).
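The balance between positive and negative weights for binary rewards can be verified directly. The sketch below assumes population-standard-deviation normalization with no stabilizing constant; `p_hat` denotes the fraction of correct rollouts in the group, and the closed form `G * sqrt(p_hat * (1 - p_hat))` per sign is a short derivation under those assumptions rather than a quotation from the cited paper.

```python
import numpy as np

def signed_weight_totals(num_correct, group_size):
    """Total positive and negative standardized advantage mass for a
    group of binary rewards with `num_correct` ones."""
    r = np.zeros(group_size)
    r[:num_correct] = 1.0
    adv = (r - r.mean()) / r.std()   # requires a mixed group (std > 0)
    return adv[adv > 0].sum(), -adv[adv < 0].sum()

pos, neg = signed_weight_totals(2, 8)          # p_hat = 0.25
closed_form = 8 * np.sqrt(0.25 * 0.75)         # G * sqrt(p_hat * (1 - p_hat))
mirror_pos, _ = signed_weight_totals(6, 8)     # p_hat = 0.75, same total
middle_pos, _ = signed_weight_totals(4, 8)     # p_hat = 0.5, largest total
```

The total weight depends on the group only through `p_hat * (1 - p_hat)`, which is why groups of medium difficulty receive the largest updates.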

Further failure modes arise in small-rollout or low-dispersion regimes. Noise in the shared mean baseline can cause advantage sign flips; relative to a […] oracle baseline, sign-flip rates rise to […]–[…] when […], and a causal injection experiment shows that even a […] sign-flip rate can degrade final accuracy by […] (Kim, 30 Jan 2026). In discrete multi-constraint settings, z-score normalization additionally exhibits low-variance amplification, mean-centering blindness, and zero-variance collapse: if all group rewards are identical, $A_i=0$ for every $i$ and the policy gradient vanishes (Salmani-Zarchi et al., 4 Jun 2026).

4. Advantage redesign and robust variants

Proposed remedies largely preserve the group-relative backbone while modifying the baseline, reward shaping, or weighting function. Several representative variants are summarized below.

Variant | Core modification | Reported effect
MC-GRPO | Median baseline, MAD scaling, […] samples, drop zero-advantage pivot | Reduces the gap between […] and […] to within […]
F-GRPO | Focal weight […] on group-relative advantages | Pass@256 gains of […]–[…] points at […]
HA-DW | History-aware weight […] | Average gains of […] to […] on Qwen3-4B/8B GRPO
A-GRAE | Dynamic difficulty shift and asymmetric suppression of positive advantages | Improves GRPO and variants across seven benchmarks
RLRR | Rank-based PRR/HRR relative rewards with Ranking Reward Model | […]–[…] pp over absolute-score baselines
MDP-GRPO | Multi-temperature sampling, dual-anchor advantages, prospect shaping, asymmetric KL | Up to […] strict-constraint improvement

MC-GRPO replaces the mean baseline by the sample median

$$b=\operatorname{median}\{r_i\},$$

uses the median absolute deviation for scaling, and excludes the unique pivot sample with zero advantage from backpropagation. The stated motivation is robustness to outliers and baseline-induced sign flips under small rollout budgets (Kim, 30 Jan 2026). F-GRPO instead retains mean-based group-relative structure but multiplies the advantage by a Focal-loss–inspired coefficient […], down-weighting prompts with high empirical success and thereby counteracting concentration on already easy cases (Plyusov et al., 6 Feb 2026). HA-DW introduces an evolving difficulty anchor […] and reweights each $A_i$ by a factor that amplifies under-estimated advantages on hard prompts and suppresses over-exploited actions on easy prompts (Yang et al., 13 Jan 2026).
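A median-and-MAD advantage of the kind described for MC-GRPO might look roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: the function name, the `eps` stabilizer, and the rule of masking every exactly-zero advantage (rather than a single designated pivot) are choices made here for illustration.

```python
import numpy as np

def median_mad_advantages(rewards, eps=1e-6):
    """Median-centered, MAD-scaled advantages plus a keep-mask.

    Assumption: with an odd group size the median is one of the
    samples, so that sample gets zero advantage and is masked out.
    """
    r = np.asarray(rewards, dtype=float)
    baseline = np.median(r)                   # robust replacement for the mean
    mad = np.median(np.abs(r - baseline))     # median absolute deviation
    adv = (r - baseline) / (mad + eps)
    keep = adv != 0.0                         # drop zero-advantage samples
    return adv, keep

# One outlier reward (5.0) barely moves the median baseline.
adv, keep = median_mad_advantages([0.1, 0.4, 0.5, 0.6, 5.0])
```

In this example the mean baseline would be 1.32, which would assign negative advantages to four of the five rollouts, whereas the median baseline of 0.5 keeps the signs of the ordinary samples balanced.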

Other proposals explicitly break symmetry or move from absolute scores to relative orderings. A-GRAE combines a sample-level dynamic difficulty shift with group-level asymmetric suppression of positive advantages, motivated by the claim that asymmetrically suppressing correct trajectories encourages exploration and that learning efficiency is maximized by an easy-to-hard transition over training (Yu et al., 5 Feb 2026). RLRR replaces absolute numerical rewards by relative rankings, either via Pure Relative Reward,

[…]

or Hybrid Relative Reward,

[…]

and supplements this with a listwise Ranking Reward Model that predicts intra-group orderings directly (Niu et al., 30 Jan 2026). MDP-GRPO addresses discrete low-dispersion rewards through multi-temperature sampling, a mixed advantage

[…]

prospect-theoretic shaping, and asymmetric KL penalties (Salmani-Zarchi et al., 4 Jun 2026).

Two broader generalizations alter the scalar baseline paradigm itself. LambdaPO defines a pairwise preference advantage

[…]

and augments sparse binary rewards with a semantic density reward derived from the ROUGE-L F-measure against a ground-truth solution trace (Yuan et al., 19 May 2026). f-GRPO replaces the standard ratio-times-advantage term with a variational $f$-divergence estimator between above-average and below-average reward distributions, yielding a class of divergence-based on-policy objectives with theoretical average-reward improvement guarantees (Haldar et al., 5 Feb 2026).

5. Credit assignment, geometry, and rollout efficiency

Another strand of work argues that the main deficiency is not only the baseline but the granularity of credit assignment. EP-GRPO identifies three failures in vanilla GRPO: uniform token-level granularity, uniform polarity, and zero-variance collapse. Empirically, randomly replacing the top […] highest-entropy tokens in correct solutions caused a […] larger accuracy drop than perturbing the bottom […]; on 500 MATH problems, […] of locally incorrect steps in correct sequences were rewarded and […] of locally correct steps in incorrect sequences were penalized; and in a standard run on MATH with […], […] of training steps had zero reward variance, wasting over […] million token-updates. EP-GRPO adds entropy-gated outcome weighting, implicit process signals anchored by sequence outcome, and cumulative entropy mapping, combining them in a final token-level advantage

[…]

The method is explicitly designed to maintain gradient flow even when sequence-level reward variance collapses (Yu et al., 6 May 2026).

AERO addresses a different failure: wasted rollout compute when fixed-size groups are homogeneous. It partitions prompts after an exploration phase into rescue, partial, and high-success subsets, performs iterative rescue sampling for hard prompts, rejection-based pruning for mixed groups, and Bayesian posterior stabilization for all-correct or all-incorrect groups. With a […] prior, the posterior mean

[…]

replaces the degenerate empirical success rate in homogeneous groups, yielding nonzero standardized advantages. Under the same total rollout budget, AERO reports about […] lower total training compute and about […] lower wall-clock time per step on average while matching or improving Pass@8 and Avg@8 over GRPO (Zhang et al., 15 Feb 2026).
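To make the idea of posterior stabilization concrete, the sketch below assumes a Beta prior on the per-prompt success rate with binary rewards, which is a natural conjugate choice but is not confirmed by the text above. The prior parameters `alpha = beta = 1`, the function names, and the use of a Bernoulli standard deviation built from the posterior mean are all illustrative assumptions.

```python
from math import sqrt

def posterior_success_rate(successes, group_size, alpha=1.0, beta=1.0):
    """Posterior mean of a Beta(alpha, beta) prior after observing
    `successes` correct rollouts out of `group_size` (assumed model)."""
    return (alpha + successes) / (alpha + beta + group_size)

def smoothed_advantage(reward, p_tilde):
    """Standardized advantage using the smoothed rate as the baseline
    and the matching Bernoulli standard deviation as the scale."""
    return (reward - p_tilde) / sqrt(p_tilde * (1.0 - p_tilde))

# An all-correct group of 8: the empirical rate is 1.0 (zero variance),
# but the smoothed rate stays strictly inside (0, 1).
p_tilde = posterior_success_rate(8, 8)
adv_correct = smoothed_advantage(1.0, p_tilde)
```

Because the smoothed rate never reaches 0 or 1, the denominator stays positive and a homogeneous group still yields a small nonzero advantage.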

SALT analyzes why simply increasing the number of rollouts often fails to strengthen learning. The stated diagnosis is that per-rollout policy-gradient features concentrate into a low-rank, signed geometry, so group-relative coefficients cancel in a dominant shared subspace. SALT measures this with the participation ratio of the Gram matrix and an effective sample-size statistic

[…]

It then estimates a dominant subspace, decomposes coefficients into shared and residual channels, and reweights them as

[…]

before plugging them back into the surrogate. Reported gains are […]–[…] percentage points across reasoning benchmarks, together with […]–[…] higher participation ratio and […]–[…] larger […] (Chang et al., 4 Jun 2026).
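The participation ratio mentioned above has a standard definition for a positive semidefinite Gram matrix: the squared sum of eigenvalues divided by the sum of squared eigenvalues. The sketch below uses that textbook definition on a hypothetical matrix of per-rollout feature vectors; it does not reproduce SALT's effective sample-size statistic or its reweighting, whose exact forms are not recoverable from the text.

```python
import numpy as np

def participation_ratio(features):
    """Participation ratio of the Gram matrix K = F F^T.

    features: array of shape (num_rollouts, dim), one feature vector
        per rollout (a stand-in for per-rollout gradient features).
    Uses tr(K)^2 / tr(K^2); for symmetric K, tr(K^2) equals the
    squared Frobenius norm, so no eigendecomposition is needed.
    """
    F = np.asarray(features, dtype=float)
    gram = F @ F.T
    return float(np.trace(gram) ** 2 / np.sum(gram**2))

collapsed = participation_ratio(np.ones((4, 3)))   # all rollouts share one direction
spread = participation_ratio(np.eye(4))            # four orthogonal directions
```

A value near 1 indicates that the rollouts' features lie almost entirely along one shared direction, while a value near the number of rollouts indicates well-spread directions.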

6. Applications and empirical landscape

Much of the recent literature evaluates group-relative REINFORCE on verifiable mathematical reasoning, including GSM8K, MATH-500, AIME24, AIME25, AMC23, Minerva, OlympiadBench, GPQA, and multimodal math benchmarks. On Qwen2.5-7B with […], F-GRPO improved pass@256 from […] for GRPO, […] for DAPO, and […] for CISPO, while preserving or improving pass@1 and without increasing compute (Plyusov et al., 6 Feb 2026). A-GRAE reported consistent gains across seven text and vision-language benchmarks, including Geo3K, MathVision, MathVerse, and HuatuoGPT-Vision (Yu et al., 5 Feb 2026). MDP-GRPO was evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, with up to […] improvement in strict constraint satisfaction on Llama-3.2-3B while preserving MMLU and ARC performance (Salmani-Zarchi et al., 4 Jun 2026).

The framework is not limited to text reasoning. InfLVG applies GRPO to inference-time context selection for long video generation. A lightweight policy […] scores past tokens, Plackett–Luce sampling selects the top-$K$ context tokens without replacement, and a hybrid reward

[…]

combines identity preservation, prompt alignment, and artifact suppression. With group size […], clipping […], and top-$K$ selection set to […], the method reports video-length extension by up to […], together with the introduction of the Cross-scene Video Benchmark and Event Prompt Set (Fang et al., 23 May 2025).
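Plackett–Luce sampling without replacement, as used for the context selection described above, can be sketched with the Gumbel top-k trick: adding independent Gumbel noise to the scores and taking the k largest is equivalent to drawing items one at a time from a softmax over the remaining items. The function names and the example scores are hypothetical; the log-probability helper shows the quantity a policy-gradient update would differentiate.

```python
import numpy as np

def plackett_luce_top_k(scores, k, rng):
    """Sample an ordered set of k distinct indices under Plackett-Luce
    with logits `scores`, via Gumbel-perturbed top-k."""
    s = np.asarray(scores, dtype=float)
    perturbed = s + rng.gumbel(size=s.shape)
    return np.argsort(-perturbed)[:k]

def plackett_luce_log_prob(scores, selected):
    """Log-probability of an ordered selection: at each step, softmax
    over the items that have not been chosen yet."""
    s = np.asarray(scores, dtype=float)
    remaining = np.ones(len(s), dtype=bool)
    logp = 0.0
    for idx in selected:
        logp += s[idx] - np.logaddexp.reduce(s[remaining])
        remaining[idx] = False
    return float(logp)

rng = np.random.default_rng(0)
scores = [0.1, 2.0, -1.0, 0.5, 0.0]          # illustrative token scores
chosen = plackett_luce_top_k(scores, 3, rng)
logp = plackett_luce_log_prob(scores, chosen)
```

In a group-relative setup, several such selections would be sampled for the same context, each scored by the reward, and `logp` would play the role of the rollout log-probability in the update.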

Open-ended generation motivates a different use of the same group-based machinery. RLRR evaluates writing tasks across Academic, Finance, Politics, Literature, Education, and Advertising, replacing unstable scalar reward-model outputs by bounded relative rankings. Reported gains are […]–[…] percentage points in WritingBench domains, and the fine-tuned Ranking Reward Model achieves the best scores in […] domains (Niu et al., 30 Jan 2026). Code-generation evaluations also appear in the SALT study through MBPP with HumanEval tests, indicating that the group-relative update rule and its geometric pathologies extend beyond mathematical QA (Chang et al., 4 Jun 2026).

Taken together, the literature presents Group-Relative REINFORCE not as a single frozen algorithm but as a family of critic-free or critic-light policy-gradient procedures organized around one core operation: compare multiple rollouts for the same prompt, transform those comparisons into relative advantages, and apply a policy-gradient update under clipping or regularization. The main research frontier has therefore shifted from defining the estimator to controlling its bias, symmetry, granularity, exploration behavior, reward representation, and compute efficiency.
