Group-Relative REINFORCE: Enhancing Policy Gradients

Updated 4 July 2026

The paper introduces a group-relative advantage method that replaces learned value critics with intra-group baseline comparisons.
It details a theoretical formulation with PPO-style clipping and robust variants like MC-GRPO and F-GRPO that address bias and failure modes.
Empirical applications in mathematical reasoning, multimodal generation, and credit assignment demonstrate improved efficiency and training stability.

Group-Relative REINFORCE denotes a family of policy-gradient methods in which a policy samples multiple rollouts for the same prompt or state, computes a within-group relative reward signal, and updates the policy without relying on a learned value critic. In contemporary LLM post-training, this family is most commonly instantiated as Group Relative Policy Optimization (GRPO): a PPO-style clipped surrogate whose core gradient is a REINFORCE update with a group-relative baseline, usually the group mean or a normalized variant. Recent work has recast the method as a broader framework encompassing on-policy and off-policy training, robust advantage estimators, divergence-based objectives, ranking-based reward shaping, and application-specific adaptations in reasoning, instruction following, multimodal generation, and long-video context selection (Fang et al., 23 May 2025, Yao et al., 29 Sep 2025).

1. Canonical formulation

Classical REINFORCE optimizes the expected return

$J(\theta)=\mathbb E_{\tau\sim \pi_\theta}[R(\tau)],$

with gradient

$\nabla_\theta J(\theta)=\mathbb E_{\tau}\Bigl[R(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\Bigr].$

Group-relative methods modify only the baseline construction. For a prompt $x$, the policy samples a group of $G$ rollouts $y_1,\dots,y_G$ (some papers write $K$ for the group size), obtains scalar rewards $r_i=r(x,y_i)$, and replaces a learned critic with an intra-group baseline such as the group mean $\bar r=\frac1G\sum_i r_i$. The raw group-relative advantage is $A_i=r_i-\bar r$; many implementations further standardize it as

$A_i=\frac{r_i-\bar r}{\sigma+\epsilon},\qquad \sigma=\sqrt{\frac1G\sum_j (r_j-\bar r)^2}.$

The resulting REINFORCE-style update is

$\nabla_\theta J(\theta)\approx\frac1G\sum_{i=1}^{G}A_i\,\nabla_\theta\log\pi_\theta(y_i\mid x),$

or, in tokenized autoregressive form, since $\log\pi_\theta(y_i\mid x)=\sum_t\log\pi_\theta(y_{i,t}\mid x,y_{i,<t})$, the same scalar $A_i$ is attached to every token of rollout $y_i$ (Fang et al., 23 May 2025, Yu et al., 6 May 2026).
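As an illustration of the baseline construction, the following minimal sketch computes raw and standardized group-relative advantages for one prompt. The function name `group_advantages` and the default `eps` are assumptions made for this example, not names taken from any cited implementation.

```python
import math

def group_advantages(rewards, eps=1e-6, standardize=True):
    """Group-relative advantages for the rollouts of a single prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    centered = [r - mean for r in rewards]  # raw advantage r_i - mean
    if not standardize:
        return centered
    # Population standard deviation (divide by G), as in the formula above.
    sigma = math.sqrt(sum(c * c for c in centered) / g)
    return [c / (sigma + eps) for c in centered]

if __name__ == "__main__":
    # Two correct and two incorrect rollouts under a binary reward.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A homogeneous group: every advantage is zero, so the update vanishes.
    print(group_advantages([1.0, 1.0, 1.0]))
```

Each returned scalar would then be broadcast to all tokens of the corresponding rollout when the log-probability gradient is accumulated.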

In deployed GRPO systems this gradient is usually embedded in a PPO-style clipped surrogate. Writing

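$\rho_{i,t}(\theta)=\frac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,y_{i,<t})},$

the objective takes the form

$J_{\mathrm{GRPO}}(\theta)=\mathbb E\Bigl[\frac1G\sum_{i=1}^{G}\frac1{|y_i|}\sum_{t=1}^{|y_i|}\min\bigl(\rho_{i,t}(\theta)\,A_i,\ \operatorname{clip}(\rho_{i,t}(\theta),1-\varepsilon_{\mathrm{clip}},1+\varepsilon_{\mathrm{clip}})\,A_i\bigr)\Bigr],$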

often with an added KL penalty to a reference policy. This construction preserves the critic-free character of REINFORCE while borrowing the trust-region heuristic of PPO.
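A PPO-style clipped surrogate of this kind can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: per-token log-probabilities are supplied as plain lists, each rollout is length-normalized, the KL penalty is omitted, and the default `clip_eps=0.2` is an illustrative value rather than one reported in the cited papers.

```python
import math

def grpo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Average clipped surrogate over one group of rollouts.

    logp_new, logp_old: one list of per-token log-probs per rollout.
    advantages: one scalar per rollout, shared by all of its tokens.
    """
    total = 0.0
    for new_i, old_i, adv in zip(logp_new, logp_old, advantages):
        seq = 0.0
        for lp_new, lp_old in zip(new_i, old_i):
            ratio = math.exp(lp_new - lp_old)  # per-token probability ratio
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            # Pessimistic choice between the unclipped and clipped terms.
            seq += min(ratio * adv, clipped * adv)
        total += seq / len(new_i)  # length-normalize each rollout
    return total / len(advantages)

if __name__ == "__main__":
    old = [[-1.0, -2.0], [-0.5, -0.7, -0.9]]
    new = [[-0.9, -1.5], [-0.6, -0.7, -1.4]]
    print(grpo_clipped_surrogate(new, old, advantages=[1.0, -1.0]))
```

With a positive advantage the ratio stops contributing once it exceeds `1 + clip_eps`; with a negative advantage the unclipped term is kept when the ratio grows, which is the asymmetry that gives the surrogate its trust-region flavor.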

2. Surrogate interpretation and off-policy reinterpretation

A central theoretical development is the observation that group-relative REINFORCE can be derived without assuming that sampled rollouts come from the current policy. For rollouts drawn from an arbitrary behavior distribution $\pi_b$, one may consider the KL-regularized surrogate

$\max_\theta\ \mathbb E_{y\sim\pi_\theta(\cdot\mid x)}\bigl[r(x,y)\bigr]-\beta\,D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\theta_t}(\cdot\mid x)\bigr),$

whose exact optimum satisfies

$\pi^\star(y\mid x)\propto\pi_{\theta_t}(y\mid x)\exp\bigl(r(x,y)/\beta\bigr),$

so that $r(x,y)-\beta\bigl(\log\pi^\star(y\mid x)-\log\pi_{\theta_t}(y\mid x)\bigr)$ takes the same value for every response $y$. Enforcing this condition on a finite group through a pairwise consistency loss and taking one gradient step at $\theta=\theta_t$ yields

$\Delta\theta\ \propto\ \sum_{i=1}^{G}(r_i-\bar r)\,\nabla_\theta\log\pi_\theta(y_i\mid x)\Big|_{\theta=\theta_t},$

which is the familiar group-relative REINFORCE gradient. Because the derivation does not require $y_i\sim\pi_{\theta_t}$, the method admits a native off-policy interpretation (Yao et al., 29 Sep 2025).
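The step from a pairwise consistency loss to the group-relative gradient can be checked by hand. The derivation below is a sketch under assumptions: the loss is taken to be a sum of squared pairwise differences, $\beta$ denotes the KL coefficient, $\theta_t$ the anchor parameters, and $g_i$ the score function of rollout $y_i$. The cited paper may use a different normalization.

```latex
% Assumed notation: a_i(\theta) is the quantity that should be constant across responses.
a_i(\theta) = r_i - \beta\bigl(\log\pi_\theta(y_i\mid x) - \log\pi_{\theta_t}(y_i\mid x)\bigr),
\qquad
L(\theta) = \tfrac12 \sum_{i<j} \bigl(a_i(\theta) - a_j(\theta)\bigr)^2 .

% At \theta = \theta_t the log-ratio vanishes, so a_i = r_i and \nabla_\theta a_i = -\beta g_i,
% with g_i = \nabla_\theta \log\pi_\theta(y_i\mid x)\big|_{\theta=\theta_t}.
\nabla_\theta L(\theta_t)
  = -\beta \sum_{i<j} (r_i - r_j)(g_i - g_j)
  = -\beta\, G \sum_{i=1}^{G} (r_i - \bar r)\, g_i ,

% using the identity
\sum_{i<j} (r_i - r_j)(g_i - g_j)
  = G\sum_i r_i g_i - \Bigl(\sum_i r_i\Bigr)\Bigl(\sum_j g_j\Bigr)
  = G \sum_i (r_i - \bar r)\, g_i .
```

A gradient-descent step on $L$ therefore moves along $\sum_i (r_i-\bar r)\,g_i$, and no step of the calculation uses the distribution from which the $y_i$ were drawn.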

From this view, two design principles follow. First, policy updates should be regularized, for example through clipping, KL penalties, or squared log-ratio penalties. Second, the data distribution can be actively shaped through weighting, sample dropping, replay reuse, stale-policy rollouts, or expert data. This framework also reframes a common misconception about importance sampling: ablations on GSM8K and ToolACE reported that removing importance sampling entirely while retaining clipping achieved near-identical performance to full GRPO, and that widening the clipping range accelerated convergence without the collapse seen in vanilla REINFORCE. The same analysis reinterprets Online Policy Mirror Descent and Asymmetric REINFORCE as regularized forms of the REINFORCE loss rather than unrelated heuristics (Yao et al., 29 Sep 2025).

3. Statistical structure and failure modes

The apparent simplicity of the estimator conceals several nontrivial pathologies. Let $p$ denote the expected success rate for a prompt and $\hat p$ the empirical group mean. The true advantage is $r_i-p$, whereas the group-relative estimate is $r_i-\hat p$. Conditioned on non-degenerate groups, the estimator is biased: it underestimates advantages for hard prompts with $p<\tfrac12$, overestimates them for easy prompts with $p>\tfrac12$, and is unbiased only at $p=\tfrac12$. For typical small group sizes, the analysis gives a lower bound on the probability of underestimation for hard prompts or overestimation for easy prompts, and a stronger bound in more extreme regimes (Yang et al., 13 Jan 2026).
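The direction of this bias can be checked exactly for binary rewards. The sketch below assumes independent Bernoulli rewards with success probability `p` and computes the expected group-mean baseline conditioned on the group containing both outcomes. The function name is hypothetical, and the calculation looks only at the baseline, not at the full per-sample advantage.

```python
from math import comb

def conditional_mean_success(p, g):
    """E[p_hat | group is non-degenerate] for Bernoulli(p) rewards, group size g >= 2."""
    num = 0.0
    den = 0.0
    for k in range(1, g):  # skip k = 0 and k = g, where the update vanishes
        prob = comb(g, k) * p**k * (1 - p) ** (g - k)
        num += prob * (k / g)
        den += prob
    return num / den

if __name__ == "__main__":
    for p in (0.1, 0.2, 0.5, 0.8, 0.9):
        print(p, round(conditional_mean_success(p, g=4), 4))
```

For a hard prompt the conditional baseline sits above `p`, so `r_i - p_hat` tends to fall below `r_i - p`; for an easy prompt the effect reverses, and at `p = 0.5` the two coincide by symmetry.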

A second issue is rare-mode omission. If $q$ denotes the policy mass on a rare-correct subtype and $p$ the overall success probability, then the probability that a group update is active (the group contains both correct and incorrect rollouts) while sampling no rare-correct trajectory is

$P_{\mathrm{omit}}(G)=(1-q)^G-(1-p)^G-(p-q)^G.$

This quantity is non-monotonic in group size. Under a one-step TRPO-style surrogate analysis, the unsampled-but-correct mass can shrink even while total correct mass grows, implying that learning may concentrate probability on already common solutions while forgetting rare-correct ones (Plyusov et al., 6 Feb 2026).
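The non-monotonicity is easy to see numerically. The sketch below assumes binary rewards and independent rollouts, and takes an "active" group to be one with both correct and incorrect samples. The symbols `p` (overall success probability) and `q` (rare-correct mass), and the values used, are illustrative assumptions.

```python
def active_without_rare(p, q, g):
    """P(group has both correct and incorrect rollouts, and no rare-correct one).

    (1 - q)**g : no rollout falls in the rare-correct subtype
    (1 - p)**g : all rollouts incorrect (subtracted: update inactive)
    (p - q)**g : all rollouts correct but none rare (subtracted: update inactive)
    """
    return (1 - q) ** g - (1 - p) ** g - (p - q) ** g

if __name__ == "__main__":
    p, q = 0.3, 0.05
    for g in (1, 2, 4, 8, 16, 32, 64):
        print(g, round(active_without_rare(p, q, g), 4))
```

The probability is zero for a single rollout, rises at moderate group sizes, and decays again once groups are large enough that a rare-correct sample is almost always present.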

A third line of analysis concerns what one paper terms Group Relative Advantage Estimation. Standardized group advantages satisfy

$\sum_{i=1}^{G}A_i=0$

and, for binary rewards, the total positive and negative absolute weights are exactly matched. If $\hat p$ is the fraction of correct rollouts in the group, then (ignoring $\epsilon$)

$\sum_{i:A_i>0}A_i=\sum_{i:A_i<0}|A_i|=G\sqrt{\hat p(1-\hat p)},$

which is symmetric under $\hat p\mapsto 1-\hat p$ and maximized at $\hat p=\tfrac12$. This induces two limitations: unsampled logits are unchanged, so the update cannot actively push probability mass toward unseen correct paths, and the method implicitly prioritizes medium-difficulty groups rather than adapting difficulty focus over training (Yu et al., 5 Feb 2026).
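The weight-matching property can be verified directly for binary rewards. The sketch below assumes `k` correct rollouts out of `g`, omits the stabilizing constant, and uses a hypothetical function name.

```python
import math

def binary_group_weights(k, g):
    """Total positive and total |negative| standardized advantage, for 0 < k < g."""
    p_hat = k / g
    sigma = math.sqrt(p_hat * (1 - p_hat))  # std of a 0/1 reward with mean p_hat
    positive = k * (1 - p_hat) / sigma       # k correct rollouts, each (1 - p_hat) / sigma
    negative = (g - k) * p_hat / sigma       # g - k incorrect rollouts, each p_hat / sigma
    return positive, negative

if __name__ == "__main__":
    for k in range(1, 8):
        pos, neg = binary_group_weights(k, 8)
        print(k, round(pos, 4), round(neg, 4))
```

Both totals equal `g * sqrt(p_hat * (1 - p_hat))`, so groups near 50% success carry the largest update mass regardless of the stage of training.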

Further failure modes arise in small-rollout or low-dispersion regimes. Noise in the shared mean baseline can cause advantage sign flips: measured against an oracle baseline, sign-flip rates rise as the rollout budget shrinks, and a causal injection experiment shows that even a modest sign-flip rate can degrade final accuracy (Kim, 30 Jan 2026). In discrete multi-constraint settings, z-score normalization additionally exhibits low-variance amplification, mean-centering blindness, and zero-variance collapse: if all group rewards are identical, $A_i=0$ for every $i$ and the policy gradient vanishes (Salmani-Zarchi et al., 4 Jun 2026).
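A toy simulation illustrates how a noisy mean baseline flips advantage signs when few rollouts are available. Everything in this sketch is an assumption made for illustration: rewards are drawn from a standard normal distribution, the oracle baseline is the true mean of zero, and the function name and trial counts are arbitrary. It does not reproduce the cited experimental setup.

```python
import random

def sign_flip_rate(k, trials=5000, seed=0):
    """Fraction of samples whose group-mean advantage disagrees in sign with the oracle."""
    rng = random.Random(seed)
    flips = 0
    total = 0
    for _ in range(trials):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(k)]
        mean = sum(rewards) / k
        for r in rewards:
            # Oracle advantage is r - 0; the estimate is r - mean.
            if (r - mean) * r < 0:
                flips += 1
            total += 1
    return flips / total

if __name__ == "__main__":
    for k in (2, 4, 8, 32):
        print(k, round(sign_flip_rate(k), 3))
```

In this toy model the flip rate is about one quarter with two rollouts and falls as the group grows, which is the qualitative pattern that motivates more robust baselines under small rollout budgets.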

4. Advantage redesign and robust variants

Proposed remedies largely preserve the group-relative backbone while modifying the baseline, reward shaping, or weighting function. Several representative variants are summarized below.

Variant	Core modification	Reported effect
MC-GRPO	Median baseline, MAD scaling, zero-advantage pivot dropped	Narrows the gap between small and large rollout budgets
F-GRPO	Focal weight on group-relative advantages	Pass@256 gains
HA-DW	History-aware weight on advantages	Average gains over GRPO on Qwen3-4B/8B
A-GRAE	Dynamic difficulty shift and asymmetric suppression of positive advantages	Improves GRPO and variants across seven benchmarks
RLRR	Rank-based PRR/HRR relative rewards with Ranking Reward Model	Percentage-point gains over absolute-score baselines
MDP-GRPO	Multi-temperature sampling, dual-anchor advantages, prospect shaping, asymmetric KL	Improved strict constraint satisfaction

MC-GRPO replaces the mean baseline by the sample median

$b=\operatorname{median}(r_1,\dots,r_G),$

uses the median absolute deviation for scaling, and excludes the unique pivot sample with zero advantage from backpropagation. The stated motivation is robustness to outliers and baseline-induced sign flips under small rollout budgets (Kim, 30 Jan 2026). F-GRPO instead retains mean-based group-relative structure but multiplies the advantage by a Focal-loss-inspired coefficient, down-weighting prompts with high empirical success and thereby counteracting concentration on already easy cases (Plyusov et al., 6 Feb 2026). HA-DW introduces an evolving difficulty anchor and reweights each advantage by a factor that amplifies under-estimated advantages on hard prompts and suppresses over-exploited actions on easy prompts (Yang et al., 13 Jan 2026).
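Two of these modifications are simple enough to sketch. The code below is an illustration under assumptions rather than either paper's implementation: the median/MAD routine assumes an odd group size so that one sample is the pivot, the fallback when the MAD is zero is a choice made for this sketch, and the focal coefficient is assumed to take the classic Focal-loss form `(1 - p_hat) ** gamma` with a hypothetical `gamma`.

```python
from statistics import median

def median_mad_advantages(rewards, eps=1e-6):
    """Median-centered, MAD-scaled advantages with one zero-advantage pivot removed.

    Returns (index, advantage) pairs for the samples that would be backpropagated.
    """
    med = median(rewards)
    mad = median([abs(r - med) for r in rewards])
    scale = mad + eps if mad > 0 else 1.0  # zero-MAD fallback: an assumption of this sketch
    advantages = [(r - med) / scale for r in rewards]
    # With an odd group size the median is one of the samples; drop that pivot.
    pivot = next((i for i, r in enumerate(rewards) if r == med), None)
    return [(i, a) for i, a in enumerate(advantages) if i != pivot]

def focal_weight(p_hat, gamma=2.0):
    """Assumed focal-style coefficient: small for prompts that are already easy."""
    return (1.0 - p_hat) ** gamma

if __name__ == "__main__":
    # One outlier reward barely moves the median baseline.
    print(median_mad_advantages([0.1, 0.5, 0.9, 0.2, 10.0]))
    print(focal_weight(0.9), focal_weight(0.1))
```

The median baseline ignores the outlier that would drag a mean baseline upward, and the focal coefficient shrinks the update on prompts whose empirical success rate is already high.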

Other proposals explicitly break symmetry or move from absolute scores to relative orderings. A-GRAE combines a sample-level dynamic difficulty shift with group-level asymmetric suppression of positive advantages, motivated by the claim that asymmetrically suppressing correct trajectories encourages exploration and that learning efficiency is maximized by an easy-to-hard transition over training (Yu et al., 5 Feb 2026). RLRR replaces absolute numerical rewards with relative rankings, through either a Pure Relative Reward (PRR) or a Hybrid Relative Reward (HRR), and supplements this with a listwise Ranking Reward Model that predicts intra-group orderings directly (Niu et al., 30 Jan 2026). MDP-GRPO addresses discrete low-dispersion rewards through multi-temperature sampling, a mixed dual-anchor advantage, prospect-theoretic shaping, and asymmetric KL penalties (Salmani-Zarchi et al., 4 Jun 2026).

Two broader generalizations alter the scalar baseline paradigm itself. LambdaPO defines a pairwise preference advantage and augments sparse binary rewards with a semantic density reward derived from ROUGE-L F$_1$ against a ground-truth solution trace (Yuan et al., 19 May 2026). f-GRPO replaces the standard ratio-times-advantage term with a variational $f$-divergence estimator between above-average and below-average reward distributions, yielding a class of divergence-based on-policy objectives with theoretical average-reward improvement guarantees (Haldar et al., 5 Feb 2026).

5. Credit assignment, geometry, and rollout efficiency

Another strand of work argues that the main deficiency is not only the baseline but the granularity of credit assignment. EP-GRPO identifies three failures in vanilla GRPO: uniform token-level granularity, uniform polarity, and zero-variance collapse. Empirically, randomly replacing the highest-entropy tokens in correct solutions caused a larger accuracy drop than perturbing the lowest-entropy tokens; on 500 MATH problems, a share of locally incorrect steps in correct sequences were rewarded and a share of locally correct steps in incorrect sequences were penalized; and in a standard run on MATH, a fraction of training steps had zero reward variance, wasting millions of token-updates. EP-GRPO adds entropy-gated outcome weighting, implicit process signals anchored by sequence outcome, and cumulative entropy mapping, combining them in a final token-level advantage. The method is explicitly designed to maintain gradient flow even when sequence-level reward variance collapses (Yu et al., 6 May 2026).

AERO addresses a different failure: wasted rollout compute when fixed-size groups are homogeneous. It partitions prompts after an exploration phase into rescue, partial, and high-success subsets, performs iterative rescue sampling for hard prompts, rejection-based pruning for mixed groups, and Bayesian posterior stabilization for all-correct or all-incorrect groups. With a prior on the success rate, the posterior mean replaces the degenerate empirical success rate in homogeneous groups, yielding nonzero standardized advantages. Under the same total rollout budget, AERO reports lower total training compute and lower wall-clock time per step on average while matching or improving Pass@8 and Avg@8 over GRPO (Zhang et al., 15 Feb 2026).
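The effect of posterior stabilization on a homogeneous group can be sketched as follows. The Beta prior, its default parameters `alpha = beta = 1`, and the way the posterior mean is plugged into the standardized advantage are assumptions chosen for illustration. The cited paper's exact prior and normalization are not reproduced here.

```python
import math

def posterior_mean_success(k, g, alpha=1.0, beta=1.0):
    """Posterior mean of the success rate under an assumed Beta(alpha, beta) prior."""
    return (alpha + k) / (alpha + beta + g)

def stabilized_advantages(rewards, alpha=1.0, beta=1.0):
    """Standardized advantages for 0/1 rewards using the posterior mean as baseline."""
    g = len(rewards)
    k = sum(rewards)
    p = posterior_mean_success(k, g, alpha, beta)
    sigma = math.sqrt(p * (1 - p))  # strictly positive because 0 < p < 1
    return [(r - p) / sigma for r in rewards]

if __name__ == "__main__":
    print(stabilized_advantages([1] * 8))  # all-correct group: small positive signal
    print(stabilized_advantages([0] * 8))  # all-incorrect group: small negative signal
```

Because the posterior mean never reaches 0 or 1, an all-correct or all-incorrect group still produces a nonzero advantage instead of dropping out of the update.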

SALT analyzes why simply increasing the number of rollouts often fails to strengthen learning. The stated diagnosis is that per-rollout policy-gradient features concentrate into a low-rank, signed geometry, so group-relative coefficients cancel in a dominant shared subspace. SALT measures this with the participation ratio of the Gram matrix and an effective sample-size statistic. It then estimates a dominant subspace, decomposes coefficients into shared and residual channels, and reweights the two channels before plugging them back into the surrogate. Reported gains are measured in percentage points across reasoning benchmarks, together with a higher participation ratio and a larger effective sample size (Chang et al., 4 Jun 2026).
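The participation ratio of a Gram matrix has a closed form that needs no eigendecomposition. The sketch below uses the standard definition $(\sum_k\lambda_k)^2/\sum_k\lambda_k^2=\operatorname{tr}(K)^2/\lVert K\rVert_F^2$ and treats each rollout's gradient feature as a plain list of floats. The helper names are hypothetical, and the paper's effective sample-size statistic is not reproduced.

```python
def gram_matrix(features):
    """Gram matrix K[i][j] = <f_i, f_j> of per-rollout feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]

def participation_ratio(gram):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) for a symmetric matrix."""
    trace = sum(gram[i][i] for i in range(len(gram)))
    frobenius_sq = sum(x * x for row in gram for x in row)  # equals tr(K^2)
    return trace * trace / frobenius_sq

if __name__ == "__main__":
    collapsed = [[1.0, 0.0, 0.0]] * 3  # every rollout has the same feature
    spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(participation_ratio(gram_matrix(collapsed)))
    print(participation_ratio(gram_matrix(spread)))
```

A ratio near one indicates that all rollouts point along a single direction, so adding more of them contributes little new gradient information; a ratio near the group size indicates well-spread features.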

6. Applications and empirical landscape

Much of the recent literature evaluates group-relative REINFORCE on verifiable mathematical reasoning, including GSM8K, MATH-500, AIME24, AIME25, AMC23, Minerva, OlympiadBench, GPQA, and multimodal math benchmarks. On Qwen2.5-7B, F-GRPO improved pass@256 over GRPO, DAPO, and CISPO baselines while preserving or improving pass@1 and without increasing compute (Plyusov et al., 6 Feb 2026). A-GRAE reported consistent gains across seven text and vision-language benchmarks, including Geo3K, MathVision, MathVerse, and HuatuoGPT-Vision (Yu et al., 5 Feb 2026). MDP-GRPO was evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, with improved strict constraint satisfaction on Llama-3.2-3B while preserving MMLU and ARC performance (Salmani-Zarchi et al., 4 Jun 2026).

The framework is not limited to text reasoning. InfLVG applies GRPO to inference-time context selection for long video generation. A lightweight policy scores past tokens, Plackett–Luce sampling selects the top-$k$ context tokens without replacement, and a hybrid reward combines identity preservation, prompt alignment, and artifact suppression. The method reports extended video length, together with the introduction of the Cross-scene Video Benchmark and Event Prompt Set (Fang et al., 23 May 2025).
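Plackett–Luce sampling without replacement can be sketched as a sequence of softmax draws over the items not yet chosen. The function below is a generic illustration: the score values, the function name, and the use of Python's `random.Random` are assumptions, and the scoring policy itself is outside the scope of the sketch.

```python
import math
import random

def plackett_luce_top_k(scores, k, rng):
    """Sample k distinct indices; each draw is a softmax over the remaining scores."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        top = max(scores[i] for i in remaining)  # subtract the max for numerical stability
        weights = [math.exp(scores[i] - top) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # without replacement
    return chosen

if __name__ == "__main__":
    rng = random.Random(0)
    print(plackett_luce_top_k([2.0, 1.0, 0.5, 0.0, -1.0], k=3, rng=rng))
```

Because the selection is stochastic, repeated calls for the same context yield different subsets, which is what provides the group of rollouts that a group-relative update compares.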

Open-ended generation motivates a different use of the same group-based machinery. RLRR evaluates writing tasks across Academic, Finance, Politics, Literature, Education, and Advertising, replacing unstable scalar reward-model outputs with bounded relative rankings. It reports percentage-point gains in WritingBench domains, and the fine-tuned Ranking Reward Model is reported to achieve the best scores in some domains (Niu et al., 30 Jan 2026). Code-generation evaluations also appear in the SALT study through MBPP with HumanEval tests, indicating that the group-relative update rule and its geometric pathologies extend beyond mathematical QA (Chang et al., 4 Jun 2026).

Taken together, the literature presents Group-Relative REINFORCE not as a single frozen algorithm but as a family of critic-free or critic-light policy-gradient procedures organized around one core operation: compare multiple rollouts for the same prompt, transform those comparisons into relative advantages, and apply a policy-gradient update under clipping or regularization. The main research frontier has therefore shifted from defining the estimator to controlling its bias, symmetry, granularity, exploration behavior, reward representation, and compute efficiency.