Group-Relative Baseline in Policy Optimization

Updated 4 July 2026

Group-relative baseline is a technique that computes advantages by subtracting the group mean or applying z-score normalization to convert absolute rewards into relative credit signals.
It underpins GRPO and related critic-free RL algorithms, enabling robust performance in language model training, image captioning, and hyperparameter tuning.
Variants like MC-GRPO, P-GRPO, and OTB address issues such as noise, bias, and token-level heterogeneity to improve stability and adaptability.

Group-relative baseline is a critic-free baseline construction used in group-based policy optimization, most prominently in Group Relative Policy Optimization (GRPO) and related RLHF/RLVR methods for LLMs. For a fixed prompt, the policy samples multiple completions, evaluates each with a scalar reward, and computes advantages relative to statistics of that prompt-specific sample group rather than a learned value function. In its canonical form, if a prompt $x$ yields rewards $r_1,\dots,r_G$, the shared baseline is the group mean $\bar r=\frac{1}{G}\sum_i r_i$, and the advantage is $A_i=r_i-\bar r$ or $A_i=(r_i-\bar r)/(s_r+\epsilon)$, where $s_r$ is the group standard deviation and $\epsilon$ is a small constant for numerical stability. This converts absolute rewards into a relative credit-assignment signal, ties all rollouts for the same prompt to a common reference, and underlies a large family of PPO-style critic-free algorithms (Kim, 30 Jan 2026, Wang et al., 17 Feb 2026, Ge et al., 30 Jan 2026).

1. Canonical formulation and update rule

In GRPO-family methods, a “group” is the set of $G$ concurrent completions sampled for the same prompt under a behavior policy. For a prompt $x$, the policy $\pi_{\theta_{\text{old}}}$ samples completions $y_1,\dots,y_G$, evaluates rewards $r_1,\dots,r_G$, and computes a baseline shared across the group. The standard baseline is the group mean reward,

$\bar r=\frac{1}{G}\sum_{i=1}^{G} r_i,$

with corresponding advantages

$A_i=r_i-\bar r.$

Many implementations normalize by the group standard deviation to reduce scale sensitivity; in other formulations the same idea appears as

$\hat A_{i,t}=\frac{r_i-\operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},$

with the same sequence-level advantage reused for every token of the completion (Kim, 30 Jan 2026, Wang et al., 17 Feb 2026).
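The two advantage forms above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages`, the default `eps`, and the use of the population standard deviation (`ddof=0`) are assumptions made here for the example.

```python
import numpy as np

def group_relative_advantages(rewards, normalize=True, eps=1e-6):
    """Return A_i = r_i - mean, optionally divided by (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    centered = r - r.mean()              # subtract the shared group baseline
    if not normalize:
        return centered
    # Population std (ddof=0) is an assumption of this sketch.
    return centered / (r.std() + eps)

rewards = [1.0, 0.0, 0.0, 1.0]           # G = 4 completions for one prompt
adv_centered = group_relative_advantages(rewards, normalize=False)
adv_zscore = group_relative_advantages(rewards)
```

Both variants sum to zero within the group, so the update only redistributes probability mass among the sampled completions of that prompt.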

These advantages are then inserted into standard policy-gradient objectives. Two widely used forms are a REINFORCE-style estimator,

$\nabla_\theta J(\theta)\approx\frac{1}{G}\sum_{i=1}^{G} A_i\,\nabla_\theta\log\pi_\theta(y_i\mid x),$

and a PPO-style clipped surrogate,

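$L^{\text{clip}}(\theta)=\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\big(\rho_{i,t}A_i,\ \operatorname{clip}(\rho_{i,t},1-\varepsilon_{\text{clip}},1+\varepsilon_{\text{clip}})\,A_i\big),\qquad \rho_{i,t}=\frac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,y_{i,<t})}.$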

In GRPO, the sequence-level advantage is broadcast to token-level updates via per-token importance ratios, and KL penalties or variant-specific regularizers can be added without changing the baseline construction itself (Kim, 30 Jan 2026).
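To make the broadcasting step concrete, the following sketch evaluates a PPO-style clipped objective with one sequence-level advantage per completion reused at every token. It assumes the standard clipped form with per-token importance ratios; the function name `clipped_surrogate`, the token-mean normalization, and `clip_eps=0.2` are illustrative choices, and KL terms are omitted.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Mean clipped PPO-style objective over valid tokens.

    logp_new, logp_old: (G, T) per-token log-probabilities.
    advantages: (G,) sequence-level advantages, broadcast over tokens.
    mask: (G, T) with 1 for real tokens and 0 for padding.
    """
    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    adv = advantages[:, None]                    # same advantage at every token
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = np.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum()

logp_old = np.full((2, 3), np.log(0.5))
mask = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
adv = np.array([1.0, -1.0])

obj_same = clipped_surrogate(logp_old, logp_old, adv, mask)
obj_shift = clipped_surrogate(logp_old + np.log(2.0), logp_old, adv, mask)
```

In the second call the ratio is 2 everywhere: the positive-advantage row is capped at `1 + clip_eps`, while the negative-advantage row keeps the unclipped (more pessimistic) value, which is the usual asymmetry of the clipped objective.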

This same pattern appears outside mainstream reasoning RL as a general critic-free device. In image captioning, a group of $G$ captions is sampled for each image, rewards are CIDEr scores, and the advantage is […] (Liang, 3 Mar 2025). In batched speech-emotion classification, the batch itself becomes the group, the baseline is the batch-average reward $\bar r$, and normalized advantages are used to gate sample selection (Gao et al., 6 Feb 2026). In hyperparameter optimization, GRPOformer defines the group baseline as the mean within-group reward and the group-relative advantage as […] (Guo et al., 21 Sep 2025).

2. Statistical role and theoretical interpretations

The immediate purpose of the group-relative baseline is variance reduction without a learned critic. By comparing each completion to a shared within-prompt reference, GRPO removes prompt-specific biases, normalizes heterogeneous reward scales, and focuses learning on “best-vs-rest” distinctions within a prompt rather than on global reward levels (Kim, 30 Jan 2026). In the language of LambdaPO, subtracting the group mean reduces variance in REINFORCE because it subtracts a statistic that does not depend on the individual token choice, even though it collapses the cohort to scalar moments $(\bar r, s_r)$ (Yuan et al., 19 May 2026).

Several papers provide more formal interpretations of what this baseline is doing. A local-curvature analysis argues that standard-deviation normalization implements an adaptive gradient matched to the local curvature of the sequence-level policy gradient: dividing by within-prompt reward standard deviation acts as an inverse-curvature proxy, so prompts with larger local variability receive smaller effective steps and prompts with smaller variability receive larger ones. Under mild conditions, that analysis shows a strictly improved convergence rate over unnormalized REINFORCE, and empirically identifies three phases—early acceleration, transition, and late-stage interference—in which the benefit of normalization varies with reward variance and feature orthogonality (Ge et al., 30 Jan 2026).

A separate theoretical line studies the GRPO gradient itself. One result is that the GRPO policy gradient is inherently a second-order U-statistic. Using a leave-one-out mean baseline, the estimator can be written with a symmetric kernel over sample pairs, and its Hoeffding decomposition shows that the leading term coincides with an oracle actor–critic estimator whose baseline is the true value function. On that view, GRPO is asymptotically equivalent to an oracle policy-gradient algorithm and attains asymptotically optimal performance within a broad class of policy-gradient algorithms; the same analysis also derives a universal scaling law for the optimal group size under a fixed sampling budget (Zhou et al., 1 Mar 2026).

These theoretical perspectives are complementary. One treats normalization as an adaptive preconditioner, another treats grouped centering as a U-statistic structure with oracle asymptotics, and both explain why the group-relative baseline is more than a purely heuristic replacement for a value network (Ge et al., 30 Jan 2026, Zhou et al., 1 Mar 2026).

3. Failure modes, bias, and instability

The principal criticisms of group-relative baselines concern what happens when the group statistics themselves are poor estimators. In low-budget regimes with small rollout counts, the shared mean baseline becomes noisy and sensitive to outliers. MC-GRPO isolates a “sign-flip problem”: when a single high-reward outlier shifts the group mean, some completions can receive the wrong advantage sign, so the intended update direction is reversed. Formally, a sign flip occurs when

[…]

Across Qwen3-1.7B/GSM8K, Qwen2.5-7B-Instruct/Math-500, and Llama-3.2-3B/Math-500, sign-flip rates are highest at […], and injecting synthetic sign flips degrades accuracy monotonically; even a […] sign-flip rate induces an approximately […] accuracy drop (Kim, 30 Jan 2026).
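A small numeric illustration of the outlier effect follows. It does not reproduce the paper's formal criterion; as a stand-in, it flags completions whose advantage sign under the mean baseline differs from the sign under a median baseline. The reward values are made up for the example.

```python
import numpy as np

# Hypothetical rewards: the last rollout is a single high-reward outlier.
rewards = np.array([0.2, 0.3, 0.4, 5.0])

mean_adv = rewards - rewards.mean()          # outlier drags the mean upward
median_adv = rewards - np.median(rewards)    # median is barely affected

# Completions whose advantage sign differs between the two baselines.
flipped = np.sign(mean_adv) != np.sign(median_adv)
```

Here the third completion sits above the median of its group but receives a negative advantage under the mean baseline, so its update direction depends entirely on whether the outlier happened to be sampled.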

A different critique concerns statistical bias in RLVR. Under Bernoulli verifier rewards with success probability $p$, and conditioning on the effective-learning subset of groups with mixed outcomes, $0<\sum_i r_i<G$, the group-relative baseline $\bar r$ has conditional expectation

$\mathbb{E}\big[\bar r \mid 0<\textstyle\sum_i r_i<G\big]=\frac{p-p^{G}}{1-p^{G}-(1-p)^{G}},$

which implies that the corresponding group-relative advantage systematically underestimates hard prompts and overestimates easy prompts. The paper states the result directly: the conditional baseline exceeds $p$ (so the advantage is underestimated) if $p<1/2$, falls below $p$ (so the advantage is overestimated) if $p>1/2$, and equality holds only at $p=1/2$. In the practical small-group regime […], the probability of underestimation for hard prompts exceeds […], and exceeds […] for more extreme regimes […] or […] (Yang et al., 13 Jan 2026).
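The direction of this bias can be checked by exact enumeration. The sketch below assumes i.i.d. Bernoulli($p$) rewards and takes the effective-learning subset to mean groups with mixed outcomes (neither all correct nor all incorrect); both the function names and that reading of the conditioning event are assumptions of this example. Under those assumptions the conditional mean has the closed form $(p-p^G)/(1-p^G-(1-p)^G)$.

```python
from math import comb

def conditional_baseline_mean(p, G):
    """E[group mean | mixed group] for G i.i.d. Bernoulli(p) rewards, 0 < p < 1."""
    num = 0.0
    den = 0.0
    for k in range(1, G):                        # k successes, excluding 0 and G
        prob = comb(G, k) * p**k * (1 - p)**(G - k)
        num += prob * (k / G)
        den += prob
    return num / den

def closed_form(p, G):
    """Closed form of the same conditional expectation."""
    return (p - p**G) / (1 - p**G - (1 - p)**G)

hard = conditional_baseline_mean(0.2, 4)   # hard prompt: baseline is inflated
easy = conditional_baseline_mean(0.8, 4)   # easy prompt: baseline is deflated
```

An inflated baseline on hard prompts means correct completions there receive less credit than their true advantage, which is the underestimation the text describes.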

Low-dispersion reward regimes generate a third cluster of pathologies. MDP-GRPO formalizes three of them for multi-constraint instruction following with discrete rewards: low-variance amplification, mean-centering blindness, and zero-variance collapse. If $s_r$ is tiny but nonzero, z-score normalization inflates small reward differences; if two groups have the same centered deviations but very different absolute means, they produce identical normalized advantages; and if all rewards are identical, $A_i=0$ for all completions and the learning signal vanishes entirely (Salmani-Zarchi et al., 4 Jun 2026).
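The three pathologies are easy to reproduce with the plain z-score advantage. The reward vectors below are invented for illustration, and `zscore_adv` is a hypothetical helper using the population standard deviation.

```python
import numpy as np

def zscore_adv(rewards, eps=1e-6):
    """Standard group-relative z-score advantage."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Low-variance amplification: a 0.001 reward gap becomes a near-unit advantage.
tiny_gap = zscore_adv([0.500, 0.501, 0.500, 0.501])

# Mean-centering blindness: a weak group and a strong group look identical.
low_group = zscore_adv([0.0, 0.1, 0.0, 0.1])
high_group = zscore_adv([0.9, 1.0, 0.9, 1.0])

# Zero-variance collapse: identical rewards give no learning signal.
collapsed = zscore_adv([1.0, 1.0, 1.0, 1.0])
```

Note that the collapsed group would also produce all-zero advantages if every completion had failed, so the estimator cannot distinguish uniform success from uniform failure.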

Long-horizon LLM-RL raises a fourth issue. Group-relative baselines are usually sequence-level, so they ignore token heterogeneity and how gradient noise accumulates across positions. The Optimal Token Baseline paper argues that standard group-based baselines overlook sequence heterogeneity, collapse sooner under small $G$ or very long horizons, and are suboptimal relative to a token-aware baseline weighted by cumulative gradient energy (Li et al., 6 Feb 2026).

These strands do not all study the same object. This suggests a distinction between the grouped gradient estimator as a whole and the per-sample advantage estimator derived from finite, selected groups: one can obtain oracle-style asymptotic properties for the former while still diagnosing conditional bias, sign flips, or collapse in the latter (Zhou et al., 1 Mar 2026, Yang et al., 13 Jan 2026).

4. Robustified, adaptive, and personalized baselines

A large portion of the recent literature keeps the group-relative idea but modifies the baseline estimator, the normalization rule, or the update weighting to address specific failure modes.

Variant	Baseline or normalization	Main target
MC-GRPO	median baseline and MAD, with pivot exclusion	outlier sensitivity and sign flips
P-GRPO	preference-group-specific historical mean/std	heterogeneous user preferences
EBPO	shrinkage of the group baseline toward global reward statistics	small-$G$ variance and saturated failure
MDP-GRPO	dual-anchor advantage plus prospect shaping	low-dispersion discrete rewards
OTB	token-level baseline weighted by realized energy	long-horizon token heterogeneity
HA-DW	history-aware reweighting of advantages	difficulty-dependent bias

MC-GRPO replaces mean/std with median/MAD. For each prompt it samples $G+1$ rollouts, defines

$m=\operatorname{median}(r_1,\dots,r_{G+1}),\qquad \mathrm{MAD}=\operatorname{median}_i\,|r_i-m|,$

and computes

$A_i=\frac{r_i-m}{\mathrm{MAD}+\epsilon}.$

With an odd-sized group, exactly one rollout is the median and receives zero advantage; that pivot is excluded from backpropagation so the number of gradient-contributing samples remains $G$. Empirically, MC-GRPO narrows the gap between […] and […] on Qwen3-1.7B/GSM8K to within approximately […] (Kim, 30 Jan 2026).
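A minimal sketch of median/MAD centering with pivot exclusion is given below. It assumes the advantage takes the form (reward minus median) divided by (MAD plus a small constant); the function name, the `eps` value, and the tie-handling rule (the first rollout equal to the median is treated as the pivot) are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def median_mad_advantages(rewards, eps=1e-6):
    """Median-centered, MAD-scaled advantages plus a mask that drops the pivot."""
    r = np.asarray(rewards, dtype=np.float64)
    if r.size % 2 == 0:
        raise ValueError("this sketch assumes an odd-sized group")
    med = np.median(r)                        # robust center
    mad = np.median(np.abs(r - med))          # robust scale
    adv = (r - med) / (mad + eps)
    pivot = int(np.argmin(np.abs(r - med)))   # a rollout sitting at the median
    keep = np.ones(r.size, dtype=bool)
    keep[pivot] = False                       # pivot contributes no gradient
    return adv, keep

# Hypothetical rewards with one outlier; five rollouts leave four for the update.
adv, keep = median_mad_advantages([0.2, 0.3, 0.4, 0.1, 5.0])
```

The outlier still receives a large positive advantage, but it no longer changes the sign assigned to the other rollouts.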

P-GRPO addresses a different failure mode: exchangeability assumptions break when rewards come from heterogeneous preference groups. It replaces the instantaneous batch baseline with a preference-group-specific historical baseline,

$A_i=\frac{r_i-\mu_g}{\sigma_g+\epsilon},$

where the running mean $\mu_g$ and standard deviation $\sigma_g$ of preference group $g$ are maintained online, with Welford’s algorithm preferred in the paper. This decouples advantage estimation from immediate batch composition and is intended to preserve minority preference signals that would otherwise be suppressed by a global in-group baseline (Wang et al., 17 Feb 2026).
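Welford's algorithm keeps a running mean and variance in constant memory, which is what makes a per-preference-group historical baseline cheap to maintain. The sketch below is illustrative only: the class and function names are hypothetical, and updating the statistics before computing the advantage (rather than after) is an assumption of this example.

```python
from collections import defaultdict
import math

class RunningStats:
    """Welford's online mean/variance for one preference group."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 before any update.
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0

stats = defaultdict(RunningStats)    # one tracker per preference-group id

def historical_advantage(group_id, reward, eps=1e-6):
    """Advantage against the history of the reward's own preference group."""
    s = stats[group_id]
    s.update(reward)
    return (reward - s.mean) / (s.std() + eps)
```

Because each group id has its own tracker, a minority preference group is centered against its own history rather than against whatever else appears in the current batch.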

EBPO regularizes local group statistics by shrinking them toward global reward statistics updated online. Its empirical Bayes baseline is

[…]

followed by batch-level normalization of […]. The paper proves strictly lower MSE than the local sample mean, non-vanishing penalty signals in saturated failure, and bounded entropy decay relative to GRPO (Han et al., 5 Feb 2026).

MDP-GRPO stabilizes low-dispersion verifiable rewards with several coupled modifications. It uses multi-temperature sampling to increase reward dispersion, a goal-aware anchor

[…]

and a mixed advantage

[…]

where both components are passed through a bounded, asymmetric prospect-theoretic shaping function and combined with asymmetric KL regularization. The design specifically targets low-variance amplification, mean-centering blindness, and zero-variance collapse (Salmani-Zarchi et al., 4 Jun 2026).

OTB moves from sequence-level to token-level baselines. It defines realized energy

[…]

derives the optimal token baseline

[…]

and proposes the logit-gradient proxy […] so that […] can be computed from forward-pass probabilities alone. In experiments, OTB with […] matches GRB with […], reducing token consumption by more than […] across both single-turn and tool-integrated reasoning tasks (Li et al., 6 Feb 2026).

HA-DW does not replace the group baseline directly; instead it rescales grouped advantages using a history-aware difficulty anchor […]. With

[…]

the method amplifies learning on hard prompts and damps over-exploitation on easy ones. The paper proves bias reduction under an explicit range of […] and reports consistent improvements across GRPO, GSPO, and DAPO on five math reasoning benchmarks (Yang et al., 13 Jan 2026).

5. From scalar baselines to pairwise, ranking, and consensus signals

Another research direction holds that the scalar group-relative baseline is itself an information bottleneck. LambdaPO explicitly contrasts GRPO’s “monolithic statistical baseline” with a decomposed, pairwise structure. Instead of

$A_i=\frac{r_i-\bar r}{s_r+\epsilon},$

it defines a pairwise decomposed advantage

[…]

where […] and […]. The claim is that GRPO’s scalar baseline is permutation invariant and suppresses fine-grained rank orderings, whereas the pairwise structure preserves relational topology and yields directionally coherent, self-annealing gradients (Yuan et al., 19 May 2026).

A related shift replaces absolute rewards with relative rankings. RLRR argues that absolute numerical rewards create two problems: sparse supervision in verifiable tasks, where groups often become unanimously correct or incorrect, and instability in open-ended tasks, where scalar reward models have drifting score ranges. It therefore maps group outputs to ranks and then to bounded relative rewards, such as

[…]

for pure relative reward, or a correctness-preserving hybrid mapping for verifiable tasks. Advantages are then computed from these rank-based signals using the same group-relative machinery. The paper reports that the fraction of effective prompts under absolute rewards drops below […] at later training stages, whereas the rank-based formulation maintains near […] utilization (Niu et al., 30 Jan 2026).

C-GRPO moves in a different direction by redefining the reward itself as a consensus score inside the sampled group. For a group $\{y_1,\dots,y_G\}$, it sets

[…]

uses […], and then applies the ordinary group-relative baseline to those consensus utilities. Under ideal conditions, the resulting objective is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding. The practical motivation is amortization: consensus reranking is moved from inference time into training time, removing the quadratic (pairwise) inference-time scoring cost associated with MBR (Ichihara et al., 3 Feb 2026).
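The general idea of a within-group consensus reward can be sketched as follows. This is not the C-GRPO formula: the leave-one-out averaging, the toy Jaccard token-overlap utility, and all names are assumptions chosen to keep the example self-contained. It requires at least two completions per group.

```python
def jaccard(a, b):
    """Toy utility: token-set overlap between two strings (an assumption)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def consensus_rewards(completions, utility=jaccard):
    """Score each completion by its average utility against the others."""
    G = len(completions)
    rewards = []
    for i, yi in enumerate(completions):
        others = [utility(yi, yj) for j, yj in enumerate(completions) if j != i]
        rewards.append(sum(others) / (G - 1))
    return rewards

group = ["the cat sat", "the cat sat down", "a dog ran"]
r = consensus_rewards(group)

# Ordinary mean-centered group-relative advantages on the consensus scores.
baseline = sum(r) / len(r)
advantages = [ri - baseline for ri in r]
```

The pairwise scoring loop is the quadratic cost that MBR pays at inference time; here it is paid once per training group instead.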

BiCC and RCC retain the group-relative structure but expose additional structure hidden inside it. The paper shows that, under binary rewards, GRPO can be rewritten as maximizing the margin between average clipped ratios of correct and incorrect samples within a group. Bilateral Context Conditioning then cross-conditions correct samples on incorrect traces and vice versa, while Reward-Confidence Correction adjusts the baseline to

[…]

where […] is a confidence signal derived from sequence log-probability shift. RCC further reduces gradient variance by about […] for Qwen and […] for Phi in the reported analyses (Li et al., 13 Mar 2026).
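One reason binary rewards admit a correct-versus-incorrect rewriting is that the z-score advantage then takes only two values within a group. With an empirical success rate $\hat p$, correct samples receive $\sqrt{(1-\hat p)/\hat p}$ and incorrect samples receive $-\sqrt{\hat p/(1-\hat p)}$. The check below verifies this standard identity numerically; it is a consequence of z-scoring 0/1 rewards and is not taken from the paper's derivation. The function name is hypothetical and no stabilizing constant is used, so the group must be mixed.

```python
import numpy as np

def binary_group_advantages(num_correct, G):
    """Z-score advantages for a group with num_correct ones and the rest zeros."""
    r = np.array([1.0] * num_correct + [0.0] * (G - num_correct))
    return (r - r.mean()) / r.std()      # requires 0 < num_correct < G

G, k = 8, 2
adv = binary_group_advantages(k, G)

p_hat = k / G
expected_correct = np.sqrt((1 - p_hat) / p_hat)
expected_incorrect = -np.sqrt(p_hat / (1 - p_hat))
```

Rare correct answers therefore receive a large positive weight, while the many incorrect ones each receive a small negative weight.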

Taken together, these methods treat the scalar group mean not as a fixed endpoint but as one point in a wider design space: robust centers, historical baselines, token-aware baselines, pairwise comparisons, listwise rankings, and consensus utilities all preserve the critic-free spirit while altering the information retained from the sampled group (Yuan et al., 19 May 2026, Niu et al., 30 Jan 2026, Ichihara et al., 3 Feb 2026, Li et al., 13 Mar 2026).

6. Applications, empirical behavior, and broader usage

The group-relative baseline now appears across a wide range of domains. In reasoning RL and instruction following it is central to GRPO, DAPO, GSPO, DR-GRPO, and their variants. In image captioning, replacing SCST’s single greedy baseline with a […] group-relative baseline improved CIDEr from […] to […] on the MSCOCO Karpathy test split, while also improving BLEU-4 from […] to […] and SPICE from […] to […] (Liang, 3 Mar 2025). In crowd counting, the same mean/std group-relative baseline combined with a fuzzy reward function produced FGRPR, which surpassed all baseline models across five in-domain datasets and performed comparably to SFT out of domain, with particular advantage at larger target counts (Wang et al., 31 Mar 2025).

In neural combinatorial optimization, GRPO’s intra-group normalization has been evaluated as a baseline-free alternative to rollout baselines. On TSP-100 within RL4CO, REINFORCE with a rollout baseline collapsed from cost […] to […] immediately after warmup and did not recover under extended training, whereas GRPO avoided that collapse and achieved solution quality within […] of POMO at matched gradient updates (Sepúlveda et al., 9 Jun 2026). In hyperparameter optimization, GRPOformer uses the mean within-group reward as its baseline, and ablating GRPO reduces BtR from […] to […] while worsening MnR from […] to […] (Guo et al., 21 Sep 2025).

In unsupervised speech emotion recognition, B-GRPO adapts the baseline to the batch level, uses the batch-average reward $\bar r$ and positive-truncated normalized advantages, and reports an overall […] improvement relative to the baseline without the RL stage (Gao et al., 6 Feb 2026). In bias mitigation for LLMs, BiasGRPO replaces PPO’s critic baseline with a group-relative mean/std baseline over […] completions per prompt and reports substantially lower training variance: average reward standard deviation is […] for BiasGRPO versus […] for PPO, while GRPO also outperforms DPO and PPO on BOLD, RealToxicityPrompts, and BBQ without degrading TruthfulQA (Reddy et al., 3 Jun 2026).

A recurring empirical pattern is that group size governs both quality and failure modes. Small $G$ is cheap but exacerbates noisy baselines, sign flips, homogeneous groups, and weak variance estimates; large $G$ stabilizes the baseline but increases sampling cost. Multiple papers therefore recommend small-to-moderate groups such as […] or […], often paired with high-throughput generation or with more robust baseline estimators, temperature schedules, or batch-level normalization when reward distributions are discrete, sparse, or highly subjective (Kim, 30 Jan 2026, Salmani-Zarchi et al., 4 Jun 2026, Sepúlveda et al., 9 Jun 2026).
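One side of this trade-off can be quantified under a simple model. Assuming binary rewards that are i.i.d. Bernoulli($p$) within a prompt (an assumption of this sketch, as are the function name and the chosen values of $p$ and $G$), the probability that a group is homogeneous, and so yields no learning signal, is $p^G+(1-p)^G$.

```python
def prob_homogeneous(p, G):
    """P(all G i.i.d. Bernoulli(p) rewards are identical)."""
    return p**G + (1.0 - p)**G

# Share of groups with no learning signal for an easy prompt (p = 0.9).
wasted = {G: prob_homogeneous(0.9, G) for G in (2, 4, 8, 16)}
```

The wasted share falls as $G$ grows but does so slowly when $p$ is close to 0 or 1, which is one way to see why larger groups alone do not fix sparse-signal regimes.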

Outside policy optimization, the phrase “group-relative baseline” also appears in a distinct sense in fair learning. There, a single global baseline predictor is evaluated separately on each subgroup, and fairness is expressed through relative improvement

[…]

which normalizes each group’s actual improvement by its own attainable improvement. That formulation recovers a Kalai–Smorodinsky bargaining solution rather than a policy-gradient baseline, but it preserves the core idea that a common baseline can be made group-relative through groupwise evaluation (Han et al., 4 Feb 2026).

In the dominant GRPO sense, however, group-relative baseline denotes the within-group statistic used to center and often standardize rewards for multiple samples generated from the same prompt. Its importance lies in the fact that it removes the need for a learned critic, makes relative credit assignment immediate, and exposes a rich design space of robust, adaptive, personalized, pairwise, and ranking-based generalizations. The contemporary literature treats it neither as a settled primitive nor as a disposable heuristic, but as a central object whose statistical properties determine the stability, efficiency, and representational granularity of critic-free policy optimization (Kim, 30 Jan 2026, Ge et al., 30 Jan 2026, Zhou et al., 1 Mar 2026).