
GRPO Baselines in RL

Updated 7 October 2025
  • GRPO Baselines are a reinforcement learning framework that uses group-relative normalization to convert raw rewards into ordinal advantages.
  • The methodology shifts focus from absolute reward values to relative ranking among output samples, ensuring invariance to reward scaling and translation.
  • Its design, incorporating reverse KL regularization, has been effectively applied in systems like DeepSeek-R1-Zero, enhancing stability and data efficiency.

Group Relative Policy Optimization (GRPO) is a reinforcement learning methodology for preference-based policy optimization. It has been adopted in LLMs, multimodal reasoning, and generative models, with its alignment strategy underpinning systems such as DeepSeek-R1-Zero and DeepSeekMath. GRPO stands apart from classical reinforcement learning from human feedback (RLHF) by using a groupwise, shift-and-scale normalized "advantage" to aggregate preferences among output samples, combined with a reverse-KL regularization penalty that discourages divergence from a reference policy. Recent work (Vojnovic et al., 25 Feb 2025) clarifies the formalism, contrasts it with logarithmic pooling, and elucidates both the theoretical properties and practical implications of varying the aggregation and penalty mechanisms.

1. Core Principles and Preference Aggregation

GRPO is predicated on aggregating feedback preferences not by weighted geometric/logarithmic pooling (as in RLHF), but through normalized, group-relative scaling of reference policy probabilities. The stationary policy arising from GRPO satisfies the fixed-point equation

$$\pi_\theta(o \mid q) = g\left(\frac{\mathcal{P}_G(o \mid \pi_\text{old}(\cdot \mid q), q) - \mathbb{E}_{o' \sim \pi_\theta}\left[\mathcal{P}_G(o' \mid \pi_\text{old}(\cdot \mid q), q)\right]}{\beta}\right) \cdot \pi_\text{ref}(o \mid q)$$

where $g(x) = \frac{1}{1-x}$ and $\mathcal{P}_G(o \mid \cdot)$ encodes the group-relative preference.

This is fundamentally distinct from the log-pooling used in RLHF:

$$\pi_\theta(o \mid q) \propto \pi_\text{ref}(o \mid q) \cdot \exp\left(r(o \mid q)/\beta\right)$$

Here, aggregation in GRPO proceeds by adjusting the reference probabilities additively (shifted by the groupwise advantage, then rescaled) rather than multiplicatively through exponentiated rewards.

This approach de-emphasizes absolute reward values, focusing instead on relative ranking within a sampled batch. Shift (mean-subtraction) and scale (division by group standard deviation) normalization ensures invariance to reward translation and scaling, with optimization centering on relative, not absolute, reward structure.

2. Construction of the Group-Relative Advantage

Given a group of $G$ outputs (responses) $\{o_i\}$ sampled from the policy $\pi_\text{old}(\cdot \mid q)$, each is assigned an advantage

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

This normalization is essential for shifting the learning signal from raw reward magnitudes to ordinal relationships, that is, the position of each output relative to its peers.
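A minimal sketch of this normalization, using illustrative reward values (not taken from the paper):

```python
import numpy as np

# Rewards for a group of G = 4 sampled outputs (illustrative values only).
rewards = np.array([0.2, 0.9, 0.4, 0.9])

# Group-relative advantages: shift by the group mean, scale by the group std.
advantages = (rewards - rewards.mean()) / rewards.std()

print(advantages)        # ordinal signal: which outputs beat their peers
print(advantages.sum())  # mean-subtraction makes the advantages sum to ~0
```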

The aggregate reward preference is then an expectation under the candidate policy:

$$\mathcal{R}_G(\theta \mid q) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\left[\mathcal{P}_G(o \mid \pi_\text{old}(\cdot \mid q), q)\right]$$

For $G = 2$, this procedure yields pairwise comparison feedback, coinciding with the feedback structure in methods such as NLHF, but GRPO generalizes seamlessly to larger groups.

In the limit of large group size ($G \to \infty$), the per-sample advantage converges to the difference between the output's expected reward and the population mean, normalized by the population standard deviation, decoupling it from the variability of any particular sampled group.
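A small numerical check of this limit, under an assumed Gaussian reward distribution for the comparison group (the distribution and reward values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
r_fixed = 1.8                     # reward of the output whose advantage we track
peer_mean, peer_std = 1.0, 0.5    # assumed reward distribution of the peer outputs

for G in (2, 8, 64, 4096):
    peers = rng.normal(peer_mean, peer_std, size=G - 1)
    group = np.concatenate(([r_fixed], peers))
    advantage = (group - group.mean()) / group.std()
    print(G, round(advantage[0], 3))

# As G grows, advantage[0] settles near (1.8 - 1.0) / 0.5 = 1.6, no longer
# depending on which particular peers happened to be sampled.
```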

3. Regularization and Penalty Function: Reverse KL

GRPO employs a divergence penalty to limit deviation from a trusted reference policy, expressed as

$$D_i(\theta) = \frac{\pi_\text{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\left(\frac{\pi_\text{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)}\right) - 1$$

For small updates ($\pi_\text{old} \approx \pi_\theta$), the expected gradient of this penalty with respect to the policy reduces to

$$-\frac{\pi_\text{ref}(o \mid q)}{\pi_\theta(o \mid q)} + 1$$

This matches the gradient of the reverse KL divergence $\mathrm{KL}_\text{rev}(\pi_\text{ref} \parallel \pi_\theta)$, ensuring that the update discourages large deviations in the trajectory distribution and provides stability absent in pure reward maximization.
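This correspondence can be verified numerically for a toy softmax-parameterized categorical policy (the four-output space, logits, and reference distribution below are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy setup: four possible outputs, a reference policy pi_ref, and a current
# policy pi_theta parameterized by logits z.
rng = np.random.default_rng(0)
z = rng.normal(size=4)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(pi):
    # d pi(o) / d z_k for a softmax parameterization
    return np.diag(pi) - np.outer(pi, pi)

def expected_penalty_grad(z):
    """Gradient of E_{o ~ pi_old}[D(o)] w.r.t. z, with the sampling weights
    pi_old held fixed and evaluated at pi_old = pi_theta (small-update regime)."""
    pi = softmax(z)
    dD_dpi = -pi_ref / pi**2 + 1.0 / pi          # derivative of D(o) in pi_theta(o)
    return (pi * dD_dpi) @ softmax_jacobian(pi)  # weight each term by pi_old(o)

def reverse_kl_grad(z):
    """Gradient of KL(pi_ref || pi_theta) w.r.t. the logits z."""
    pi = softmax(z)
    return (-pi_ref / pi) @ softmax_jacobian(pi)

print(expected_penalty_grad(z))
print(reverse_kl_grad(z))   # identical: the "+1" term vanishes on the simplex
```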

Direct KL regularization was also considered, but its gradient properties (log-probability term) create different dynamics and can affect the uniqueness of the stationary policy.

4. Stationary Policy and Solution Characterization

The optimization objective for GRPO is

$$\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_q\left\{\mathbb{E}_{\{o_i\} \sim \pi_\text{old}}\left[\frac{1}{G}\sum_i \left(\tilde{A}_i(\theta) - \beta\, D_i(\theta)\right)\right]\right\}$$
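A minimal sketch of a Monte Carlo estimate of this objective for a single prompt, assuming $\tilde{A}_i(\theta)$ is realized as the importance-weighted advantage $(\pi_\theta/\pi_\text{old})\,A_i$ (a common surrogate choice; the clipping used in practical implementations is omitted) and that sequence-level log-probabilities are available:

```python
import numpy as np

def grpo_objective(logp_theta, logp_old, logp_ref, rewards, beta):
    """Estimate the per-prompt GRPO objective from G sampled outputs.

    logp_*: log pi_*(o_i | q) for the G outputs under the current, old,
    and reference policies; rewards: the raw rewards r_1, ..., r_G."""
    lt, lold, lref = (np.asarray(x, dtype=float)
                      for x in (logp_theta, logp_old, logp_ref))
    r = np.asarray(rewards, dtype=float)
    advantages = (r - r.mean()) / (r.std() + 1e-8)   # group-relative A_i
    ratio = np.exp(lt - lold)                        # pi_theta / pi_old
    ref_ratio = np.exp(lref - lt)                    # pi_ref / pi_theta
    penalty = ref_ratio - np.log(ref_ratio) - 1.0    # D_i(theta), cf. Section 3
    return np.mean(ratio * advantages - beta * penalty)

# Illustrative call with G = 2 sampled outputs (all numbers hypothetical):
print(grpo_objective(logp_theta=[-3.2, -4.1], logp_old=[-3.5, -4.0],
                     logp_ref=[-3.1, -4.3], rewards=[1.0, 0.0], beta=0.04))
```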

The Karush-Kuhn-Tucker (KKT) conditions for this problem yield explicit fixed-point equations:

$$\left[1 - \frac{\mathcal{P}_G(o \mid \pi_\theta, q) - \mathbb{E}_{o' \sim \pi_\theta}\left[\mathcal{P}_G(o' \mid \pi_\theta, q)\right]}{\beta}\right] \pi_\theta(o \mid q) = \pi_\text{ref}(o \mid q)$$

For binary choices, and for small groups, closed-form solutions emerge as functions of the confidence margin

$$\gamma_{a,b} = \mathcal{P}(a \succ b \mid q) - \mathcal{P}(b \succ a \mid q)$$

and the regularization strength $\beta$. These solutions explicate how the stationary policy interpolates between adhering to $\pi_\text{ref}$ and shifting toward outputs with positive relative advantage.

In the case of two outputs, the mechanism is directly analogous to pairwise preference learning.
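For intuition, the stationary condition can be iterated numerically in the two-output case. The sketch below assumes one natural $G = 2$ instantiation in which the group-relative preference of each output is the confidence margin $\gamma_{a,b}$ weighted by the probability of drawing the other output as its comparison partner; the values of $\gamma_{a,b}$ and $\beta$ are illustrative.

```python
import numpy as np

def grpo_fixed_point(pi_ref, gamma, beta, iters=500):
    """Iterate the stationary condition for two outputs {a, b}.

    gamma = P(a > b | q) - P(b > a | q) is the confidence margin; the
    group-relative preference of each output is gamma weighted by the
    probability of meeting the other output as its comparison sample."""
    pi = pi_ref.copy()
    for _ in range(iters):
        pref = np.array([pi[1] * gamma, -pi[0] * gamma])    # P_G(o | pi, q)
        centered = pref - pi @ pref                         # subtract E_{o' ~ pi}[P_G]
        denom = np.clip(1.0 - centered / beta, 1e-6, None)  # guard against small beta
        pi = pi_ref / denom                                 # fixed-point update
        pi = pi / pi.sum()                                  # project back onto the simplex
    return pi

pi_ref = np.array([0.5, 0.5])
for beta in (4.0, 1.0, 0.25):
    print(beta, grpo_fixed_point(pi_ref, gamma=0.6, beta=beta))
# Large beta keeps the policy near pi_ref; small beta shifts mass toward the
# preferred output a, illustrating the interpolation described above.
```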

5. Relationship to Variants and RLHF

Several natural modifications of GRPO clarify its placement among alignment algorithms:

  • Direct KL penalty: Substituting the reverse KL with the direct KL changes the aggregation to a softmax structure—closer to logarithmic pooling—as in RLHF, at the cost of losing uniqueness or stability in some cases.
  • Dropping scale normalization: Removing the division by the group standard deviation brings the update closer to classical RLHF objectives based on absolute rewards, but the key invariances are lost.
  • Pairwise and groupwise limits: For $G = 2$, GRPO's feedback is strictly pairwise, while for larger $G$ it smoothly transitions to a group-ranking regime.

The distinction is summarized in the aggregation function:

| Method | Aggregation mechanism | Invariance |
| --- | --- | --- |
| RLHF / log-pooling | Exponential (softmax) weighting of rewards | Sensitive to absolute reward scale |
| GRPO | Nonlinear scaled difference from the group mean | Invariant to reward translation and scaling |
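A compact numerical illustration of the invariance column, using arbitrary reward values and an affine rescaling (the particular numbers are hypothetical):

```python
import numpy as np

rewards = np.array([1.0, 2.0, 4.0])
rescaled = 10.0 * rewards + 3.0       # same rewards after translation and scaling
beta = 1.0

def grpo_advantages(r):
    # shift-and-scale normalization used by GRPO
    return (r - r.mean()) / r.std()

def rlhf_weights(r, beta):
    # exponential tilting of the reference policy, as in log-pooling / RLHF
    w = np.exp(r / beta)
    return w / w.sum()

print(grpo_advantages(rewards), grpo_advantages(rescaled))        # identical
print(rlhf_weights(rewards, beta), rlhf_weights(rescaled, beta))  # differ sharply
```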

This design enables GRPO to generalize pairwise preference optimization to arbitrary group sizes and to interpolate between pure imitation of the reference and exploitation of groupwise advantages.

6. Summary of Theoretical and Practical Implications

GRPO offers several conceptual and operational benefits for reinforcement learning from preference feedback:

  • Preference aggregation via relative ranking sharpens the model's incentive to outperform the other sampled outputs in its group, rather than to maximize absolute reward values.
  • Shift-and-scale normalization guarantees that learning is robust to reward transformation and consistently emphasizes relative improvement.
  • Penalty equivalence with the reverse KL confers regularization properties distinct from those in RLHF, potentially enhancing the stability and interpretability of the resulting policy.
  • Policy characterization by fixed-point equations allows precise theoretical analysis and, for certain classes of problems, explicit calculation of the optimal blend between reference-conformity and reward-directed learning.
  • Modifying the penalty to a direct KL or dropping scale normalization provides a direct pathway to recover, or interpolate toward, standard RLHF objectives, illustrating that GRPO encompasses several previously disparate preference-aggregation strategies.

These properties collectively distinguish GRPO from canonical RLHF and provide a framework for principled preference aggregation, especially in settings where ordinal feedback and invariance to reward scale are desirable. This characterization is key to understanding recent advances in LLM alignment and provides theoretical grounding for the observed empirical stability and data efficiency of GRPO-based frameworks (Vojnovic et al., 25 Feb 2025).
