
GRPO Baselines in RL

Updated 7 October 2025
  • GRPO Baselines are a reinforcement learning framework that uses group-relative normalization to convert raw rewards into ordinal advantages.
  • The methodology shifts focus from absolute reward values to relative ranking among output samples, ensuring invariance to reward scaling and translation.
  • Its design, incorporating reverse KL regularization, has been effectively applied in systems like DeepSeek-R1-Zero, enhancing stability and data efficiency.

Group Relative Policy Optimization (GRPO) is a reinforcement learning methodology for preference-based policy optimization. It has been adopted in LLMs, multimodal reasoning, and generative models, with its alignment strategy underpinning systems such as DeepSeek-R1-Zero and DeepSeekMath. GRPO stands apart from classical reinforcement learning from human feedback (RLHF) by using a groupwise, shift-and-scale normalized "advantage" to aggregate preferences among output samples, combined with a reverse-KL regularization penalty that discourages divergence from a reference policy. Recent work (Vojnovic et al., 25 Feb 2025) clarifies the formalism, contrasts it with logarithmic pooling, and elucidates both the theoretical properties and practical implications of varying the aggregation and penalty mechanisms.

1. Core Principles and Preference Aggregation

GRPO is predicated on aggregating feedback preferences not by weighted geometric/logarithmic pooling (as in RLHF), but through normalized, group-relative scaling of reference policy probabilities. The stationary policy arising from GRPO satisfies the fixed-point equation

$$\pi_\theta(o \mid q) = g\left(\frac{\mathcal{P}_G(o \mid \pi_\text{old}(\cdot \mid q), q) - \mathbb{E}_{o' \sim \pi_\theta}\left[\mathcal{P}_G(o' \mid \pi_\text{old}(\cdot \mid q), q)\right]}{\beta}\right) \cdot \pi_\text{ref}(o \mid q)$$

where $g(x) = \frac{1}{1-x}$ and $\mathcal{P}_G(o \mid \cdot)$ encodes the group-relative preference.

This is fundamentally distinct from the log-pooling used in RLHF:

$$\pi_\theta(o \mid q) \propto \pi_\text{ref}(o \mid q) \cdot \exp\left(r(o \mid q)/\beta\right)$$

Here, aggregation in GRPO proceeds by adjusting the reference probabilities additively (shifted by the groupwise advantage, then rescaled) rather than multiplicatively through exponentiated rewards.

This approach de-emphasizes absolute reward values, focusing instead on relative ranking within a sampled batch. Shift (mean-subtraction) and scale (division by group standard deviation) normalization ensures invariance to reward translation and scaling, with optimization centering on relative, not absolute, reward structure.

2. Construction of the Group-Relative Advantage

Given a group of $G$ outputs (responses) $\{o_i\}$ sampled from the policy $\pi_\text{old}(\cdot \mid q)$, each is assigned an advantage

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

This normalization is essential for shifting the learning signal from raw reward magnitudes to ordinal relationships, that is, the position of each output relative to its peers.
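A minimal sketch of this normalization, using illustrative reward values (not taken from the paper):

```python
import numpy as np

# Rewards for a group of G = 4 sampled outputs (illustrative values only).
rewards = np.array([0.2, 0.9, 0.4, 0.9])

# Group-relative advantages: shift by the group mean, scale by the group std.
advantages = (rewards - rewards.mean()) / rewards.std()

print(advantages)        # ordinal signal: which outputs beat their peers
print(advantages.sum())  # mean-subtraction makes the advantages sum to ~0
```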

The aggregate reward preference is then an expectation under the candidate policy:

$$\mathcal{R}_G(\theta \mid q) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\left[\mathcal{P}_G(o \mid \pi_\text{old}(\cdot \mid q), q)\right]$$

For $G = 2$, this procedure yields pairwise comparison feedback, coinciding with the feedback structure in methods such as NLHF, but GRPO generalizes seamlessly to larger groups.

In the limit of large group size ($G \to \infty$), the per-sample advantage converges to the difference between the output's expected reward and the population mean, normalized by the population standard deviation, decoupling it from the variability of any particular sampled group.
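A small numerical check of this limit, under an assumed Gaussian reward distribution for the comparison group (the distribution and reward values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
r_fixed = 1.8                     # reward of the output whose advantage we track
peer_mean, peer_std = 1.0, 0.5    # assumed reward distribution of the peer outputs

for G in (2, 8, 64, 4096):
    peers = rng.normal(peer_mean, peer_std, size=G - 1)
    group = np.concatenate(([r_fixed], peers))
    advantage = (group - group.mean()) / group.std()
    print(G, round(advantage[0], 3))

# As G grows, advantage[0] settles near (1.8 - 1.0) / 0.5 = 1.6, no longer
# depending on which particular peers happened to be sampled.
```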

3. Regularization and Penalty Function: Reverse KL

GRPO employs a divergence penalty to limit deviation from a trusted reference policy, expressed as

$$D_i(\theta) = \frac{\pi_\text{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\left(\frac{\pi_\text{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)}\right) - 1$$

For small updates ($\pi_\text{old} \approx \pi_\theta$), the expected gradient of this penalty with respect to the policy reduces to

$$-\frac{\pi_\text{ref}(o \mid q)}{\pi_\theta(o \mid q)} + 1$$

This matches the gradient of the reverse KL divergence $\mathrm{KL}_\text{rev}(\pi_\text{ref} \parallel \pi_\theta)$, ensuring that the update discourages large deviations in the trajectory distribution and provides stability absent in pure reward maximization.
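This correspondence can be verified numerically for a toy softmax-parameterized categorical policy (the four-output space, logits, and reference distribution below are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy setup: four possible outputs, a reference policy pi_ref, and a current
# policy pi_theta parameterized by logits z.
rng = np.random.default_rng(0)
z = rng.normal(size=4)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(pi):
    # d pi(o) / d z_k for a softmax parameterization
    return np.diag(pi) - np.outer(pi, pi)

def expected_penalty_grad(z):
    """Gradient of E_{o ~ pi_old}[D(o)] w.r.t. z, with the sampling weights
    pi_old held fixed and evaluated at pi_old = pi_theta (small-update regime)."""
    pi = softmax(z)
    dD_dpi = -pi_ref / pi**2 + 1.0 / pi          # derivative of D(o) in pi_theta(o)
    return (pi * dD_dpi) @ softmax_jacobian(pi)  # weight each term by pi_old(o)

def reverse_kl_grad(z):
    """Gradient of KL(pi_ref || pi_theta) w.r.t. the logits z."""
    pi = softmax(z)
    return (-pi_ref / pi) @ softmax_jacobian(pi)

print(expected_penalty_grad(z))
print(reverse_kl_grad(z))   # identical: the "+1" term vanishes on the simplex
```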

Direct KL regularization was also considered, but its gradient properties (log-probability term) create different dynamics and can affect the uniqueness of the stationary policy.

4. Stationary Policy and Solution Characterization

The optimization objective for GRPO is

$$\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_q\left\{\mathbb{E}_{\{o_i\} \sim \pi_\text{old}}\left[\frac{1}{G}\sum_i \left(\tilde{A}_i(\theta) - \beta\, D_i(\theta)\right)\right]\right\}$$
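A minimal sketch of a Monte Carlo estimate of this objective for a single prompt, assuming $\tilde{A}_i(\theta)$ is realized as the importance-weighted advantage $(\pi_\theta/\pi_\text{old})\,A_i$ (a common surrogate choice; the clipping used in practical implementations is omitted) and that sequence-level log-probabilities are available:

```python
import numpy as np

def grpo_objective(logp_theta, logp_old, logp_ref, rewards, beta):
    """Estimate the per-prompt GRPO objective from G sampled outputs.

    logp_*: log pi_*(o_i | q) for the G outputs under the current, old,
    and reference policies; rewards: the raw rewards r_1, ..., r_G."""
    lt, lold, lref = (np.asarray(x, dtype=float)
                      for x in (logp_theta, logp_old, logp_ref))
    r = np.asarray(rewards, dtype=float)
    advantages = (r - r.mean()) / (r.std() + 1e-8)   # group-relative A_i
    ratio = np.exp(lt - lold)                        # pi_theta / pi_old
    ref_ratio = np.exp(lref - lt)                    # pi_ref / pi_theta
    penalty = ref_ratio - np.log(ref_ratio) - 1.0    # D_i(theta), cf. Section 3
    return np.mean(ratio * advantages - beta * penalty)

# Illustrative call with G = 2 sampled outputs (all numbers hypothetical):
print(grpo_objective(logp_theta=[-3.2, -4.1], logp_old=[-3.5, -4.0],
                     logp_ref=[-3.1, -4.3], rewards=[1.0, 0.0], beta=0.04))
```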

The Karush-Kuhn-Tucker (KKT) conditions for this problem yield explicit fixed-point equations:

$$\left[1 - \frac{\mathcal{P}_G(o \mid \pi_\theta, q) - \mathbb{E}_{o' \sim \pi_\theta}\left[\mathcal{P}_G(o' \mid \pi_\theta, q)\right]}{\beta}\right] \pi_\theta(o \mid q) = \pi_\text{ref}(o \mid q)$$

For binary choices, and for small groups, closed-form solutions emerge as functions of the confidence margin

$$\gamma_{a,b} = \mathcal{P}(a \succ b \mid q) - \mathcal{P}(b \succ a \mid q)$$

and the regularization strength $\beta$. These solutions explicate how the stationary policy interpolates between adhering to $\pi_\text{ref}$ and shifting toward outputs with positive relative advantage.

In the case of two outputs, the mechanism is directly analogous to pairwise preference learning.
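For intuition, the stationary condition can be iterated numerically in the two-output case. The sketch below assumes one natural $G = 2$ instantiation in which the group-relative preference of each output is the confidence margin $\gamma_{a,b}$ weighted by the probability of drawing the other output as its comparison partner; the values of $\gamma_{a,b}$ and $\beta$ are illustrative.

```python
import numpy as np

def grpo_fixed_point(pi_ref, gamma, beta, iters=500):
    """Iterate the stationary condition for two outputs {a, b}.

    gamma = P(a > b | q) - P(b > a | q) is the confidence margin; the
    group-relative preference of each output is gamma weighted by the
    probability of meeting the other output as its comparison sample."""
    pi = pi_ref.copy()
    for _ in range(iters):
        pref = np.array([pi[1] * gamma, -pi[0] * gamma])    # P_G(o | pi, q)
        centered = pref - pi @ pref                         # subtract E_{o' ~ pi}[P_G]
        denom = np.clip(1.0 - centered / beta, 1e-6, None)  # guard against small beta
        pi = pi_ref / denom                                 # fixed-point update
        pi = pi / pi.sum()                                  # project back onto the simplex
    return pi

pi_ref = np.array([0.5, 0.5])
for beta in (4.0, 1.0, 0.25):
    print(beta, grpo_fixed_point(pi_ref, gamma=0.6, beta=beta))
# Large beta keeps the policy near pi_ref; small beta shifts mass toward the
# preferred output a, illustrating the interpolation described above.
```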

5. Relationship to Variants and RLHF

Several natural modifications of GRPO clarify its placement among alignment algorithms:

  • Direct KL penalty: Substituting the reverse KL with the direct KL changes the aggregation to a softmax structure—closer to logarithmic pooling—as in RLHF, at the cost of losing uniqueness or stability in some cases.
  • Dropping scale normalization: Removing the division by the group standard deviation brings the update closer to classical RLHF objectives based on absolute rewards, but the key invariances are lost.
  • Pairwise and groupwise limits: For $G = 2$, GRPO's feedback is strictly pairwise, while for larger $G$ it smoothly transitions to a group-ranking regime.

The distinction is summarized in the aggregation function:

| Method | Aggregation mechanism | Invariance |
| --- | --- | --- |
| RLHF / log-pooling | Exponential (softmax) weighting of rewards | Sensitive to absolute reward scale |
| GRPO | Nonlinear scaled difference from the group mean | Invariant to reward translation and scaling |
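A compact numerical illustration of the invariance column, using arbitrary reward values and an affine rescaling (the particular numbers are hypothetical):

```python
import numpy as np

rewards = np.array([1.0, 2.0, 4.0])
rescaled = 10.0 * rewards + 3.0       # same rewards after translation and scaling
beta = 1.0

def grpo_advantages(r):
    # shift-and-scale normalization used by GRPO
    return (r - r.mean()) / r.std()

def rlhf_weights(r, beta):
    # exponential tilting of the reference policy, as in log-pooling / RLHF
    w = np.exp(r / beta)
    return w / w.sum()

print(grpo_advantages(rewards), grpo_advantages(rescaled))        # identical
print(rlhf_weights(rewards, beta), rlhf_weights(rescaled, beta))  # differ sharply
```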

This design enables GRPO to generalize pairwise preference optimization to arbitrary group sizes and to interpolate between pure imitation of the reference and exploitation of groupwise advantages.

6. Summary of Theoretical and Practical Implications

GRPO offers several conceptual and operational benefits for reinforcement learning from preference feedback:

  • Preference aggregation via relative ranking sharpens the model's incentive to outperform the other sampled outputs in its group, rather than to maximize absolute reward values.
  • Shift-and-scale normalization guarantees that learning is robust to reward transformation and consistently emphasizes relative improvement.
  • Penalty equivalence with the reverse KL confers regularization properties distinct from those in RLHF, potentially enhancing the stability and interpretability of the resulting policy.
  • Policy characterization by fixed-point equations allows precise theoretical analysis and, for certain classes of problems, explicit calculation of the optimal blend between reference-conformity and reward-directed learning.
  • Modifying the penalty to a direct KL or dropping scale normalization provides a direct pathway to recover, or interpolate toward, standard RLHF objectives, illustrating that GRPO encompasses several previously disparate preference-aggregation strategies.

These properties collectively distinguish GRPO from canonical RLHF and provide a framework for principled preference aggregation, especially in settings where ordinal feedback and invariance to reward scale are desirable. This characterization is key to understanding recent advances in LLM alignment and provides theoretical grounding for the observed empirical stability and data efficiency of GRPO-based frameworks (Vojnovic et al., 25 Feb 2025).
