GRPO Baselines in RL
- GRPO (Group Relative Policy Optimization) is a reinforcement learning framework whose groupwise baseline converts raw rewards into ordinal, group-relative advantages.
- The methodology shifts focus from absolute reward values to relative ranking among output samples, ensuring invariance to reward scaling and translation.
- Its design, incorporating reverse KL regularization, has been effectively applied in systems like DeepSeek-R1-Zero, enhancing stability and data efficiency.
Group Relative Policy Optimization (GRPO) is a methodology for preference-based policy optimization. It has been adopted in LLMs, multimodal reasoning, and generative models, with its alignment strategy underpinning systems such as DeepSeek-R1-Zero and DeepSeekMath. GRPO stands apart from classical reinforcement learning from human feedback (RLHF) by using a groupwise, shift-and-scale normalized "advantage" to aggregate preferences among output samples, combined with a reverse-KL regularization penalty that discourages divergence from a reference policy. Recent work (Vojnovic et al., 25 Feb 2025) clarifies the formalism, contrasts it with logarithmic pooling, and elucidates both the theoretical properties and practical implications of varying the aggregation and penalty mechanisms.
1. Core Principles and Preference Aggregation
GRPO is predicated on aggregating feedback preferences not by weighted geometric/logarithmic pooling (as in RLHF), but through normalized, group-relative scaling of reference policy probabilities. The stationary policy arising from GRPO satisfies the fixed-point equation

$$\pi(o \mid q) \;=\; \frac{\pi_{\mathrm{ref}}(o \mid q)}{1 - \frac{1}{\beta}\big(\mathcal{P}_{\pi}(o \mid q) - \lambda(q)\big)},$$

where $\lambda(q)$ is a normalization constant ensuring $\sum_o \pi(o \mid q) = 1$, $\beta > 0$ is the regularization strength, and $\mathcal{P}_{\pi}(o \mid q)$ encodes the group-relative preference: the expected normalized advantage of output $o$ when the remaining group members are drawn from $\pi$.
This is fundamentally distinct from the logarithmic pooling used in RLHF: aggregation in GRPO proceeds by adjusting the reference probabilities through the groupwise advantage (a shift inside the denominator above, followed by renormalization), rather than multiplicatively through exponentiated rewards.
This approach de-emphasizes absolute reward values, focusing instead on relative ranking within a sampled batch. Shift (mean-subtraction) and scale (division by group standard deviation) normalization ensures invariance to reward translation and scaling, with optimization centering on relative, not absolute, reward structure.
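To make this aggregation concrete, the following sketch (an illustration, not the algorithm of the cited work) iterates the fixed-point equation above on a toy discrete output space. It assumes $G = 2$, so the group-relative preference of an output is the expected sign of its reward gap against a $\pi$-sampled opponent; all function names are hypothetical:

```python
import numpy as np

def group_relative_preference(pi, rewards):
    """Expected G=2 advantage of each output: P(o) = E_{o' ~ pi}[sign(r(o) - r(o'))]."""
    sign = np.sign(rewards[:, None] - rewards[None, :])  # +1 where o out-ranks o', -1 where it loses
    return sign @ pi

def grpo_stationary_policy(pi_ref, rewards, beta=1.0, iters=500):
    """Iterate pi(o) = pi_ref(o) / (1 - (P(o) - lam) / beta), with lam fixed by normalization."""
    pi = pi_ref.copy()
    for _ in range(iters):
        pref = group_relative_preference(pi, rewards)
        lo, hi = pref.max() - beta + 1e-9, pref.max() + 1e3 * beta  # keep all denominators positive
        for _ in range(80):  # bisection on the normalization constant lam
            lam = 0.5 * (lo + hi)
            total = np.sum(pi_ref / (1.0 - (pref - lam) / beta))
            lo, hi = (lam, hi) if total > 1.0 else (lo, lam)
        new_pi = pi_ref / (1.0 - (pref - lam) / beta)
        pi = 0.5 * pi + 0.5 * new_pi / new_pi.sum()  # damped update, kept on the simplex
    return pi

pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.0, -1.0])
for beta in (10.0, 1.0, 0.2):
    print(beta, grpo_stationary_policy(pi_ref, rewards, beta=beta))
# Large beta stays close to pi_ref; small beta shifts mass toward higher-ranked outputs.
```

Because the preference enters through a shifted denominator rather than an exponentiated reward, only the ranking of outputs matters in this toy; rescaling the rewards leaves the result unchanged.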
2. Construction of the Group-Relative Advantage
Given a group of $G$ outputs (responses) $o_1, \dots, o_G$ sampled from policy $\pi_{\mathrm{old}}(\cdot \mid q)$, each is assigned an advantage

$$A_i \;=\; \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)},$$

where $r_i$ is the reward assigned to output $o_i$. This normalization is essential for shifting the learning signal from raw reward magnitudes to ordinal relationships: the position of each output relative to its peers.
The aggregate reward preference is then an expectation under the candidate policy:

$$\mathcal{P}_{\pi}(o \mid q) \;=\; \mathbb{E}_{o_2, \dots, o_G \sim \pi(\cdot \mid q)}\!\left[\, A_1 \mid o_1 = o \,\right].$$

For $G = 2$, this procedure yields pairwise comparison feedback, coinciding with the feedback structure in methods such as NLHF, but GRPO generalizes seamlessly to larger groups.
In the limit of large group size ($G \to \infty$), the per-sample advantage converges to the deviation of the output's reward from the policy's expected reward, normalized by the reward standard deviation under $\pi$, thereby decoupling from the variability of any particular sampled group.
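A minimal sketch of this shift-and-scale normalization, assuming scalar per-output rewards (the function name and the small epsilon guard are illustrative choices):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Shift-and-scale normalization of raw rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps guards against a group with identical rewards

# The signal depends only on relative standing within the group, not on the reward scale.
print(group_relative_advantages([1.0, 2.0, 4.0]))
print(group_relative_advantages([10.0, 20.0, 40.0]))  # same advantages after rescaling
```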
3. Regularization and Penalty Function: Reverse KL
GRPO employs a divergence penalty to limit deviation from a trusted reference policy, expressed as

$$\mathbb{D}(\pi \,\|\, \pi_{\mathrm{ref}}) \;=\; \mathbb{E}_{o \sim \pi(\cdot \mid q)}\!\left[\frac{\pi_{\mathrm{ref}}(o \mid q)}{\pi(o \mid q)} - \log\frac{\pi_{\mathrm{ref}}(o \mid q)}{\pi(o \mid q)} - 1\right].$$
For small updates ($\pi \approx \pi_{\mathrm{old}}$), the gradient reduces to

$$\nabla_{\pi(o \mid q)}\, \mathbb{D}(\pi \,\|\, \pi_{\mathrm{ref}}) \;\approx\; 1 - \frac{\pi_{\mathrm{ref}}(o \mid q)}{\pi(o \mid q)}.$$

Up to a constant absorbed by the normalization constraint, this matches the gradient of the reverse KL divergence, $D_{\mathrm{KL}}(\pi_{\mathrm{ref}} \,\|\, \pi)$, ensuring that the update discourages large deviations in the trajectory distribution and provides stability absent in pure reward maximization.
Direct KL regularization, $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})$, was also considered, but its gradient contains a log-probability term that creates different dynamics and can affect the uniqueness of the stationary policy.
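As a numerical sanity check (a sketch, not code from the cited work), the gradient of the GRPO penalty estimator at $\pi = \pi_{\mathrm{old}}$ can be compared with the gradient of the reverse KL $D_{\mathrm{KL}}(\pi_{\mathrm{ref}} \,\|\, \pi)$; on the probability simplex the two differ only by a constant offset per coordinate:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])       # current policy, playing the role of pi_old at the stationary point
pi_ref = np.array([0.4, 0.4, 0.2])   # reference policy

# Gradient of the GRPO penalty E_{o ~ pi_old}[pi_ref/pi - log(pi_ref/pi) - 1] w.r.t. pi(o), at pi = pi_old.
grad_grpo_penalty = 1.0 - pi_ref / pi

# Gradient of the reverse KL D_KL(pi_ref || pi) = sum_o pi_ref(o) log(pi_ref(o)/pi(o)) w.r.t. pi(o).
grad_reverse_kl = -pi_ref / pi

# The difference is the constant 1, which the simplex (normalization) constraint absorbs.
print(grad_grpo_penalty - grad_reverse_kl)  # -> [1. 1. 1.]
```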
4. Stationary Policy and Solution Characterization
The optimization objective for GRPO (omitting the trust-region clipping used in practice) is

$$\max_{\pi}\;\; \mathbb{E}_{q}\, \mathbb{E}_{o_1, \dots, o_G \sim \pi_{\mathrm{old}}(\cdot \mid q)}\!\left[\frac{1}{G} \sum_{i=1}^{G} \frac{\pi(o_i \mid q)}{\pi_{\mathrm{old}}(o_i \mid q)}\, A_i\right] \;-\; \beta\, \mathbb{E}_{q}\!\left[\mathbb{D}(\pi \,\|\, \pi_{\mathrm{ref}})\right].$$

The Karush-Kuhn-Tucker (KKT) conditions for this problem yield explicit fixed-point equations, recovering the stationary-policy characterization of Section 1:

$$\pi(o \mid q) \;=\; \frac{\pi_{\mathrm{ref}}(o \mid q)}{1 - \frac{1}{\beta}\big(\mathcal{P}_{\pi}(o \mid q) - \lambda(q)\big)}.$$
For binary choices, and more generally for small groups, closed-form solutions emerge as functions of the confidence margin between the outputs (how strongly one output is preferred over the other under the group-relative advantage) and the regularization strength $\beta$. These solutions make explicit how the stationary policy interpolates between adhering to $\pi_{\mathrm{ref}}$ and shifting toward outputs with positive relative advantage.
In the case of two outputs, the mechanism is directly analogous to pairwise preference learning.
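For intuition, here is an illustrative calculation for the two-output case under the fixed-point form above, assuming deterministic rewards so that the preferred output always receives advantage $+1$ against the other and $0$ against itself; the helper below is hypothetical:

```python
def two_output_stationary(pi_ref1, beta, tol=1e-10):
    """Solve the G = 2 fixed point for p = pi(o1 | q), where o1 always out-ranks o2.

    Group-relative preferences: P(o1) = 1 - p (o1 wins unless it meets itself),
    P(o2) = -p. The fixed point pi(o) = pi_ref(o) / (1 - (P(o) - lam)/beta)
    then reduces to a one-dimensional root-finding problem in p.
    """
    pi_ref2 = 1.0 - pi_ref1

    def residual(p):
        # Pick lam so the o1-equation holds exactly; then the o2 denominator is pi_ref1/p + 1/beta.
        denom_o2 = pi_ref1 / p + 1.0 / beta
        return pi_ref2 / denom_o2 - (1.0 - p)  # mismatch in the o2-equation

    lo, hi = 1e-9, 1.0 - 1e-9                  # residual is increasing in p, so bisection finds the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if residual(mid) > 0 else (mid, hi)
    return 0.5 * (lo + hi)

for beta in (10.0, 1.0, 0.1):
    print(beta, two_output_stationary(pi_ref1=0.5, beta=beta))
# As beta shrinks, pi(o1) moves from the reference value 0.5 toward 1 (the advantaged output).
```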
5. Relationship to Variants and RLHF
Several natural modifications of GRPO clarify its placement among alignment algorithms:
- Direct KL penalty: Substituting the reverse KL with the direct KL changes the aggregation to a softmax structure—closer to logarithmic pooling—as in RLHF, at the cost of losing uniqueness or stability in some cases.
- Drop of scale normalization: Removing normalization by standard deviation brings the update closer to classical RLHF objectives using absolute rewards, but the key invariances are lost.
- Pairwise and groupwise limits: For $G = 2$, GRPO's feedback is strictly pairwise, while for larger $G$ it smoothly transitions to a group-ranking regime.
The distinction is summarized in the aggregation function:
| Method | Aggregation Mechanism | Invariance |
|---|---|---|
| RLHF / log-pooling | Exponential softmax of rewards | Sensitive to absolute reward scale |
| GRPO | Nonlinear scaled difference from mean | Invariant to reward translation and scaling |
This design enables GRPO to generalize pairwise preference optimization to arbitrary group sizes and to interpolate between pure imitation of the reference and exploitation of groupwise advantages.
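The invariance column can be checked directly. The sketch below (illustrative, not drawn from the cited work) contrasts RLHF-style logarithmic pooling, $\pi(o) \propto \pi_{\mathrm{ref}}(o)\exp(r(o)/\beta)$, with GRPO's normalized advantages under an affine reward transformation:

```python
import numpy as np

def rlhf_logpool_policy(rewards, pi_ref, beta=1.0):
    """RLHF / logarithmic pooling: pi(o) proportional to pi_ref(o) * exp(r(o) / beta)."""
    w = pi_ref * np.exp(np.asarray(rewards, dtype=float) / beta)
    return w / w.sum()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO's aggregation signal: shift-and-scale normalized (group-relative) advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

pi_ref = np.array([0.4, 0.4, 0.2])
r = np.array([1.0, 2.0, 3.0])
r_affine = 10.0 * r + 5.0  # the same ranking, rescaled and shifted

print(rlhf_logpool_policy(r, pi_ref), rlhf_logpool_policy(r_affine, pi_ref))  # changes sharply
print(grpo_advantages(r), grpo_advantages(r_affine))                          # identical signals
```

The log-pooled policy concentrates almost all mass on the top output once rewards are rescaled, whereas the group-relative advantages are unchanged by the affine transformation.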
6. Summary of Theoretical and Practical Implications
GRPO offers several conceptual and operational benefits for reinforcement learning from preference feedback:
- Preference aggregation via relative ranking sharpens the model's incentive to outperform the other outputs in its sampled group, rather than to maximize absolute metrics.
- Shift-and-scale normalization guarantees that learning is robust to reward transformation and consistently emphasizes relative improvement.
- The penalty's gradient-level equivalence with the reverse KL confers regularization properties distinct from those in RLHF, potentially enhancing the stability and interpretability of the policy.
- Policy characterization by fixed-point equations allows precise theoretical analysis and, for certain classes of problems, explicit calculation of the optimal blend between reference-conformity and reward-directed learning.
- Modification to direct KL or dropping normalization provides a direct pathway to interpolate or recover standard RLHF objectives, illustrating that GRPO encompasses several previously disparate preference aggregation strategies.
These properties collectively distinguish GRPO from canonical RLHF and provide a framework for principled preference aggregation, especially in settings where ordinal feedback and invariance to reward scale are desirable. This characterization is key to understanding recent advances in LLM alignment and provides theoretical grounding for the observed empirical stability and data efficiency of GRPO-based frameworks (Vojnovic et al., 25 Feb 2025).