Group Relative Policy Optimization (GRPO) in Reinforcement Learning

Last updated: June 10, 2025

This article synthesizes key insights from "What is the Alignment Objective of GRPO?" (Vojnovic et al., 25 Feb 2025).


Group Relative Policy Optimization (GRPO) is a reinforcement learning approach designed to aggregate preferences and align advanced AI models (such as DeepSeek-R1-Zero and DeepSeekMath) with desirable behaviors. GRPO's algorithmic architecture and the resulting stationary policies differ substantially from standard methods used in RLHF (Reinforcement Learning from Human Feedback), introducing unique mathematical considerations and practical implementation strategies.

GRPO’s Preference Aggregation Mechanism

GRPO aggregates preferences not by simply maximizing the likelihood of high-reward actions, but by evaluating outputs with respect to their relative standing within a sampled group under a fixed context. Given a context $q$:

  1. Sample outputs: Generate $G$ outputs $(o_1,\ldots,o_G)$ from the current policy.
  2. Reward assignment: Compute rewards $r_i$ for each output using a reward or preference model.
  3. Shift-and-scale normalization: Normalize rewards within the group (not globally):

$$A_i = \frac{r_i - \mathrm{mean}(r_1,\ldots,r_G)}{\mathrm{std}(r_1,\ldots,r_G)}$$

These normalized advantages reflect how much better an output is compared to its local peer group, mitigating sensitivity to reward scale and absolute calibration.

  4. Policy update: The new policy $\pi_\theta$ is trained to increase the log-probability of outputs with higher group-normalized advantages while penalizing deviations from a reference policy (a minimal code sketch of these steps follows below).
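As a concrete illustration of steps 1-4, here is a minimal NumPy sketch of the shift-and-scale normalization; the function name and the small `eps` guard for zero within-group variance are illustrative choices rather than details from the paper.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Shift-and-scale rewards within one sampled group (illustrative sketch)."""
    r = np.asarray(rewards, dtype=float)
    # eps guards the degenerate case where all rewards in the group are equal
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 4 sampled outputs for one context q
advantages = group_normalized_advantages([0.2, 1.0, 0.7, 0.1])
print(advantages)  # positive for above-group-average outputs, negative otherwise
# A policy-gradient-style update would then weight each output's log-probability
# by its advantage, with the reference-policy penalty added (see below).
```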

Objective Function

The expected GRPO objective (over all contexts and groups) has the form:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q, \{o_i\}}\left[ \frac{1}{G}\sum_{i=1}^G \left(\tilde{A}_i(\theta) - \beta D_i(\theta)\right) \right]$$

where $\tilde{A}_i(\theta)$ is the term driven by the group-normalized advantage $A_i$ and $D_i(\theta)$ is the reference penalty defined below.

Unlike RLHF's logarithmic pooling, which yields a stationary policy

$$\pi_\theta(o|q) \propto \pi_{ref}(o|q) \exp\left( \frac{1}{\beta} r(o|q) \right),$$

GRPO aggregates preferences through a more complex, nonlinear transformation of the group-relative preference.

Stationary Policy Characterization

A stationary policy under repeated GRPO can be characterized implicitly by a fixed-point equation (see the KKT analysis in the original paper):

$$\left( 1 - \frac{ \mathcal{P}_G(o \mid \pi_\theta(\cdot|q), q) - \mathbb{E}_{o'}[\mathcal{P}_G(o' \mid \pi_\theta(\cdot|q), q)] }{\beta} \right) \pi_\theta(o|q) = \pi_{ref}(o|q)$$

Here, $\mathcal{P}_G(o \mid \pi_\theta, q)$ denotes the expected group-relative normalized advantage, and $\beta$ sets the strength of regularization toward the reference. The solution deviates from simple exponentiation or logarithmic pooling and instead involves a rational transformation, which increases local policy mass for outputs with above-average relative advantage.
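To make the fixed point concrete, the following toy sketch iterates the equation above over a small discrete output space. It treats $\mathcal{P}_G$ as a fixed vector of preference scores, which is a simplification (in general the group-relative preference depends on the current policy), and all numerical values are hypothetical.

```python
import numpy as np

# Hypothetical setup: three candidate outputs for a fixed context q.
P_G    = np.array([0.8, 0.5, 0.2])   # assumed group-relative preference scores
pi_ref = np.array([0.2, 0.5, 0.3])   # assumed reference policy
beta   = 2.0                         # regularization strength

pi = pi_ref.copy()
for _ in range(200):
    m = np.dot(P_G, pi)                          # E_{o' ~ pi}[P_G(o')]
    new_pi = pi_ref / (1.0 - (P_G - m) / beta)   # rational reweighting from the fixed point
    pi = new_pi / new_pi.sum()                   # renormalize while iterating

m = np.dot(P_G, pi)
print("stationary policy:", pi)
# Residual of the fixed-point equation (should be ~0 at convergence):
print("residual:", (1.0 - (P_G - m) / beta) * pi - pi_ref)
```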

Role and Form of the Penalty Function

The penalty in the GRPO objective,

$$D_i(\theta) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1,$$

serves as an unbiased estimator for the KL divergence (when $\pi_\theta = \pi_{\theta_{old}}$), but crucially, its gradient matches that of the reverse KL divergence:

$$\frac{\partial}{\partial \pi_\theta} \text{KL}_\mathrm{rev}(\pi_\theta\|\pi_{ref}) = -\frac{\pi_{ref}(o|q)}{\pi_\theta(o|q)}.$$

Reverse KL regularization imparts a mode-seeking property, discouraging the policy from assigning probability to outputs rarely chosen by the reference policy, thereby focusing policy mass on a few high-quality modes.
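A minimal sketch combining the penalty with the per-group objective from the section above is given below, assuming per-output log-probabilities are available. The unclipped importance-ratio surrogate used for $\tilde{A}_i(\theta)$, the value $\beta = 0.04$, and all inputs are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_penalty(logp_theta, logp_ref):
    # D_i = pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, from log-probabilities
    diff = logp_ref - logp_theta
    return np.exp(diff) - diff - 1.0

def grpo_group_objective(logp_theta, logp_old, logp_ref, rewards, beta=0.04):
    """Unclipped surrogate: (1/G) * sum_i [ (pi_theta/pi_old)(o_i) * A_i - beta * D_i ]."""
    logp_theta = np.asarray(logp_theta)
    A = group_normalized_advantages(rewards)
    ratio = np.exp(logp_theta - np.asarray(logp_old))   # importance ratio (no clipping)
    D = grpo_penalty(logp_theta, np.asarray(logp_ref))
    return np.mean(ratio * A - beta * D)

# Toy usage with hypothetical log-probabilities for G = 4 sampled outputs
logp_theta = np.log([0.30, 0.25, 0.25, 0.20])
logp_old   = np.log([0.28, 0.27, 0.25, 0.20])
logp_ref   = np.log([0.25, 0.25, 0.25, 0.25])
print(grpo_group_objective(logp_theta, logp_old, logp_ref, [0.2, 1.0, 0.7, 0.1]))
```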

Connection to Pairwise Comparison and Binary Feedback

For groups of size two ($G=2$), GRPO's normalized advantage simplifies to

$$A_i = \begin{cases} 1 & \text{if } r_i > r_j \\ -1 & \text{if } r_i < r_j \\ 0 & \text{if } r_i = r_j \end{cases}$$

This makes GRPO's reward function exactly equivalent to pairwise comparison (the core of many RLHF and direct preference optimization methods), and the update mechanism is mathematically aligned with optimizing for correct pairwise outcomes.
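A quick numerical check of this reduction, reusing the illustrative advantage helper from the earlier sketch (the `eps` guard maps the tied case to zero):

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# For G = 2, shift-and-scale normalization collapses to a pairwise sign
print(group_normalized_advantages([0.9, 0.4]))  # ~[ 1., -1.]
print(group_normalized_advantages([0.4, 0.9]))  # ~[-1.,  1.]
print(group_normalized_advantages([0.5, 0.5]))  # [0., 0.]: the eps guard handles the tie
```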

Explicit Characterization for Binary Questions

For binary outputs, let $a$ and $b$ be possible answers with confidence margin $\gamma_{a,b} = \mathcal{P}(a \succ b) - \mathcal{P}(b \succ a)$. The stationary probability for $a$ is:

$$\pi_\theta(a|q) = \frac{1}{2} \left( 1 - \frac{\beta}{\gamma_{a,b}} + \sqrt{ \left(1 - \frac{\beta}{\gamma_{a,b}}\right)^2 + 4\,\frac{\beta}{\gamma_{a,b}}\,\pi_{ref}(a|q) } \right)$$

  • As $\beta \to 0$, reward dominates, and the policy will select the highest-confidence answer.
  • For large $\beta$, the reference policy dominates.

This rigorously quantifies how regularization and preference margins interact to determine the stationary policy's behavior.
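A quick numerical check of the closed form above, with a hypothetical reference probability and confidence margin:

```python
import numpy as np

def stationary_prob_binary(pi_ref_a, gamma, beta):
    """Closed-form stationary probability of answer a in the binary case."""
    t = beta / gamma
    return 0.5 * (1.0 - t + np.sqrt((1.0 - t) ** 2 + 4.0 * t * pi_ref_a))

pi_ref_a, gamma = 0.3, 0.4   # hypothetical reference probability and confidence margin
for beta in (1e-4, 0.1, 1.0, 100.0):
    print(beta, stationary_prob_binary(pi_ref_a, gamma, beta))
# Small beta drives the probability toward 1 (reward-dominated);
# large beta returns it to pi_ref_a = 0.3 (reference-dominated).
```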

GRPO Variants and Practical Implications

  • Direct KL Penalty: Replacing GRPO’s reverse KL penalty with a standard KL penalty recovers RLHF’s logarithmic pooling:

$$\pi_\theta(o|q) \propto \pi_{ref}(o|q) \exp\left( \frac{ \mathcal{P}_G(o \mid \pi_\theta, q)}{\beta} \right)$$

  • Reward Normalization: Using rewards without scaling (shift only, or none) removes invariance to reward scale, making absolute reward values matter, a property often exploited (intentionally or not) in practical reward design.

Summary Table: Aggregation Mechanisms

| Algorithm | Reference penalty | Reward normalization | Aggregation formula |
|---|---|---|---|
| RLHF / log pooling | Direct KL | None or shift | $\propto \pi_{ref} \cdot \exp\left( \frac{r}{\beta} \right)$ |
| GRPO | Reverse KL | Shift & scale | $\propto \pi_{ref} \cdot g\left( \frac{ \mathcal{P}_G - \mathbb{E}[\mathcal{P}_G] }{\beta} \right)$ |
| GRPO (mod.) | Direct KL | Shift & scale | $\propto \pi_{ref} \cdot \exp\left( \frac{ \mathcal{P}_G }{ \beta } \right)$ |

Here, $g(x) = 1/(1-x)$, and $\mathcal{P}_G$ is the group-relative preference as previously defined.
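The difference between the two aggregation maps can be seen on a toy example. The preference scores, the reference policy, and the use of the reference expectation as a stand-in for $\mathbb{E}[\mathcal{P}_G]$ are all illustrative simplifications:

```python
import numpy as np

P_G    = np.array([0.8, 0.5, 0.2])   # hypothetical group-relative preferences
pi_ref = np.array([0.2, 0.5, 0.3])   # hypothetical reference policy
beta   = 2.0

# Stand-in for E[P_G]: expectation under the reference policy (a simplification;
# at the true fixed point the expectation is taken under pi_theta itself).
centered = (P_G - np.dot(P_G, pi_ref)) / beta

grpo = pi_ref / (1.0 - centered)      # rational transform g(x) = 1/(1 - x)
grpo /= grpo.sum()
rlhf = pi_ref * np.exp(P_G / beta)    # exponential tilting (log pooling)
rlhf /= rlhf.sum()

print("GRPO-style aggregation:", grpo)
print("RLHF-style aggregation:", rlhf)
# Both shift mass toward higher-preference outputs; the rational transform
# reweights increasingly sharply as the centered advantage approaches beta.
```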

Practical Considerations for Implementation

  • Mode Seeking vs. Mean Seeking: Reverse KL (GRPO) policy updates concentrate probability on peaked behaviors and reduce diversity (mode-seeking), while direct KL (RLHF) spreads probability (mean-seeking). This affects exploration/exploitation and the expected safety or risk behavior of the trained model; a small illustration follows after this list.
  • Reward Normalization: Group shift-and-scale enhances robustness to reward scaling issues and mitigates reward hacking to a degree, but selecting $\beta$ and other parameters is critical to strike a balance between adhering to the reference and maximizing reward.
  • Pairwise/Group Feedback: For $G=2$, GRPO's groupwise preference update is robust to reward model calibration, matching the human paradigm for comparisons.
  • Algorithmic Flexibility: GRPO’s structure admits straightforward modifications to the penalty or normalization schema, enabling practitioners to fine-tune alignment behavior for application needs.
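The mode-seeking versus mean-seeking contrast is a general property of reverse versus forward KL. The sketch below illustrates it on a generic toy problem (fitting a single-mode family to a bimodal target by grid search), not on the GRPO update itself; every numerical choice is arbitrary.

```python
import numpy as np

# Toy illustration of mode-seeking (reverse KL) vs. mean-seeking (forward KL).
xs = np.linspace(-4, 4, 41)

def normal_pmf(mu, sigma):
    # Discretized, normalized Gaussian over the fixed grid xs
    w = np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
    return w / w.sum()

# Bimodal target distribution with two well-separated peaks
target = 0.5 * normal_pmf(-2.0, 0.4) + 0.5 * normal_pmf(2.0, 0.4)

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

candidates = [(mu, sigma) for mu in np.linspace(-3, 3, 61)
                          for sigma in np.linspace(0.2, 3.0, 57)]
best_forward = min(candidates, key=lambda ps: kl(target, normal_pmf(*ps)))
best_reverse = min(candidates, key=lambda ps: kl(normal_pmf(*ps), target))

print("forward KL fit (mean-seeking):", best_forward)  # wide, centered between the modes
print("reverse KL fit (mode-seeking):", best_reverse)  # narrow, locked onto one mode
```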

Conclusion

The alignment objective of GRPO is fundamentally distinct from RLHF's log-pooling framework. By leveraging group-normalized, relative preference aggregation and reverse KL regularization, GRPO induces a unique, controllable stationary distribution favoring high-reward outputs while maintaining reference adherence. The framework's mathematical transparency affords precise parameter control, robust behavior under binary or groupwise feedback, and clear guidance for aligning advanced AI models in practice.

References:

Vojnovic et al., "What is the Alignment Objective of GRPO?", 25 Feb 2025.