
Group Relative Policy Optimization (GRPO)

Updated 22 June 2025

Group Relative Policy Optimization (GRPO) is a reinforcement learning framework designed to align AI models—particularly LLMs—with complex objectives such as mathematical reasoning, language safety, and human preferences. GRPO plays a central role in post-training pipelines for advanced models like DeepSeek-R1-Zero and DeepSeekMath, offering a robust alternative to standard RLHF alignment by leveraging group-based normalization, a distinct preference aggregation strategy, and a reverse-KL regularization term. The foundational characteristics, mathematical structure, and implications of GRPO are summarized as follows.

1. Principles of Preference Aggregation in GRPO

GRPO aggregates preferences over groups of sampled outputs in a notably different manner from the logarithmic pooling of standard RLHF. Consider a context $q$ (e.g., a prompt or input) with candidate outputs $o \in \mathcal{O}$.

In RLHF, the optimal aligned policy is usually formed via logarithmic pooling:

$$\pi_\theta(o \mid q) = \frac{1}{Z_q}\, \pi_{\text{ref}}(o \mid q)\, \exp\!\left(\frac{1}{\beta}\, r_\phi(o \mid q)\right)$$

where $r_\phi(o \mid q)$ is the reward and $\pi_{\text{ref}}$ is a reference policy.

In contrast, GRPO constructs the aligned policy by scaling the reference probability via a nonlinear offset based on group-relative normalized advantages:

$$\pi_\theta(o \mid q) = g\!\left(\frac{\mathcal{P}_G(o \mid \pi_\theta(\cdot \mid q), q) - \mathbb{E}_{o'}\!\left[\mathcal{P}_G(o' \mid \pi_\theta(\cdot \mid q), q)\right]}{\beta}\right) \pi_{\text{ref}}(o \mid q)$$

where $g(x) = \frac{1}{1-x}$, $\mathcal{P}_G$ encapsulates the group-relative advantage, $\beta$ is the regularization parameter, and the expectation is over $o' \sim \pi_\theta(\cdot \mid q)$.

Key distinction: Instead of the exponential scaling of rewards typical of RLHF, GRPO applies a nonlinear rescaling based on centered, group-normalized advantages. This makes the update invariant to reward scale and translation, focusing optimization on relative merit within a sample group.
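
To make the contrast concrete, here is a minimal numerical sketch (Python/NumPy, with made-up reference probabilities and advantages) comparing the exponential tilt of logarithmic pooling with GRPO's $g(x) = 1/(1-x)$ rescaling on a three-output toy example. Because the GRPO expression above is a fixed-point characterization (the advantage depends on $\pi_\theta$ itself), the one-shot computation below is qualitative only.

```python
import numpy as np

# Toy comparison of the two aggregation rules (illustrative numbers only).
pi_ref = np.array([0.5, 0.3, 0.2])       # reference probabilities over 3 outputs
advantage = np.array([0.5, 0.0, -0.5])   # centered, group-normalized advantages
beta = 1.0                               # regularization strength

# RLHF / logarithmic pooling: exponential scaling, then renormalize.
rlhf = pi_ref * np.exp(advantage / beta)
rlhf /= rlhf.sum()

# GRPO-style rescaling: g(x) = 1/(1 - x) applied to advantage / beta
# (valid here because advantage / beta < 1 for every output).
grpo = pi_ref / (1.0 - advantage / beta)
grpo /= grpo.sum()

print("log-pooling tilt:", rlhf.round(3))
print("GRPO-style tilt: ", grpo.round(3))
```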

2. Reward Preference Model: Shift-and-Scale Normalization

Within GRPO, the pivotal reward preference model applies both shift and scale normalization to the group of rewards. For a sampled group of $G$ outputs $\{o_1, \dots, o_G\}$ with corresponding rewards $\{r_1, \dots, r_G\}$, the advantage for output $o_i$ is

$$A_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}$$

This normalized advantage is central to training; it makes the learning signal invariant to positive affine transformations of the rewards (adding a constant to all rewards or scaling them by a positive factor), so policy updates emphasize ranking (relative order) rather than absolute reward values.
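
As a quick check of this invariance, the following sketch (Python/NumPy, toy rewards) computes the group-normalized advantages and verifies that they are unchanged under a positive affine transformation of the rewards.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Shift-and-scale normalization of one group of rewards (assumes they are not all equal)."""
    return (rewards - rewards.mean()) / rewards.std()

r = np.array([1.0, 3.0, 2.0, 0.5])   # toy rewards for a group of G = 4 outputs
r_affine = 10.0 * r + 7.0            # positive affine transformation of the same rewards

print(group_advantages(r).round(3))
print(np.allclose(group_advantages(r), group_advantages(r_affine)))  # True
```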

The policy update objective then takes the expectation of these importance-weighted advantages:

$$\mathcal{R}_G(\theta \mid q) = \mathbb{E}_{\{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, A_i \right]$$

This formulation weights the policy improvement by the (normalized) relative outperformance of an output, rather than by the output's raw reward.
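
A minimal PyTorch-style sketch of this surrogate for a single prompt, assuming per-output log-probabilities under the current and old policies are already available (names are placeholders; PPO-style ratio clipping and the KL penalty of Section 3 are omitted for brevity):

```python
import torch

def grpo_surrogate(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative surrogate R_G for one prompt: mean of ratio * advantage.

    logp_new: log pi_theta(o_i | q) for the G sampled outputs (requires grad)
    logp_old: log pi_theta_old(o_i | q) for the same outputs (treated as constant)
    rewards:  scalar reward r_i for each sampled output
    """
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * adv).mean()

# Example with made-up numbers for a group of G = 4 outputs:
logp_old = torch.tensor([-4.0, -3.5, -5.0, -4.2])
logp_new = logp_old.clone().requires_grad_(True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = -grpo_surrogate(logp_new, logp_old, rewards)  # maximize R_G
loss.backward()
```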

Significance: This normalization enhances training stability, avoids scale-sensitivity pathologies in reward models, and focuses policy search on the ordering induced locally within each sample batch.

3. Penalty Function and the Role of Reverse KL Divergence

GRPO incorporates a penalty function that, in its stationary form, closely approximates the reverse Kullback-Leibler divergence between the reference and candidate policies. For output $o_i$,

$$D_i(\theta) = \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1$$

and so, across a group,

$$\mathcal{D}(\theta \mid q) = \mathbb{E}_{\{o_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G D_i(\theta) \right]$$

For stationary policies ($\pi_{\theta_{\text{old}}} = \pi_\theta$), the gradient of this penalty matches the gradient of the reverse KL:

$$\frac{\partial}{\partial \pi_\theta(o \mid q)}\, \mathcal{D}(\theta \mid q) = -\frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} + 1$$

Hence, the regularization penalizes the divergence from the reference to the candidate policy (as opposed to standard RLHF, which penalizes candidate-to-reference divergence). This influences exploration and alignment behavior, promoting retention of the diversity present in the reference.
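
A minimal sketch of this penalty (PyTorch, made-up log-probabilities): writing $x = \log\pi_\theta(o_i \mid q) - \log\pi_{\text{ref}}(o_i \mid q)$, the per-output term is $e^{-x} + x - 1$, which is nonnegative and vanishes exactly when the two policies agree on $o_i$.

```python
import torch

def grpo_kl_penalty(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-output penalty D_i = pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.

    Equivalently exp(-x) + x - 1 with x = log pi_theta - log pi_ref,
    so each term is nonnegative and zero iff the policies agree on o_i.
    """
    x = logp_theta - logp_ref
    return torch.exp(-x) + x - 1.0

# Made-up log-probabilities for a group of G = 4 outputs:
logp_theta = torch.tensor([-4.0, -3.5, -5.0, -4.2], requires_grad=True)
logp_ref = torch.tensor([-4.1, -3.9, -4.8, -4.0])
penalty = grpo_kl_penalty(logp_theta, logp_ref).mean()  # group average D(theta | q)
penalty.backward()
```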

4. Pairwise Preference and the Connection to Pairwise Methods

When the group size $G = 2$, the group-normalized advantage reduces to a pairwise comparison, akin to algorithms based on explicit preference learning:

$$A_i = \text{sign}(r_i - r_j)$$

The group preference then expresses the probability that output $o$ outperforms $o'$ under the reward:

$$\mathcal{P}(o \succ o' \mid q) = \mathbb{P}\left[r(o \mid q) > r(o' \mid q)\right]$$

The reward preference model becomes:

$$\mathcal{R}_2(\theta) = \mathbb{E}_{q, o, o'}\left[\mathcal{P}(o \succ o' \mid q) - \mathcal{P}(o' \succ o \mid q)\right]$$

If preferences are antisymmetric, $\mathcal{R}_2(\theta) = 2\,\mathbb{E}_{q, o, o'}\left[\mathcal{P}(o \succ o' \mid q)\right] - 1$. This is directly analogous to other pairwise preference methods used in LLM and RL alignment.
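
A quick numerical check (Python/NumPy, made-up rewards) that for $G = 2$ the shift-and-scale normalization collapses to this sign comparison:

```python
import numpy as np

# For a group of size G = 2, shift-and-scale normalization yields
# A_1 = sign(r_1 - r_2) and A_2 = -A_1 (population std; assumes r_1 != r_2).
def group_advantages(rewards: np.ndarray) -> np.ndarray:
    return (rewards - rewards.mean()) / rewards.std()

r1, r2 = 0.3, 1.7            # made-up rewards for the two sampled outputs
adv = group_advantages(np.array([r1, r2]))
print(adv)                                  # [-1.  1.]
print(np.sign(r1 - r2), np.sign(r2 - r1))   # -1.0 1.0
```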

5. Dependence on Regularization and Confidence Parameters

The balance between the reference policy and reward preference is controlled by the regularization constant $\beta$ and the confidence margin of the reward comparisons. For binary outputs $a, b$ and group size two, the aligned probability assigned to $a$ is:

$$\pi_\theta(a \mid q) = \frac{1}{2}\left( 1 - \frac{\beta}{\gamma_{a,b}} + \sqrt{\left(1 - \frac{\beta}{\gamma_{a,b}}\right)^2 + 4\,\frac{\beta}{\gamma_{a,b}}\, \pi_{\text{ref}}(a \mid q)} \right)$$

where $\gamma_{a,b}$ is the group confidence margin.

  • As $\beta \to 0$, the model deterministically favors the preferred outcome ($\pi_\theta(a \mid q) \to 1$).
  • As $\beta$ increases, the reference policy increasingly dominates ($\pi_\theta(a \mid q) \to \pi_{\text{ref}}(a \mid q)$).
  • For large groups ($G \to \infty$), the dependence on the confidence margin disappears, and only $\beta$ determines the alignment.

Interpretation: This offers fine-grained control over the trade-off between reward signal exploitation and adherence to the base distribution.
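
A small sketch (Python/NumPy) that evaluates this closed form for a made-up reference probability and confidence margin, assuming $a$ is the preferred output, and illustrates the two limits in $\beta$:

```python
import numpy as np

def aligned_prob(pi_ref_a: float, beta: float, gamma: float) -> float:
    """Closed-form pi_theta(a|q) for binary outputs and G = 2 (formula above)."""
    t = beta / gamma
    return 0.5 * (1.0 - t + np.sqrt((1.0 - t) ** 2 + 4.0 * t * pi_ref_a))

pi_ref_a, gamma = 0.3, 1.0    # made-up reference probability and confidence margin
for beta in [1e-4, 0.1, 1.0, 10.0, 1e4]:
    print(beta, round(aligned_prob(pi_ref_a, beta, gamma), 4))
# beta -> 0 gives ~1.0 (deterministic preference); large beta gives ~pi_ref_a.
```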

6. Modifications and Connections to RLHF

  • Direct KL Penalty: If the regularization in GRPO is swapped for direct KL divergence (as in standard RLHF), the stationary policy recovers the RLHF (exponential) policy update:

$$\pi_\theta(o \mid q) = \frac{1}{Z_q} \exp\!\left( \frac{1}{\beta}\, \mathcal{P}_G(o \mid \pi_\theta(\cdot \mid q), q) \right) \pi_{\text{ref}}(o \mid q)$$

  • Omitting Scale Normalization: If only shift normalization is used (subtracting the group mean but not dividing by the group standard deviation), GRPO reduces to a standard RLHF objective, with updates anchored in the raw expected reward differences.

Significance: GRPO unifies and generalizes a family of RL-based alignment objectives. Adjusting its penalty and normalization choices smoothly interpolates between nonlinear scaling (unique to GRPO) and the exponential pooling of RLHF.
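
The following sketch (PyTorch, illustrative values only, not DeepSeek's training code) spells out the two modifications above side by side: a per-sample direct-KL estimator in place of the Section 3 penalty, and shift-only advantages in place of shift-and-scale.

```python
import torch

# Made-up log-probabilities and rewards for one group of G = 4 outputs.
logp_theta = torch.tensor([-4.0, -3.5, -5.0, -4.2])
logp_ref = torch.tensor([-4.1, -3.9, -4.8, -4.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# (a) Direct KL penalty (candidate -> reference), the standard RLHF-style
#     per-sample estimator, swapped in for GRPO's reverse-KL-like term:
direct_kl = (logp_theta - logp_ref).mean()

# (b) Shift-only normalization: subtract the group mean but keep the raw scale,
#     so updates are anchored in expected reward differences:
shift_only_adv = rewards - rewards.mean()

print(direct_kl, shift_only_adv)
```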

7. Implications and Theoretical Significance

GRPO formalizes a distinct path for preference aggregation and policy alignment:

  • Group-based, scale-invariant normalization of reward improves robustness to noisy, miscalibrated, or variable-scale reward models.
  • The use of reverse KL regularization instead of forward (direct) KL encourages solutions that more robustly preserve coverage of the action space, potentially avoiding over-concentration around single modes.
  • The framework admits analytic characterizations and connects group-preference aggregation to classical pairwise comparison and preference learning literature.
  • Through explicit parameterization, GRPO offers tunable interpolation between deterministic reward maximization and reference retention—allowing tailored policy anchoring for safety, exploration, or conservative generalization.

Summary Table: Key Differences between GRPO and RLHF

| Aspect | GRPO | RLHF |
| --- | --- | --- |
| Preference Aggregation | Nonlinear scaling via group-normalized advantage | Exponential (log-pooling) |
| Regularization | Reverse KL (reference → candidate) | Direct KL (candidate → reference) |
| Robustness to Reward Scale | Invariant (shift-and-scale) | Scale-sensitive |
| Pairwise Preference Match | Yes, at $G = 2$ | Yes, via pairwise DPO/NLHF |

GRPO's alignment objective delivers both theoretical generality and robust empirical performance for post-training large AI models, with practical significance for mathematical reasoning, language alignment, and beyond.