
Group Relative Policy Optimization (GRPO)

Updated 22 June 2025

Group Relative Policy Optimization (GRPO) is a reinforcement learning framework designed to align AI models—particularly LLMs—with complex objectives such as mathematical reasoning, language safety, and human preferences. GRPO plays a central role in post-training pipelines for advanced models like DeepSeek-R1-Zero and DeepSeekMath, offering a robust alternative to standard RLHF alignment by leveraging group-based normalization, a distinct preference aggregation strategy, and a reverse-KL regularization term. The foundational characteristics, mathematical structure, and implications of GRPO are summarized as follows.

1. Principles of Preference Aggregation in GRPO

GRPO aggregates preferences over groups of sampled outputs in a notably different manner from the logarithmic pooling of standard RLHF. Consider a context $q$ (e.g., a prompt or input) with candidate outputs $o \in \mathcal{O}$.

In RLHF, the optimal aligned policy is usually formed via logarithmic pooling:

$$\pi_\theta(o \mid q) = \frac{1}{Z_q}\, \pi_{\text{ref}}(o \mid q)\, \exp\!\left(\frac{1}{\beta}\, r_\phi(o \mid q)\right)$$

where $r_\phi(o \mid q)$ is the reward and $\pi_{\text{ref}}$ is a reference policy.

In contrast, GRPO constructs the aligned policy by scaling the reference probability via a nonlinear offset based on group-relative normalized advantages:

$$\pi_\theta(o \mid q) = g\!\left(\frac{\mathcal{P}_G(o \mid \pi_\theta(\cdot \mid q), q) - \mathbb{E}_{o'}\!\left[\mathcal{P}_G(o' \mid \pi_\theta(\cdot \mid q), q)\right]}{\beta}\right) \pi_{\text{ref}}(o \mid q)$$

where $g(x) = \frac{1}{1-x}$, $\mathcal{P}_G$ encapsulates the group-relative advantage, $\beta$ is the regularization parameter, and the expectation is over $o' \sim \pi_\theta(\cdot \mid q)$.

Key distinction: Instead of the exponential scaling of rewards typical of RLHF, GRPO applies a nonlinear rescaling based on centered, group-normalized advantages. This makes the update invariant to reward scale and translation, focusing optimization on relative merit within a sample group.
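
To make the contrast concrete, here is a minimal numerical sketch (Python/NumPy, with made-up reference probabilities and advantages) comparing the exponential tilt of logarithmic pooling with GRPO's $g(x) = 1/(1-x)$ rescaling on a three-output toy example. Because the GRPO expression above is a fixed-point characterization (the advantage depends on $\pi_\theta$ itself), the one-shot computation below is qualitative only.

```python
import numpy as np

# Toy comparison of the two aggregation rules (illustrative numbers only).
pi_ref = np.array([0.5, 0.3, 0.2])       # reference probabilities over 3 outputs
advantage = np.array([0.5, 0.0, -0.5])   # centered, group-normalized advantages
beta = 1.0                               # regularization strength

# RLHF / logarithmic pooling: exponential scaling, then renormalize.
rlhf = pi_ref * np.exp(advantage / beta)
rlhf /= rlhf.sum()

# GRPO-style rescaling: g(x) = 1/(1 - x) applied to advantage / beta
# (valid here because advantage / beta < 1 for every output).
grpo = pi_ref / (1.0 - advantage / beta)
grpo /= grpo.sum()

print("log-pooling tilt:", rlhf.round(3))
print("GRPO-style tilt: ", grpo.round(3))
```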

2. Reward Preference Model: Shift-and-Scale Normalization

Within GRPO, the pivotal reward preference model applies both shift and scale normalization to the group of rewards. For a sampled group of $G$ outputs $\{o_1, \dots, o_G\}$ with corresponding rewards $\{r_1, \dots, r_G\}$, the advantage for output $o_i$ is

$$A_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}$$

This normalized advantage is central to training; it makes the learning signal invariant to positive affine transformations of the rewards (adding a constant to all rewards or scaling them by a positive factor), so policy updates emphasize ranking (relative order) rather than absolute reward values.
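
As a quick check of this invariance, the following sketch (Python/NumPy, toy rewards) computes the group-normalized advantages and verifies that they are unchanged under a positive affine transformation of the rewards.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Shift-and-scale normalization of one group of rewards (assumes they are not all equal)."""
    return (rewards - rewards.mean()) / rewards.std()

r = np.array([1.0, 3.0, 2.0, 0.5])   # toy rewards for a group of G = 4 outputs
r_affine = 10.0 * r + 7.0            # positive affine transformation of the same rewards

print(group_advantages(r).round(3))
print(np.allclose(group_advantages(r), group_advantages(r_affine)))  # True
```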

The policy update objective then takes the expectation of these importance-weighted advantages:

$$\mathcal{R}_G(\theta \mid q) = \mathbb{E}_{\{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, A_i \right]$$

This formulation weights the policy improvement by the (normalized) relative outperformance of an output, rather than by the output's raw reward.
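
A minimal PyTorch-style sketch of this surrogate for a single prompt, assuming per-output log-probabilities under the current and old policies are already available (names are placeholders; PPO-style ratio clipping and the KL penalty of Section 3 are omitted for brevity):

```python
import torch

def grpo_surrogate(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative surrogate R_G for one prompt: mean of ratio * advantage.

    logp_new: log pi_theta(o_i | q) for the G sampled outputs (requires grad)
    logp_old: log pi_theta_old(o_i | q) for the same outputs (treated as constant)
    rewards:  scalar reward r_i for each sampled output
    """
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * adv).mean()

# Example with made-up numbers for a group of G = 4 outputs:
logp_old = torch.tensor([-4.0, -3.5, -5.0, -4.2])
logp_new = logp_old.clone().requires_grad_(True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = -grpo_surrogate(logp_new, logp_old, rewards)  # maximize R_G
loss.backward()
```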

Significance: This normalization enhances training stability, avoids scale-sensitivity pathologies in reward models, and focuses policy search on the ordering induced locally within each sample batch.

3. Penalty Function and the Role of Reverse KL Divergence

GRPO incorporates a penalty function that, in its stationary form, closely approximates the reverse Kullback-Leibler divergence between the reference and candidate policies. For output $o_i$,

$$D_i(\theta) = \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1$$

and so, across a group,

$$\mathcal{D}(\theta \mid q) = \mathbb{E}_{\{o_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G D_i(\theta) \right]$$

For stationary policies ($\pi_{\theta_{\text{old}}} = \pi_\theta$), the gradient of this penalty matches the gradient of the reverse KL:

$$\frac{\partial}{\partial \pi_\theta(o \mid q)}\, \mathcal{D}(\theta \mid q) = -\frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} + 1$$

Hence, the regularization penalizes the divergence from the reference to the candidate policy (as opposed to standard RLHF, which penalizes candidate-to-reference divergence). This influences exploration and alignment behavior, promoting retention of the diversity present in the reference.
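
A minimal sketch of this penalty (PyTorch, made-up log-probabilities): writing $x = \log\pi_\theta(o_i \mid q) - \log\pi_{\text{ref}}(o_i \mid q)$, the per-output term is $e^{-x} + x - 1$, which is nonnegative and vanishes exactly when the two policies agree on $o_i$.

```python
import torch

def grpo_kl_penalty(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-output penalty D_i = pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.

    Equivalently exp(-x) + x - 1 with x = log pi_theta - log pi_ref,
    so each term is nonnegative and zero iff the policies agree on o_i.
    """
    x = logp_theta - logp_ref
    return torch.exp(-x) + x - 1.0

# Made-up log-probabilities for a group of G = 4 outputs:
logp_theta = torch.tensor([-4.0, -3.5, -5.0, -4.2], requires_grad=True)
logp_ref = torch.tensor([-4.1, -3.9, -4.8, -4.0])
penalty = grpo_kl_penalty(logp_theta, logp_ref).mean()  # group average D(theta | q)
penalty.backward()
```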

4. Pairwise Preference and the Connection to Pairwise Methods

When the group size $G = 2$, the group-normalized advantage reduces to a pairwise comparison, akin to algorithms based on explicit preference learning:

$$A_i = \text{sign}(r_i - r_j)$$

The group preference then expresses the probability that output $o$ outperforms $o'$ under the reward:

$$\mathcal{P}(o \succ o' \mid q) = \mathbb{P}\left[r(o \mid q) > r(o' \mid q)\right]$$

The reward preference model becomes:

$$\mathcal{R}_2(\theta) = \mathbb{E}_{q, o, o'}\left[\mathcal{P}(o \succ o' \mid q) - \mathcal{P}(o' \succ o \mid q)\right]$$

If preferences are antisymmetric, $\mathcal{R}_2(\theta) = 2\,\mathbb{E}_{q, o, o'}\left[\mathcal{P}(o \succ o' \mid q)\right] - 1$. This is directly analogous to other pairwise preference methods used in LLM and RL alignment.
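
A quick numerical check (Python/NumPy, made-up rewards) that for $G = 2$ the shift-and-scale normalization collapses to this sign comparison:

```python
import numpy as np

# For a group of size G = 2, shift-and-scale normalization yields
# A_1 = sign(r_1 - r_2) and A_2 = -A_1 (population std; assumes r_1 != r_2).
def group_advantages(rewards: np.ndarray) -> np.ndarray:
    return (rewards - rewards.mean()) / rewards.std()

r1, r2 = 0.3, 1.7            # made-up rewards for the two sampled outputs
adv = group_advantages(np.array([r1, r2]))
print(adv)                                  # [-1.  1.]
print(np.sign(r1 - r2), np.sign(r2 - r1))   # -1.0 1.0
```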

5. Dependence on Regularization and Confidence Parameters

The balance between the reference policy and reward preference is controlled by the regularization constant $\beta$ and the confidence margin of the reward comparisons. For binary outputs $a, b$ and group size two, the aligned probability assigned to $a$ is:

$$\pi_\theta(a \mid q) = \frac{1}{2}\left( 1 - \frac{\beta}{\gamma_{a,b}} + \sqrt{\left(1 - \frac{\beta}{\gamma_{a,b}}\right)^2 + 4\,\frac{\beta}{\gamma_{a,b}}\, \pi_{\text{ref}}(a \mid q)} \right)$$

where $\gamma_{a,b}$ is the group confidence margin.

  • As $\beta \to 0$, the model deterministically favors the preferred outcome ($\pi_\theta(a \mid q) \to 1$).
  • As $\beta$ increases, the reference policy increasingly dominates ($\pi_\theta(a \mid q) \to \pi_{\text{ref}}(a \mid q)$).
  • For large groups ($G \to \infty$), the dependence on the confidence margin disappears, and only $\beta$ determines the alignment.

Interpretation: This offers fine-grained control over the trade-off between reward signal exploitation and adherence to the base distribution.
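
A small sketch (Python/NumPy) that evaluates this closed form for a made-up reference probability and confidence margin, assuming $a$ is the preferred output, and illustrates the two limits in $\beta$:

```python
import numpy as np

def aligned_prob(pi_ref_a: float, beta: float, gamma: float) -> float:
    """Closed-form pi_theta(a|q) for binary outputs and G = 2 (formula above)."""
    t = beta / gamma
    return 0.5 * (1.0 - t + np.sqrt((1.0 - t) ** 2 + 4.0 * t * pi_ref_a))

pi_ref_a, gamma = 0.3, 1.0    # made-up reference probability and confidence margin
for beta in [1e-4, 0.1, 1.0, 10.0, 1e4]:
    print(beta, round(aligned_prob(pi_ref_a, beta, gamma), 4))
# beta -> 0 gives ~1.0 (deterministic preference); large beta gives ~pi_ref_a.
```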

6. Modifications and Connections to RLHF

  • Direct KL Penalty: If the regularization in GRPO is swapped for direct KL divergence (as in standard RLHF), the stationary policy recovers the RLHF (exponential) policy update:

$$\pi_\theta(o \mid q) = \frac{1}{Z_q} \exp\!\left( \frac{1}{\beta}\, \mathcal{P}_G(o \mid \pi_\theta(\cdot \mid q), q) \right) \pi_{\text{ref}}(o \mid q)$$

  • Omitting Scale Normalization: If only shift normalization is used (subtracting the group mean but not dividing by the group standard deviation), GRPO reduces to a standard RLHF objective, with updates anchored in the raw expected reward differences.

Significance: GRPO unifies and generalizes a family of RL-based alignment objectives. Adjusting its penalty and normalization choices smoothly interpolates between nonlinear scaling (unique to GRPO) and the exponential pooling of RLHF.
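
The following sketch (PyTorch, illustrative values only, not DeepSeek's training code) spells out the two modifications above side by side: a per-sample direct-KL estimator in place of the Section 3 penalty, and shift-only advantages in place of shift-and-scale.

```python
import torch

# Made-up log-probabilities and rewards for one group of G = 4 outputs.
logp_theta = torch.tensor([-4.0, -3.5, -5.0, -4.2])
logp_ref = torch.tensor([-4.1, -3.9, -4.8, -4.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# (a) Direct KL penalty (candidate -> reference), the standard RLHF-style
#     per-sample estimator, swapped in for GRPO's reverse-KL-like term:
direct_kl = (logp_theta - logp_ref).mean()

# (b) Shift-only normalization: subtract the group mean but keep the raw scale,
#     so updates are anchored in expected reward differences:
shift_only_adv = rewards - rewards.mean()

print(direct_kl, shift_only_adv)
```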

7. Implications and Theoretical Significance

GRPO formalizes a distinct path for preference aggregation and policy alignment:

  • Group-based, scale-invariant normalization of reward improves robustness to noisy, miscalibrated, or variable-scale reward models.
  • The use of reverse KL regularization instead of forward (direct) KL encourages solutions that more robustly preserve coverage of the action space, potentially avoiding over-concentration around single modes.
  • The framework admits analytic characterizations and connects group-preference aggregation to classical pairwise comparison and preference learning literature.
  • Through explicit parameterization, GRPO offers tunable interpolation between deterministic reward maximization and reference retention—allowing tailored policy anchoring for safety, exploration, or conservative generalization.

Summary Table: Key Differences between GRPO and RLHF

| Aspect | GRPO | RLHF |
| --- | --- | --- |
| Preference Aggregation | Nonlinear scaling via group-normalized advantage | Exponential (log-pooling) |
| Regularization | Reverse KL (reference → candidate) | Direct KL (candidate → reference) |
| Robustness to Reward Scale | Invariant (shift-and-scale) | Scale-sensitive |
| Pairwise Preference Match | Yes, at $G = 2$ | Yes, via pairwise DPO/NLHF |

GRPO's alignment objective delivers both theoretical generality and robust empirical performance for post-training large AI models, with practical significance for mathematical reasoning, language alignment, and beyond.