Debiased GRPO: Aligning Policy with Direct KL

Updated 21 October 2025
  • The paper introduces debiased GRPO by replacing the reverse KL penalty with a direct KL term and removing reward scale normalization to mitigate bias.
  • It realigns policy updates to use raw reward differences, yielding a standard logarithmic pooling effect that improves numerical stability and aggregation consistency.
  • The modifications enhance interpretability and robustness by reducing sensitivity to group reward variance and providing more predictable, monotonic policy behavior.

A debiased variant of Group Relative Policy Optimization (GRPO) refers to algorithmic modifications that address inherent sources of bias in GRPO’s preference aggregation, advantage calculation, or policy update, as identified through theoretical and empirical analyses. Biases in GRPO arise from aspects such as the penalty function choice (reverse vs. direct KL divergence), scale normalization of rewards, and the resulting preference aggregation rules. The debiased variants aim to yield aggregation rules more consistent with direct preference learning or to correct for miscalibration introduced by GRPO’s group-relative normalization, thereby producing stationary policies with alignment objectives closer to standard logarithmic pooling (as in conventional RLHF) or to raw, unnormalized reward differences. These modifications have direct consequences on alignment, numerical stability, and the interpretability of policy updates.

1. Alignment Objective and Core Mechanism of GRPO

GRPO is constructed to learn a policy $\pi_\theta$ that aligns with a reference policy $\pi_{\text{ref}}$ under a preference model informed by group-based reward normalization. The standard GRPO objective is:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim \mu,\ \{o_i\} \sim \pi_{\theta^{\text{old}}}}\left[\frac{1}{G} \sum_i \left(\tilde{A}_i(\theta) - \beta D_i(\theta)\right)\right],$$

where the normalized advantage for response $i$ in a group of size $G$ is computed as:

$$A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)},$$

and the penalty $D_i(\theta)$ approximates the reverse KL divergence (i.e., $\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta)$) via:

$$D_i(\theta) = \frac{\pi_{\text{ref}}(o_i|q)}{\pi_\theta(o_i|q)} - \log\left[\frac{\pi_{\text{ref}}(o_i|q)}{\pi_\theta(o_i|q)}\right] - 1.$$

The stationary policy induced by this objective features an aggregation rule which is fundamentally distinct from standard logarithmic pooling: it scales $\pi_{\text{ref}}$ using a reciprocal function of the (centered and normalized) group preference, rather than an exponential transformation.
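The two per-response quantities above can be sketched numerically; a minimal numpy sketch, with illustrative function names:

```python
import numpy as np

def grpo_advantages(rewards):
    """Z-score advantages over a sampled group (population std, as in A_i)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

def grpo_penalty(logp_ref, logp_theta):
    """Per-response penalty D_i = ratio - log(ratio) - 1, with ratio = pi_ref/pi_theta."""
    log_ratio = logp_ref - logp_theta
    return np.exp(log_ratio) - log_ratio - 1.0

adv = grpo_advantages([1.0, 2.0, 4.0, 5.0])
print(adv)  # zero-mean, unit-std z-scores
print(grpo_penalty(np.log(0.2), np.log(0.25)))  # nonnegative, zero iff ratio == 1
```

Note that the penalty is nonnegative by construction and vanishes exactly when $\pi_\theta$ matches $\pi_{\text{ref}}$ on the sampled response.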

2. Analysis of Preference Aggregation and Bias

The stationary policy condition derived for GRPO is:

$$\pi_\theta(o|q) \;\propto\; \frac{\pi_{\text{ref}}(o|q)}{1 - \frac{1}{\beta}\,\mathcal{P}(o|q)},$$

where $\mathcal{P}(o|q)$ is the group-relative preference produced by the normalized reward model. Because $\mathcal{P}$ itself depends on the sampling policy, this results in a fixed-point recursion that cannot be reformulated as standard log-opinion pooling. The penalty term's gradient with respect to $\pi_\theta$ matches that of the reverse KL divergence.

This setup introduces structural bias because:

  • The reverse KL penalty pulls the policy away from low-probability (under the reference) regions more aggressively than it rewards new high-reward mass, potentially distorting aggregation compared to direct KL methods.
  • Scale-normalization of group rewards causes the relative weightings of rewards (compared to the reference distribution) to depend strongly on within-group reward variance, introducing variance-related bias.

As a consequence, the resulting aggregation policy does not exhibit the exponential weighting found in log-pooling and RLHF; instead, it follows a rational transformation:

$$\pi_\theta(o|q) \;\propto\; \pi_{\text{ref}}(o|q) \cdot \frac{1}{1 - \mathcal{P}(o|q)/\beta},$$

where $\mathcal{P}(o|q)$ is the group-relative preference. This leads to "nonlinear" weighting and can produce discontinuities or over-/under-weighting near points where $\mathcal{P}(o|q)$ approaches $\beta$, unlike the exponential mechanism of log pooling.
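The contrast between reciprocal and exponential scaling can be seen numerically. The sketch below assumes the reciprocal form described above, $\pi \propto \pi_{\text{ref}} / (1 - \mathcal{P}/\beta)$, with illustrative values:

```python
import numpy as np

beta = 1.0
pref = np.array([0.1, 0.5, 0.9, 0.99])  # group-relative preference values (illustrative)
pi_ref = 0.25                            # reference probability mass (illustrative)

rational = pi_ref / (1.0 - pref / beta)      # reciprocal scaling (GRPO-style)
exponential = pi_ref * np.exp(pref / beta)   # log-pooling scaling

print(rational)     # diverges as pref approaches beta
print(exponential)  # grows smoothly, no singularity
```

The reciprocal weight blows up as the preference nears $\beta$, while the exponential weight of log pooling remains finite for all preference values.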

3. Debiasing Modifications: Direct KL and Raw Rewards

Two modifications comprehensively debias GRPO:

  1. Direct KL Penalty: By reparameterizing the penalty term to use the direct KL divergence

$$\mathrm{KL}\left(\pi_\theta(\cdot|q)\,\|\,\pi_{\text{ref}}(\cdot|q)\right) = \sum_{o} \pi_\theta(o|q)\,\log\frac{\pi_\theta(o|q)}{\pi_{\text{ref}}(o|q)}$$

instead of reverse KL, the fixed-point aggregation changes to

$$\pi_\theta(o|q) \;\propto\; \pi_{\text{ref}}(o|q)\,\exp\!\left(\frac{1}{\beta}\,\mathcal{P}(o|q)\right),$$

which is the canonical log-opinion pooling rule also used in RLHF and Nash Learning from Human Feedback (NLHF) settings.

  2. Removal of Scale Normalization: Omitting the division by $\text{std}(r_1, \ldots, r_G)$ makes the group-relative preference, and hence the advantage, a raw difference $r_i - \text{mean}(r_1, \ldots, r_G)$ instead of a z-score, yielding advantage calculations based on pure (shifted) reward differences. This modification removes the implicit weighting of reward differences by group reward variance, unbiasing the aggregation toward the true mean difference.

Both adjustments re-align the policy update with calibrated, invariant preference aggregation, correcting the nonlinear scaling and variance-based distortions present in original GRPO.
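Both fixes can be sketched together: shift-only advantages plus the canonical log-pooling rule. A minimal numpy sketch with illustrative values:

```python
import numpy as np

def debiased_advantages(rewards):
    """Shift-only normalization: raw differences from the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def log_pool(pi_ref, rewards, beta):
    """Canonical log-opinion pooling: pi(o) proportional to pi_ref(o) * exp(r(o)/beta)."""
    w = np.asarray(pi_ref, dtype=float) * np.exp(np.asarray(rewards, dtype=float) / beta)
    return w / w.sum()

pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 2.0, 0.0])
print(log_pool(pi_ref, rewards, beta=1.0))  # shift-invariant in the rewards
```

Because the exponential of a common shift cancels in the normalization, the pooled policy is invariant to adding a constant to all rewards, while remaining sensitive to the size of the reward gaps, exactly the invariance profile the debiased variant targets.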

4. Comparative Properties and Binary Case Analysis

The paper provides explicit formulae for the aggregate preference in specific settings:

  • Groups of size two: The group-relative preference corresponds to pairwise comparison, reducing the aggregation to a signed difference similar to the structure used in pairwise comparison feedback and DPO. This establishes formal equivalence in the limit and highlights the connection with pairwise RLHF methods.
  • Binary questions: The stationary policy’s fixed-point equation becomes quadratic in $\pi_\theta$, exhibiting a non-exponential dependence on the confidence margin (the degree to which the reward model favors one answer over the other). This differs from the log-linear solution in RLHF.

Consequently, standard GRPO can either excessively emphasize or underweight reward differences depending on $\beta$, the group reward variance, and the structure of $\pi_{\text{ref}}$, whereas the debiased variants (direct KL, raw rewards) deliver more balanced aggregation.
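The pairwise reduction for groups of size two can be checked directly: with $G = 2$ and the population standard deviation, the z-scored advantage is always exactly $\pm 1$, so only the sign of the reward difference survives (an illustrative check):

```python
import numpy as np

def z_advantage(rewards):
    """GRPO's z-scored advantage (population std)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

# For any two distinct rewards the z-scored advantage is exactly (+1, -1):
print(z_advantage([3.0, 1.0]))    # [ 1. -1.]
print(z_advantage([100.0, 1.0]))  # [ 1. -1.] -- the size of the gap is erased
```

This is why the size-two case reduces to a signed, DPO-like comparison: the magnitude of the reward gap is normalized away entirely.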

5. Parameter Sensitivity and Aggregate Preference Structure

The analysis reveals strong parameter dependencies:

  • The stationary distribution in GRPO is highly sensitive to the ratio of the reward scale to the regularization coefficient $\beta$, and to the group size.
  • For large groups, the effective regularization is modulated by the standard deviation of the group reward, acting as a “dynamic” $\beta$ that further shifts the aggregation.

With the debiased modifications, this sensitivity is reduced, and aggregation becomes a function of raw reward differences and the direct regularization parameter, restoring the desirable monotonicity and invariance properties found in log-pooling and pairwise preference approaches.
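The variance sensitivity is easy to demonstrate numerically: scaling all rewards in a group leaves z-scored advantages unchanged (the scale information is discarded), while shift-only advantages preserve it (an illustrative sketch):

```python
import numpy as np

def z_adv(r):
    r = np.asarray(r, dtype=float)
    return (r - r.mean()) / r.std()

def shift_adv(r):
    r = np.asarray(r, dtype=float)
    return r - r.mean()

r1 = [0.0, 1.0, 2.0, 3.0]
r2 = [0.0, 3.0, 6.0, 9.0]  # same ordering, reward gaps scaled by 3

print(z_adv(r1), z_adv(r2))          # identical: scale information is discarded
print(shift_adv(r1), shift_adv(r2))  # second is exactly 3x the first
```

Under z-scoring, the effective push on the policy is the same whether the best response wins by a little or by a lot; under shift-only normalization, larger reward gaps produce proportionally larger updates.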

6. Practical Implications and Synthesis

Applying these debiasing modifications (direct KL penalty, shift-only normalization) in practice yields several improvements:

  • The aggregation of preferences becomes affine-invariant and monotonic in the raw reward—critical for interpretability and robustness under varying group statistics and reward scales.
  • The potential for discontinuities or singularities (e.g., when the group-relative preference approaches $\beta$) is mitigated, removing artifacts that could otherwise lead to undesirable training instabilities or “unlearning” in certain output regions.
  • The stationary solution respects both the strength of individual reward differences and the baseline structure imposed by the reference policy.

In summary, the debiased variant of GRPO is achieved by (i) replacing the reverse KL penalty with the direct KL divergence term in the objective, such that the aggregation becomes standard logarithmic pooling, and (ii) eliminating scale normalization in the advantage, so reward aggregation depends directly on raw (shift-normalized) differences. These changes result in a policy optimization scheme that is free of variance-based biases, aligns more closely with established preference aggregation methods, and produces preferences that are stable, interpretable, and theoretically well-justified. This analysis clarifies the central role of the penalty formulation and reward normalization in determining the nature and quality of preference aggregation within GRPO frameworks, offering guidance for algorithm design in LLM alignment and beyond (Vojnovic et al., 25 Feb 2025).
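As a quick numerical check of point (i), the log-pooling policy is the maximizer of the direct-KL-regularized objective $\mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$. A small sketch over a three-outcome toy problem (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
pi_ref = np.array([0.5, 0.3, 0.2])   # illustrative reference policy
r = np.array([1.0, 2.0, 0.0])        # illustrative raw rewards
beta = 1.0

def objective(pi):
    """Direct-KL-regularized value: E_pi[r] - beta * KL(pi || pi_ref)."""
    return np.dot(pi, r) - beta * np.sum(pi * np.log(pi / pi_ref))

# Log-pooling stationary policy: pi*(o) proportional to pi_ref(o) * exp(r(o)/beta).
star = pi_ref * np.exp(r / beta)
star /= star.sum()

# No randomly drawn policy on the simplex beats the log-pooling solution:
best_random = max(objective(rng.dirichlet(np.ones(3))) for _ in range(1000))
print(objective(star) >= best_random)  # True
```

This is the standard RLHF optimality result; the sketch merely confirms it empirically for a toy instance, illustrating what the debiased fixed point looks like in practice.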
