Debiased GRPO: Aligning Policy with Direct KL

Updated 21 October 2025
  • The paper introduces debiased GRPO by replacing the reverse KL penalty with a direct KL term and removing reward scale normalization to mitigate bias.
  • It realigns policy updates to use raw reward differences, yielding a standard logarithmic pooling effect that improves numerical stability and aggregation consistency.
  • The modifications enhance interpretability and robustness by reducing sensitivity to group reward variance and providing more predictable, monotonic policy behavior.

A debiased variant of Group Relative Policy Optimization (GRPO) refers to algorithmic modifications that address inherent sources of bias in GRPO’s preference aggregation, advantage calculation, or policy update, as identified through theoretical and empirical analyses. Biases in GRPO arise from aspects such as the penalty function choice (reverse vs. direct KL divergence), scale normalization of rewards, and the resulting preference aggregation rules. The debiased variants aim to yield aggregation rules more consistent with direct preference learning or to correct for miscalibration introduced by GRPO’s group-relative normalization, thereby producing stationary policies with alignment objectives closer to standard logarithmic pooling (as in conventional RLHF) or to raw, unnormalized reward differences. These modifications have direct consequences on alignment, numerical stability, and the interpretability of policy updates.

1. Alignment Objective and Core Mechanism of GRPO

GRPO is constructed to learn a policy $\pi_\theta$ that aligns with a reference policy $\pi_{\text{ref}}$ under a preference model informed by group-based reward normalization. The standard GRPO objective is:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim \mu,\ \{o_i\} \sim \pi_{\theta^{\text{old}}}}\left[\frac{1}{G} \sum_i \left(\tilde{A}_i(\theta) - \beta D_i(\theta)\right)\right],$$

where the normalized advantage for response $i$ in a group of size $G$ is computed as:

$$A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)},$$

and the penalty $D_i(\theta)$ approximates the reverse KL divergence (i.e., $\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta)$) via:

$$D_i(\theta) = \frac{\pi_{\text{ref}}(o_i|q)}{\pi_\theta(o_i|q)} - \log\left[\frac{\pi_{\text{ref}}(o_i|q)}{\pi_\theta(o_i|q)}\right] - 1.$$

The stationary policy induced by this objective features an aggregation rule which is fundamentally distinct from standard logarithmic pooling: it scales $\pi_{\text{ref}}$ using a reciprocal function of the (centered and normalized) group preference, rather than an exponential transformation.
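The two per-response quantities above can be sketched numerically; a minimal numpy sketch, with illustrative function names:

```python
import numpy as np

def grpo_advantages(rewards):
    """Z-score advantages over a sampled group (population std, as in A_i)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

def grpo_penalty(logp_ref, logp_theta):
    """Per-response penalty D_i = ratio - log(ratio) - 1, with ratio = pi_ref/pi_theta."""
    log_ratio = logp_ref - logp_theta
    return np.exp(log_ratio) - log_ratio - 1.0

adv = grpo_advantages([1.0, 2.0, 4.0, 5.0])
print(adv)  # zero-mean, unit-std z-scores
print(grpo_penalty(np.log(0.2), np.log(0.25)))  # nonnegative, zero iff ratio == 1
```

Note that the penalty is nonnegative by construction and vanishes exactly when $\pi_\theta$ matches $\pi_{\text{ref}}$ on the sampled response.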

2. Analysis of Preference Aggregation and Bias

The stationary policy condition derived for GRPO is:

$$\pi_\theta(o|q) \;\propto\; \frac{\pi_{\text{ref}}(o|q)}{1 - \frac{1}{\beta}\,\mathcal{P}(o|q)},$$

where $\mathcal{P}(o|q)$ is the group-relative preference produced by the normalized reward model. Because $\mathcal{P}$ itself depends on the sampling policy, this results in a fixed-point recursion that cannot be reformulated as standard log-opinion pooling. The penalty term's gradient with respect to $\pi_\theta$ matches that of the reverse KL divergence.

This setup introduces structural bias because:

  • The reverse KL penalty pulls the policy away from low-probability (under the reference) regions more aggressively than it rewards new high-reward mass, potentially distorting aggregation compared to direct KL methods.
  • Scale-normalization of group rewards causes the relative weightings of rewards (compared to the reference distribution) to depend strongly on within-group reward variance, introducing variance-related bias.

As a consequence, the resulting aggregation policy does not exhibit the exponential weighting found in log-pooling and RLHF; instead, it follows a rational transformation:

$$\pi_\theta(o|q) \;\propto\; \pi_{\text{ref}}(o|q) \cdot \frac{1}{1 - \mathcal{P}(o|q)/\beta},$$

where $\mathcal{P}(o|q)$ is the group-relative preference. This leads to "nonlinear" weighting and can produce discontinuities or over-/under-weighting near points where $\mathcal{P}(o|q)$ approaches $\beta$, unlike the exponential mechanism of log pooling.
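The contrast between reciprocal and exponential scaling can be seen numerically. The sketch below assumes the reciprocal form described above, $\pi \propto \pi_{\text{ref}} / (1 - \mathcal{P}/\beta)$, with illustrative values:

```python
import numpy as np

beta = 1.0
pref = np.array([0.1, 0.5, 0.9, 0.99])  # group-relative preference values (illustrative)
pi_ref = 0.25                            # reference probability mass (illustrative)

rational = pi_ref / (1.0 - pref / beta)      # reciprocal scaling (GRPO-style)
exponential = pi_ref * np.exp(pref / beta)   # log-pooling scaling

print(rational)     # diverges as pref approaches beta
print(exponential)  # grows smoothly, no singularity
```

The reciprocal weight blows up as the preference nears $\beta$, while the exponential weight of log pooling remains finite for all preference values.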

3. Debiasing Modifications: Direct KL and Raw Rewards

Two modifications comprehensively debias GRPO:

  1. Direct KL Penalty: By reparameterizing the penalty term to use the direct KL divergence

$$\mathrm{KL}\left(\pi_\theta(\cdot|q)\,\|\,\pi_{\text{ref}}(\cdot|q)\right) = \sum_{o} \pi_\theta(o|q)\,\log\frac{\pi_\theta(o|q)}{\pi_{\text{ref}}(o|q)}$$

instead of reverse KL, the fixed-point aggregation changes to

$$\pi_\theta(o|q) \;\propto\; \pi_{\text{ref}}(o|q)\,\exp\!\left(\frac{1}{\beta}\,\mathcal{P}(o|q)\right),$$

which is the canonical log-opinion pooling rule also used in RLHF and Nash Learning from Human Feedback (NLHF) settings.

  2. Removal of Scale Normalization: Omitting the division by $\text{std}(r_1, \ldots, r_G)$ makes the group-relative preference, and hence the advantage, a raw difference $r_i - \text{mean}(r_1, \ldots, r_G)$ instead of a z-score, yielding advantage calculations based on pure (shifted) reward differences. This modification removes the implicit weighting of reward differences by group reward variance, unbiasing the aggregation toward the true mean difference.

Both adjustments re-align the policy update with calibrated, invariant preference aggregation, correcting the nonlinear scaling and variance-based distortions present in original GRPO.
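Both fixes can be sketched together: shift-only advantages plus the canonical log-pooling rule. A minimal numpy sketch with illustrative values:

```python
import numpy as np

def debiased_advantages(rewards):
    """Shift-only normalization: raw differences from the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def log_pool(pi_ref, rewards, beta):
    """Canonical log-opinion pooling: pi(o) proportional to pi_ref(o) * exp(r(o)/beta)."""
    w = np.asarray(pi_ref, dtype=float) * np.exp(np.asarray(rewards, dtype=float) / beta)
    return w / w.sum()

pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 2.0, 0.0])
print(log_pool(pi_ref, rewards, beta=1.0))  # shift-invariant in the rewards
```

Because the exponential of a common shift cancels in the normalization, the pooled policy is invariant to adding a constant to all rewards, while remaining sensitive to the size of the reward gaps, exactly the invariance profile the debiased variant targets.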

4. Comparative Properties and Binary Case Analysis

The paper provides explicit formulae for the aggregate preference in specific settings:

  • Groups of size two: The group-relative preference corresponds to pairwise comparison, reducing the aggregation to a signed difference similar to the structure used in pairwise comparison feedback and DPO. This establishes formal equivalence in the limit and highlights the connection with pairwise RLHF methods.
  • Binary questions: The stationary policy’s fixed-point equation becomes quadratic in $\pi_\theta$, exhibiting a non-exponential dependence on the confidence margin (the degree to which the reward model favors one answer over the other). This differs from the log-linear solution in RLHF.

Consequently, standard GRPO can either excessively emphasize or underweight reward differences depending on $\beta$, the group reward variance, and the structure of $\pi_{\text{ref}}$, whereas the debiased variants (direct KL, raw rewards) deliver more balanced aggregation.
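The pairwise reduction for groups of size two can be checked directly: with $G = 2$ and the population standard deviation, the z-scored advantage is always exactly $\pm 1$, so only the sign of the reward difference survives (an illustrative check):

```python
import numpy as np

def z_advantage(rewards):
    """GRPO's z-scored advantage (population std)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

# For any two distinct rewards the z-scored advantage is exactly (+1, -1):
print(z_advantage([3.0, 1.0]))    # [ 1. -1.]
print(z_advantage([100.0, 1.0]))  # [ 1. -1.] -- the size of the gap is erased
```

This is why the size-two case reduces to a signed, DPO-like comparison: the magnitude of the reward gap is normalized away entirely.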

5. Parameter Sensitivity and Aggregate Preference Structure

The analysis reveals strong parameter dependencies:

  • The stationary distribution in GRPO is highly sensitive to the ratio of the reward scale to the regularization coefficient $\beta$, and to the group size.
  • For large groups, the effective regularization is modulated by the standard deviation of the group reward, acting as a “dynamic” $\beta$ that further shifts the aggregation.

With the debiased modifications, this sensitivity is reduced, and aggregation becomes a function of raw reward differences and the direct regularization parameter, restoring the desirable monotonicity and invariance properties found in log-pooling and pairwise preference approaches.
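The variance sensitivity is easy to demonstrate numerically: scaling all rewards in a group leaves z-scored advantages unchanged (the scale information is discarded), while shift-only advantages preserve it (an illustrative sketch):

```python
import numpy as np

def z_adv(r):
    r = np.asarray(r, dtype=float)
    return (r - r.mean()) / r.std()

def shift_adv(r):
    r = np.asarray(r, dtype=float)
    return r - r.mean()

r1 = [0.0, 1.0, 2.0, 3.0]
r2 = [0.0, 3.0, 6.0, 9.0]  # same ordering, reward gaps scaled by 3

print(z_adv(r1), z_adv(r2))          # identical: scale information is discarded
print(shift_adv(r1), shift_adv(r2))  # second is exactly 3x the first
```

Under z-scoring, the effective push on the policy is the same whether the best response wins by a little or by a lot; under shift-only normalization, larger reward gaps produce proportionally larger updates.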

6. Practical Implications and Synthesis

Applying these debiasing modifications (direct KL penalty, shift-only normalization) in practice yields several improvements:

  • The aggregation of preferences becomes affine-invariant and monotonic in the raw reward—critical for interpretability and robustness under varying group statistics and reward scales.
  • The potential for discontinuities or singularities (e.g., when the group-relative preference approaches $\beta$) is mitigated, removing artifacts that could otherwise lead to undesirable training instabilities or “unlearning” in certain output regions.
  • The stationary solution respects both the strength of individual reward differences and the baseline structure imposed by the reference policy.

In summary, the debiased variant of GRPO is achieved by (i) replacing the reverse KL penalty with the direct KL divergence term in the objective, such that the aggregation becomes standard logarithmic pooling, and (ii) eliminating scale normalization in the advantage, so reward aggregation depends directly on raw (shift-normalized) differences. These changes result in a policy optimization scheme that is free of variance-based biases, aligns more closely with established preference aggregation methods, and produces preferences that are stable, interpretable, and theoretically well-justified. This analysis clarifies the central role of the penalty formulation and reward normalization in determining the nature and quality of preference aggregation within GRPO frameworks, offering guidance for algorithm design in LLM alignment and beyond (Vojnovic et al., 25 Feb 2025).
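As a quick numerical check of point (i), the log-pooling policy is the maximizer of the direct-KL-regularized objective $\mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$. A small sketch over a three-outcome toy problem (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
pi_ref = np.array([0.5, 0.3, 0.2])   # illustrative reference policy
r = np.array([1.0, 2.0, 0.0])        # illustrative raw rewards
beta = 1.0

def objective(pi):
    """Direct-KL-regularized value: E_pi[r] - beta * KL(pi || pi_ref)."""
    return np.dot(pi, r) - beta * np.sum(pi * np.log(pi / pi_ref))

# Log-pooling stationary policy: pi*(o) proportional to pi_ref(o) * exp(r(o)/beta).
star = pi_ref * np.exp(r / beta)
star /= star.sum()

# No randomly drawn policy on the simplex beats the log-pooling solution:
best_random = max(objective(rng.dirichlet(np.ones(3))) for _ in range(1000))
print(objective(star) >= best_random)  # True
```

This is the standard RLHF optimality result; the sketch merely confirms it empirically for a toy instance, illustrating what the debiased fixed point looks like in practice.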
