Dr. GRPO: Correcting Token Aggregation Bias
- Dr. GRPO is a reinforcement learning method that redefines token aggregation by replacing per-response normalization with group-level scaling.
- It corrects the length bias inherent in standard GRPO by using the group mean length as a uniform scaling factor for token loss.
- Empirical results show that while Dr. GRPO mitigates verbosity bias, adaptive approaches like λ-GRPO outperform it in stability and accuracy.
Dr. GRPO is a modification to the Group Relative Policy Optimization (GRPO) framework that addresses systematic biases in the way token-level policy gradients are aggregated, particularly concerning the influence of sequence length. Within the broader context of RLVR (Reinforcement Learning with Verifiable Rewards) for fine-tuning LLMs, Dr. GRPO arises from the observation that standard GRPO, and several of its variants, rely on heuristic and fixed rules for weighting contributions of individual tokens—rules that are suboptimal for both stability and generalization. Dr. GRPO specifically replaces per-response length normalization in the loss with a group-level averaging strategy, which is then further unified and generalized by the learnable λ-GRPO framework (Wang et al., 8 Oct 2025).
1. Motivation: Token Aggregation Bias in GRPO
Standard GRPO computes the objective over groups of $G$ candidate completions, assigning each response $o_i$ a scalar reward and distributing the corresponding group-relative advantage $\hat{A}_i$ uniformly to every token in $o_i$. In this baseline formulation, each token's loss is weighted by a factor of $1/|o_i|$, and the per-sequence losses are averaged across the group. This introduces an explicit length bias: because the per-token weight shrinks as a response grows, long incorrect responses receive a much smaller per-token penalty than concise ones, which in practice incentivizes verbosity even when it is unnecessary. Such aggregation strategies are purely heuristic, fixed for all tasks and datasets, and cannot adapt if concise or verbose answers are contextually preferable.
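A toy calculation (not from the paper) makes the dilution effect concrete: two responses judged equally wrong receive very different per-token penalties under the $1/|o_i|$ rule.

```python
# Toy illustration (not from the paper): per-token weight under GRPO's
# per-response normalization for two responses with the same advantage.
advantage = -1.0          # both responses are judged equally wrong
short_len, long_len = 5, 50

w_short = 1.0 / short_len  # 0.2  -> each wrong token is penalized strongly
w_long = 1.0 / long_len    # 0.02 -> each wrong token is penalized 10x less

print(f"per-token penalty (short): {advantage * w_short:.3f}")
print(f"per-token penalty (long):  {advantage * w_long:.3f}")
# Both responses contribute the same total loss, but the long response
# spreads the penalty across more tokens, so verbose failures are
# punished less per token than concise ones.
```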
Dr. GRPO was proposed to correct this systematic bias by altering how the token-level gradients are normalized, aiming to mitigate both the length bias and any unwanted variance-induced artifacts (see Section: Methodology under "Dr. GRPO" in (Wang et al., 8 Oct 2025)).
2. Formulation and Mechanics of Dr. GRPO
Dr. GRPO modifies the standard aggregation of the GRPO policy loss. Under vanilla GRPO, the objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\left(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}\!\left(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right],$$

where $|o_i|$ is the length of response $o_i$, $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance weight, and $\hat{A}_i$ is the group-relative advantage.
Dr. GRPO replaces the per-response normalization $1/|o_i|$ with a single group-level factor $1/\bar{L}$, where $\bar{L} = \frac{1}{G}\sum_{i=1}^{G}|o_i|$ is the group's mean response length:

$$\mathcal{J}_{\text{Dr.GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{\bar{L}}\sum_{t=1}^{|o_i|}\min\!\left(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}\!\left(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right],$$

with the difference that now every token in every sample receives the same scaling factor (determined by the group mean length), irrespective of individual sequence length. This eliminates length normalization on a per-response basis, pushing the objective towards a length-neutral aggregation.
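For concreteness, the NumPy sketch below implements both aggregation rules for a single group. It is an illustrative reimplementation based on the formulas above, not the authors' code; the function name and the `scheme` argument are hypothetical.

```python
import numpy as np

def grpo_group_loss(ratios, advantages, eps=0.2, scheme="grpo"):
    """Clipped surrogate loss for one group of G sampled responses.

    ratios:     list of 1-D arrays; ratios[i][t] is pi_theta / pi_old for token t of response i
    advantages: array of shape (G,), the group-relative advantage of each response
    scheme:     "grpo"    -> per-response normalization 1/|o_i|
                "dr_grpo" -> uniform scaling 1/L_bar, with L_bar the group mean length
    """
    G = len(ratios)
    mean_len = np.mean([len(r) for r in ratios])  # group mean response length L_bar
    total = 0.0
    for i, r in enumerate(ratios):
        # PPO-style clipped per-token surrogate with a shared sequence-level advantage
        surrogate = np.minimum(r * advantages[i],
                               np.clip(r, 1 - eps, 1 + eps) * advantages[i])
        if scheme == "grpo":
            weight = 1.0 / len(r)        # vanilla GRPO: normalize by this response's length
        else:
            weight = 1.0 / mean_len      # Dr. GRPO: same scaling for every token in the group
        total += weight * surrogate.sum()
    return -(total / G)                  # negate: minimize the loss, maximize the objective

# Toy usage: one short correct response, one long incorrect response.
ratios = [np.array([1.05, 0.95, 1.10]), np.array([0.90] * 8)]
advantages = np.array([1.0, -1.0])
print(grpo_group_loss(ratios, advantages, scheme="grpo"))
print(grpo_group_loss(ratios, advantages, scheme="dr_grpo"))
```

Under `"dr_grpo"`, every token's contribution is scaled identically, so the gradient magnitude no longer depends on whether a particular response happens to be long or short.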
Importantly, Dr. GRPO can be interpreted as a special case within a broader class of token weighting schemes, as clarified by the unified token preference framework in λ-GRPO (Wang et al., 8 Oct 2025):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} w_{i,t}\,\min\!\left(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}\!\left(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right],$$

where
- $w_{i,t} = \frac{1}{\bar{L}}$ for Dr. GRPO, and $\bar{L} = \frac{1}{G}\sum_{i=1}^{G}|o_i|$ is the group mean sequence length.
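As a quick sanity check on the unified view, the snippet below instantiates the per-token weight for the three fixed schemes on a toy group. The GRPO and DAPO expressions follow their standard formulations as described in this article; the variable names and the dictionary layout are illustrative.

```python
import numpy as np

lengths = np.array([4, 8, 12])           # toy group of G = 3 response lengths
mean_len = lengths.mean()                # group mean length L_bar = 8.0

weights = {
    "grpo":    1.0 / lengths,                    # per-response normalization 1/|o_i|
    "dapo":    np.full(3, 1.0 / lengths.sum()),  # one shared token-level normalizer
    "dr_grpo": np.full(3, 1.0 / mean_len),       # group-uniform scaling 1/L_bar
}

for name, w in weights.items():
    print(name, np.round(w, 4))
# GRPO's weights differ per response; DAPO and Dr. GRPO assign every token in
# the group the same weight and differ only by the constant factor G, matching
# the equivalence noted in Section 4.
```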
3. Critical Assessment and Unification in λ-GRPO
While Dr. GRPO addresses certain systematic biases of length normalization by removing per-sequence scaling, it remains a fixed (non-learned) heuristic. Empirical analyses in (Wang et al., 8 Oct 2025) indicate that all such strategies—vanilla GRPO, DAPO, and Dr. GRPO—lack flexibility and cannot adapt to varying task requirements or reward structures. Their fixed form may be suboptimal whenever task preference for output length is either context-dependent or subject to change as learning progresses.
Recognizing this, λ-GRPO generalizes all fixed schemes via a learnable token preference parameter $\lambda$. In this unified scheme:
- Token weights are computed as $w_{i,t} \propto \hat{\ell}_i^{\,\lambda}$, where $\hat{\ell}_i = |o_i|/\bar{L}$ is the normalized length of $o_i$ in the group (see the sketch after this list).
- $\lambda$ is optimized alongside model weights, so token-level weighting is learned rather than imposed.
- The classic heuristics become limiting cases: $\lambda = 0$ recovers DAPO, and suitable choices of $\lambda$ reproduce Dr. GRPO or vanilla GRPO.
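A minimal sketch of this parameterization, assuming the simple form described above ($w_{i,t} \propto \hat{\ell}_i^{\,\lambda}$ with renormalization over the group); the exact normalization in (Wang et al., 8 Oct 2025) may differ, and the function name and renormalization choice here are illustrative only.

```python
import numpy as np

def lambda_grpo_weights(lengths, lam):
    """Per-token weights from a token-preference parameter lambda (illustrative sketch).

    lengths: array of shape (G,), response lengths in one group
    lam:     scalar token-preference parameter (optimized jointly with the model)
    """
    lengths = np.asarray(lengths, dtype=float)
    norm_len = lengths / lengths.mean()      # normalized length of each response in the group
    pref = norm_len ** lam                   # raw per-response preference
    # Renormalize so the total token weight in the group is independent of lambda.
    weights = pref / (pref * lengths).sum()
    return weights                           # weight applied to every token of response i

lengths = [4, 8, 12]
print(np.round(lambda_grpo_weights(lengths, lam=0.0), 4))   # uniform -> DAPO/Dr. GRPO-like
print(np.round(lambda_grpo_weights(lengths, lam=-1.0), 4))  # proportional to 1/|o_i| -> GRPO-like
```

With $\lambda = 0$ the weights are uniform across the group (DAPO/Dr. GRPO-like behavior), and with $\lambda = -1$ they become proportional to $1/|o_i|$, recovering vanilla GRPO's per-response normalization.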
This suggests that, while Dr. GRPO corrects a concrete bias in GRPO, it is subsumed and often outperformed (in terms of stability and accuracy) by the adaptivity introduced by λ-GRPO's learnable parameter $\lambda$.
4. Empirical Observations: Dr. GRPO in Context
Experiments on mathematical reasoning benchmarks (e.g., GSM8K, MATH500, Minerva; Wang et al., 8 Oct 2025) using Qwen2.5 models (1.5B, 3B, 7B) show that:
- All token aggregation heuristics are dominated by λ-GRPO with learned $\lambda$.
- Gains are substantial (+1.0–1.9% accuracy) and robust across tasks and model sizes.
- λ-GRPO achieves this without additional data or computational overhead, and it avoids both entropy collapse and length inflation.
- Dr. GRPO is algebraically equivalent to DAPO up to a constant scaling factor, and is therefore not treated as a distinct baseline in ablation studies (see Table: Token Weighting Function and Method Comparison in (Wang et al., 8 Oct 2025)).
5. Practical Implications and Limitations
Heuristic schemes (Dr. GRPO, DAPO, GRPO) may be reasonable defaults, especially if resource constraints prevent re-optimization of token weights. However, for optimal performance and stability—especially in heterogeneous RLVR settings or cross-domain tasks—learning token preference adaptively through λ-GRPO is strictly superior. Dr. GRPO's elimination of per-sequence length normalization yields more uniform training signals but cannot respond to shifts in optimal length preference. Empirical and theoretical analyses indicate that the practical value of explicit, hand-tuned token aggregation schemes is limited in the presence of a learnable, context-sensitive parameter.
6. Summary Table: Token Weighting Schemes
| Method | Token Weighting | Preference | Adaptiveness |
|---|---|---|---|
| GRPO | $1/\lvert o_i\rvert$ | Penalize long output | Fixed, heuristic |
| DAPO | $1$ | Length-neutral | Fixed, heuristic |
| Dr. GRPO | $1/\bar{L}$ (constant in group) | Group-uniform | Fixed, heuristic |
| λ-GRPO | $\hat{\ell}_i^{\,\lambda}$ (normalized over group) | Learnable | Adaptive (learned $\lambda$) |
7. Conclusion
Dr. GRPO is a principled correction to sequence-length bias in GRPO-style reinforcement learning for LLMs, achieved by moving from per-response length normalization to group-based scaling. Nonetheless, it remains a fixed, heuristic rule and is outperformed by the general and adaptive token-weighting strategy of λ-GRPO, which unifies prior aggregation schemes within a single, learnable framework. In modern RLVR pipelines, learnable token preferences provide superior accuracy, exploration efficiency, and training stability without incurring additional computational cost (Wang et al., 8 Oct 2025).