The Impact of Negative Gradients on Likelihood in Group Relative Policy Optimization
The paper examines the effect of negative gradients in Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) algorithm widely used to improve reasoning in large language models (LLMs). GRPO underpins models such as DeepSeek-R1 and DeepSeek-Math and has produced marked gains in domains including mathematical and medical reasoning. Despite this success, the paper identifies a phenomenon it terms Lazy Likelihood Displacement (LLD): during training, the likelihood of correct responses increases only marginally or even declines, echoing issues attributed to negative gradients in Direct Preference Optimization (DPO).
Negative Gradients in GRPO
Negative gradients arise in RL when a sampled response's advantage is negative, meaning it scored worse than its baseline (in GRPO, the average reward of the other responses in its group), so the update reduces the probability of generating that response again. The mechanism parallels DPO, where the penalty on the less-preferred response likewise introduces a negative gradient. The paper shows that these negative gradients can inadvertently decrease the probability of correct responses as well, a failure mode related to what the preference-optimization literature calls misalignment.
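As a rough illustration (not the paper's exact objective, and omitting PPO-style clipping and any KL term), the sketch below shows how group-relative advantages give below-average responses a loss term that pushes down the log-probability of every one of their tokens. The function names and tensor shapes are assumptions for exposition.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of G sampled responses (GRPO-style baseline).

    rewards: shape (G,) -- scalar reward per response in the group.
    Responses below the group mean receive negative advantages, which become
    negative gradients on all of their tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_pg_loss(token_logps: torch.Tensor, mask: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Simplified policy-gradient surrogate for illustration only.

    token_logps: (G, T) log-probabilities of the sampled tokens under the current policy.
    mask:        (G, T) 1 for real tokens, 0 for padding.
    advantages:  (G,)   group-relative advantage per response.
    A negative advantage multiplies every token's log-prob, so minimizing this loss
    pushes those tokens' probabilities down -- the source of the negative gradient.
    """
    per_token = -advantages.unsqueeze(1) * token_logps * mask
    return per_token.sum() / mask.sum()

# Toy group: two correct responses (reward 1) and two incorrect ones (reward 0).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # incorrect responses get negative advantages
```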
LLD arises in GRPO because incorrect responses are penalized indiscriminately: every token in an incorrect response is pushed down with equal force, including tokens that are structurally or semantically similar to tokens in the correct responses, which drags those correct responses down with them. Empirical analysis on mathematical reasoning benchmarks shows that LLD is widespread, with the likelihood of correct responses rising far less than expected, and in some cases falling, over the course of training.
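One simple way to observe LLD, assuming access to policy snapshots before and after an update and a Hugging Face-style causal LM that returns `.logits`, is to track the log-likelihood each snapshot assigns to the group's correct responses; under LLD this quantity barely rises or even falls. The helper below is a hypothetical sketch, not code from the paper.

```python
import torch

@torch.no_grad()
def response_log_likelihood(model, input_ids: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Summed log-likelihood of the response tokens under `model`.

    input_ids:     (B, T) prompt + response token ids.
    response_mask: (B, T) 1 on response tokens, 0 on prompt/padding.
    """
    logits = model(input_ids).logits[:, :-1]                 # predict token t+1 from its prefix
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(dim=-1)

# Hypothetical LLD check: the likelihood of correct responses should rise after an
# update, but under LLD the delta is near zero or negative.
# delta = response_log_likelihood(model_after, ids, mask) - response_log_likelihood(model_before, ids, mask)
```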
Mitigation Strategy: NTHR
To mitigate LLD, the authors propose Negative Token Hidden Reward (NTHR), which exploits GRPO's group structure to penalize tokens selectively. Rather than applying a blanket penalty to every token of an incorrect response, NTHR uses the group's correct responses as anchors to identify which tokens in the incorrect responses are actually responsible for dragging down correct-response likelihood, and adjusts the penalty at the token level accordingly.
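The paper's exact NTHR score is not reproduced here; as a hedged sketch of the general idea, one proxy is to compare the hidden states of tokens in an incorrect response against those of the group's correct responses and skip the penalty on tokens that match too closely, since penalizing them is what pulls correct-response likelihood down. The similarity measure and threshold below are illustrative assumptions, not the paper's formulation.

```python
import torch

def nthr_style_penalty_mask(
    neg_hidden: torch.Tensor,   # (Tn, d) hidden states of tokens in an incorrect response
    pos_hidden: torch.Tensor,   # (Tp, d) hidden states of tokens in the group's correct responses
    threshold: float = 0.9,
) -> torch.Tensor:
    """Illustrative proxy for NTHR's token selection (not the paper's exact score).

    Tokens in the incorrect response whose hidden states closely resemble tokens from
    correct responses are the ones whose penalization is likely to cause LLD, so their
    negative-gradient contribution is masked out.
    Returns a (Tn,) mask: 1.0 = keep the penalty, 0.0 = skip it.
    """
    neg = torch.nn.functional.normalize(neg_hidden, dim=-1)
    pos = torch.nn.functional.normalize(pos_hidden, dim=-1)
    max_sim = (neg @ pos.T).max(dim=-1).values   # best cosine match against any correct-response token
    return (max_sim < threshold).float()

# Hypothetical usage: multiply the per-token loss of negative-advantage responses by this
# mask in the earlier surrogate, so only tokens deemed "safe" to penalize receive the penalty.
```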
NTHR substantially reduces LLD and improves model performance, with experiments showing consistent log-likelihood gains for correct responses across models ranging from 0.5B to 3B parameters. Shifting the penalty from the whole response to individual tokens enables a more granular correction of negative-gradient effects, potentially improving data efficiency and stabilizing what the model has already learned.
Theoretical Implications
The theoretical analysis of GRPO and LLD sheds light on the optimization dynamics of LLMs trained with RL. By shifting penalties onto the individual tokens most responsible for LLD, the work departs from conventional RL objectives such as PPO, which treat a sampled action sequence as a single unit rather than distinguishing tokens by their semantic or structural role.
Practical Implications and Future Directions
The proposed NTHR strategy is a practical approach to the token-level alignment issues introduced by negative gradients in LLM training. Its adoption could enable RL methods that better respect token interdependencies and contextual cues. Looking forward, further studies could extend NTHR to broader optimization settings, widening its applicability and its effectiveness at improving LLM reasoning across diverse datasets and tasks. In doing so, the paper sets a useful precedent for tailoring RL algorithms to the particular structure and constraints of LLMs, paving the way for advances in the robustness and fidelity of AI problem solving.