The Impact of Negative Gradients on Likelihood in Group Relative Policy Optimization
The paper examines the effect of negative gradients in Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) algorithm widely used to improve reasoning in large language models (LLMs). GRPO underpins models such as DeepSeek-R1 and DeepSeek-Math and has produced marked gains in domains including mathematical and medical reasoning. Despite this success, the paper identifies a phenomenon it terms Lazy Likelihood Displacement (LLD): during training, the likelihood of correct responses increases only marginally or even declines, echoing issues attributed to negative gradients in Direct Preference Optimization (DPO).
Negative Gradients in GRPO
Negative gradients arise in RL when a sampled response's advantage is negative, meaning it scored worse than its baseline (in GRPO, the average reward of the other responses in its group), so the update reduces the probability of generating that response again. The mechanism parallels DPO, where the penalty on the less-preferred response likewise introduces a negative gradient. The paper shows that these negative gradients can inadvertently decrease the probability of correct responses as well, a failure mode related to what the preference-optimization literature calls misalignment.
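As a rough illustration (not the paper's exact objective, and omitting PPO-style clipping and any KL term), the sketch below shows how group-relative advantages give below-average responses a loss term that pushes down the log-probability of every one of their tokens. The function names and tensor shapes are assumptions for exposition.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of G sampled responses (GRPO-style baseline).

    rewards: shape (G,) -- scalar reward per response in the group.
    Responses below the group mean receive negative advantages, which become
    negative gradients on all of their tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_pg_loss(token_logps: torch.Tensor, mask: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Simplified policy-gradient surrogate for illustration only.

    token_logps: (G, T) log-probabilities of the sampled tokens under the current policy.
    mask:        (G, T) 1 for real tokens, 0 for padding.
    advantages:  (G,)   group-relative advantage per response.
    A negative advantage multiplies every token's log-prob, so minimizing this loss
    pushes those tokens' probabilities down -- the source of the negative gradient.
    """
    per_token = -advantages.unsqueeze(1) * token_logps * mask
    return per_token.sum() / mask.sum()

# Toy group: two correct responses (reward 1) and two incorrect ones (reward 0).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # incorrect responses get negative advantages
```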
LLD arises in GRPO because incorrect responses are penalized indiscriminately: every token in an incorrect response is pushed down with equal force, including tokens that are structurally or semantically similar to tokens in the correct responses, which drags those correct responses down with them. Empirical analysis on mathematical reasoning benchmarks shows that LLD is widespread, with the likelihood of correct responses rising far less than expected, and in some cases falling, over the course of training.
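One simple way to observe LLD, assuming access to policy snapshots before and after an update and a Hugging Face-style causal LM that returns `.logits`, is to track the log-likelihood each snapshot assigns to the group's correct responses; under LLD this quantity barely rises or even falls. The helper below is a hypothetical sketch, not code from the paper.

```python
import torch

@torch.no_grad()
def response_log_likelihood(model, input_ids: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Summed log-likelihood of the response tokens under `model`.

    input_ids:     (B, T) prompt + response token ids.
    response_mask: (B, T) 1 on response tokens, 0 on prompt/padding.
    """
    logits = model(input_ids).logits[:, :-1]                 # predict token t+1 from its prefix
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(dim=-1)

# Hypothetical LLD check: the likelihood of correct responses should rise after an
# update, but under LLD the delta is near zero or negative.
# delta = response_log_likelihood(model_after, ids, mask) - response_log_likelihood(model_before, ids, mask)
```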
Mitigation Strategy: NTHR
To mitigate LLD, the authors propose Negative Token Hidden Reward (NTHR), which exploits GRPO's group structure to penalize tokens selectively. Rather than applying a blanket penalty to every token of an incorrect response, NTHR uses the group's correct responses as anchors to identify which tokens in the incorrect responses are actually responsible for dragging down correct-response likelihood, and adjusts the penalty at the token level accordingly.
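The paper's exact NTHR score is not reproduced here; as a hedged sketch of the general idea, one proxy is to compare the hidden states of tokens in an incorrect response against those of the group's correct responses and skip the penalty on tokens that match too closely, since penalizing them is what pulls correct-response likelihood down. The similarity measure and threshold below are illustrative assumptions, not the paper's formulation.

```python
import torch

def nthr_style_penalty_mask(
    neg_hidden: torch.Tensor,   # (Tn, d) hidden states of tokens in an incorrect response
    pos_hidden: torch.Tensor,   # (Tp, d) hidden states of tokens in the group's correct responses
    threshold: float = 0.9,
) -> torch.Tensor:
    """Illustrative proxy for NTHR's token selection (not the paper's exact score).

    Tokens in the incorrect response whose hidden states closely resemble tokens from
    correct responses are the ones whose penalization is likely to cause LLD, so their
    negative-gradient contribution is masked out.
    Returns a (Tn,) mask: 1.0 = keep the penalty, 0.0 = skip it.
    """
    neg = torch.nn.functional.normalize(neg_hidden, dim=-1)
    pos = torch.nn.functional.normalize(pos_hidden, dim=-1)
    max_sim = (neg @ pos.T).max(dim=-1).values   # best cosine match against any correct-response token
    return (max_sim < threshold).float()

# Hypothetical usage: multiply the per-token loss of negative-advantage responses by this
# mask in the earlier surrogate, so only tokens deemed "safe" to penalize receive the penalty.
```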
NTHR substantially reduces LLD and improves model performance, with experiments showing consistent log-likelihood gains for correct responses across models ranging from 0.5B to 3B parameters. Shifting the penalty from the whole response to individual tokens enables a more granular correction of negative-gradient effects, potentially improving data efficiency and stabilizing what the model has already learned.
Theoretical Implications
The theoretical analysis of GRPO and LLD sheds light on the optimization dynamics of LLMs trained with RL. By shifting penalties onto the individual tokens most responsible for LLD, the work departs from conventional RL objectives such as PPO, which treat a sampled action sequence as a single unit rather than distinguishing tokens by their semantic or structural role.
Practical Implications and Future Directions
The proposed NTHR strategy is a practical approach to the token-level alignment issues introduced by negative gradients in LLM training. Its adoption could enable RL methods that better respect token interdependencies and contextual cues. Looking forward, further studies could extend NTHR to broader optimization settings, widening its applicability and its effectiveness at improving LLM reasoning across diverse datasets and tasks. In doing so, the paper sets a useful precedent for tailoring RL algorithms to the particular structure and constraints of LLMs, paving the way for advances in the robustness and fidelity of AI problem solving.