- The paper introduces a reward dithering technique that mitigates gradient vanishing and explosion in LLM training with discrete rewards.
- It integrates noise injection into GRPO without altering the core optimization loop, delivering up to a +4.5-point improvement on benchmarks.
- The study offers theoretical guarantees and empirical evidence that ReDit enhances training stability and significantly reduces convergence time.
ReDit: Reward Dithering for Improved LLM Policy Optimization
The paper introduces ReDit, a reward dithering technique aimed at addressing the optimization pathologies encountered when training LLMs using discrete, rule-based rewards. The focus is on LLM policy optimization scenarios, particularly those leveraging Group Relative Policy Optimization (GRPO), where reward signals are deterministic and typically binary (e.g., correct/incorrect answers in mathematics and coding tasks). While such rule-based rewards help mitigate reward hacking and allow for straightforward reward specification, they pose significant challenges to gradient-based reinforcement learning, including gradient vanishing, gradient explosion, and slow convergence.
Discrete Rewards and Optimization Instability
Under the discrete reward regime, LLMs often receive sparse and abrupt feedback—either full credit or none—resulting in:
- Gradient vanishing: Mini-batches dominated by zero rewards (common during early training) yield negligible policy gradients, stalling learning and limiting policy exploration.
- Gradient explosion: Rare transitions from zero to one in the reward function lead to large policy gradient updates, destabilizing the optimization trajectory.
Empirical evidence in the paper clearly demonstrates that standard GRPO training exhibits sharp oscillations in gradient norms, with both vanishing and exploding gradients, resulting in erratic learning curves and slower convergence, particularly on complex reasoning tasks such as GSM8K and MATH.
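To make the vanishing case concrete, here is a small numeric illustration (ours, not the paper's) of group-relative normalization as used in GRPO: a group whose outputs all receive the same discrete reward yields zero advantages for every member, while a single rare success produces an abrupt, large advantage.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: normalize rewards within the sampled group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([0, 0, 0, 0]))  # [0, 0, 0, 0] -> no gradient signal (vanishing)
print(group_advantages([0, 0, 0, 1]))  # [-0.58, -0.58, -0.58, 1.73] -> abrupt, large update
```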
Reward Dithering: Smoothing the Optimization Landscape
ReDit proposes to inject controlled, zero-mean random noise into each discrete reward signal at training time. By transforming a strictly binary signal into a continuous, noisy one, the reward distribution within each mini-batch gains increased variance, even when the underlying task reward is sparse. The process is simple and does not interfere with the reward function's logic beyond this additive perturbation. Both Gaussian and uniform noise distributions are studied.
Algorithmic Integration
The application of ReDit does not require changes to the backbone policy optimization loop or the architectural foundation of GRPO:
```python
import numpy as np

def reward_with_noise(discrete_reward, stddev=0.05):
    # ReDit: perturb the discrete reward with zero-mean Gaussian noise.
    noise = np.random.normal(loc=0.0, scale=stddev)
    return discrete_reward + noise
```
Within a typical policy optimization step:
- For each sampled output, compute the original discrete reward.
- Add independent noise to each reward value to produce smoothed rewards.
- Proceed with batch-relative normalization and advantage calculation as in GRPO.
- Update the policy using the perturbed advantage estimates.
This intervention softens hard reward boundaries, ensuring that the mini-batch statistics (means and variances) used for advantage computation remain informative and dynamic throughout training, as sketched below.
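A minimal sketch of that step, reusing the `reward_with_noise` idea above and assuming Gaussian noise plus a small `eps` to stabilize the division (both illustrative choices, not prescribed by the paper):

```python
import numpy as np

def redit_advantages(discrete_rewards, stddev=0.05, eps=1e-8):
    # 1. Rule-based, discrete rewards for one sampled group of outputs.
    r = np.asarray(discrete_rewards, dtype=float)
    # 2. Add independent zero-mean Gaussian noise to each reward (ReDit).
    r_noisy = r + np.random.normal(loc=0.0, scale=stddev, size=r.shape)
    # 3. Batch-relative normalization and advantage calculation, as in GRPO.
    return (r_noisy - r_noisy.mean()) / (r_noisy.std() + eps)

# Even an all-zero-reward group now yields non-zero advantages that keep the update moving.
advantages = redit_advantages([0, 0, 0, 0])
```

These perturbed advantages then drive the otherwise unchanged GRPO policy update.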
Theoretical Guarantees
The authors present several theoretical contributions substantiating ReDit's effects:
- Unbiased gradients: The expectation of the noisy policy gradient estimates is equal to that of the original objective—policy optimization is, in expectation, unchanged.
- Increased gradient variance: The variance of the policy gradient estimator grows with the noise magnitude, which theoretically and empirically accelerates exploration and helps the policy escape flat-gradient regions.
- Convergence trade-off: Strategic introduction of variance via dithering reduces the lower bound on optimization time but comes at the expense of reward signal fidelity. This trade-off is made explicit, with bounds derived for time-to-convergence under varying noise.
The theoretical framework connects ReDit’s benefits directly to fundamental challenges in policy optimization, extending results from recent analyses on reward model variance versus accuracy.
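As a simplified illustration of the unbiasedness claim, consider a REINFORCE-style estimator (a stand-in for the paper's full GRPO analysis): because the noise epsilon is zero-mean and drawn independently of the sampled trajectory, the expected gradient is unchanged.

```latex
\mathbb{E}_{\tau,\epsilon}\bigl[(r(\tau)+\epsilon)\,\nabla_\theta \log \pi_\theta(\tau)\bigr]
  = \mathbb{E}_{\tau}\bigl[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\bigr]
  + \underbrace{\mathbb{E}[\epsilon]}_{=\,0}\;\mathbb{E}_{\tau}\bigl[\nabla_\theta \log \pi_\theta(\tau)\bigr]
  = \mathbb{E}_{\tau}\bigl[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\bigr].
```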
Empirical Results
Extensive experiments corroborate these claims across diverse datasets (GSM8K, MATH, Geometry3K) and LLM architectures (Qwen2.5, Llama-3, Mistral-7B, Ministral-8B):
- With ReDit, models achieve comparable or better final performance after 1,000 training steps relative to baseline GRPO trained for 9,000 steps across all tasks.
- The peak accuracy improvements are consistent (e.g., up to +4.5 points on the MATH benchmark), and convergence speed is markedly increased.
- Gradient instability, as reflected in vanishing and exploding metrics, is effectively suppressed when using ReDit.
- The benefits generalize across other RL fine-tuning baselines (DAPO, Dr.GRPO, REINFORCE++), and across various policy models.
- Smoothing continuous rewards (from learned or preference-trained reward models) yields negligible benefit, indicating that ReDit's value is specific to discrete, sparse rewards.
Example Results Table
| Task | Baseline GRPO | GRPO + ReDit (Gauss) | Improvement |
|---|---|---|---|
| GSM8K | 89.07 | 90.76 | +1.69 |
| MATH | 48.01 | 52.55 | +4.54 |
| Geometry3K | 43.10 | 44.67 | +1.57 |
Hyperparameter and Practical Considerations
- Noise variance schedules: The optimal perturbation magnitude is task- and model-dependent. Excessive smoothing degrades performance by masking genuine reward signals; insufficient smoothing fails to address the gradient pathologies. Cyclic or scheduled noise injection (e.g., cosine scheduling) can further enhance convergence and stability; see the sketch after this list.
- Implementation simplicity: Only the reward function needs to be modified, making ReDit integrable into existing RL pipelines with minimal code changes.
- Computational efficiency: Because the benefits manifest as a large reduction in required training steps, ReDit also lowers total compute, especially in large-scale LLM fine-tuning scenarios.
- Resource requirements: The additional computational cost is negligible, limited to the noise generation step.
- Generalizability: While highly effective for tasks with sparse/discrete reward structures, ReDit provides no gains when continuous, dense reward models are used.
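For the scheduled noise injection mentioned in the first bullet, a minimal sketch of a cosine-decayed noise standard deviation (the function name, parameters, and default values are illustrative, not the paper's):

```python
import math

def cosine_noise_std(step, total_steps, std_max=0.05, std_min=0.005):
    # Decay the dithering standard deviation from std_max to std_min
    # along a half-cosine curve over the course of training.
    progress = min(step / max(total_steps, 1), 1.0)
    return std_min + 0.5 * (std_max - std_min) * (1.0 + math.cos(math.pi * progress))

# Example: pass the scheduled stddev into reward_with_noise at each training step.
```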
Implications and Future Directions
Practically, ReDit offers an efficient, theoretically grounded mechanism to accelerate LLM policy optimization in domains where interpretable, hack-resistant, yet discrete reward functions are preferable (e.g., math, code, logic tasks, or settings with clear correctness criteria). This approach lets practitioners harness the benefits of rule-based rewards without incurring the gradient instabilities and slow convergence that previously forced reliance on complex and often biased learned reward models.
On the theoretical front, the work contributes to the conversation about reward model optimality, suggesting that high-fidelity, low-variance reward functions are not universally optimal for RL fine-tuning, and a calibrated trade-off—achieved via explicit variance injection—can be preferable in many scenarios.
Future research will likely focus on:
- Automated noise scheduling: Techniques to adaptively set or learn optimal dithering schedules per task, batch, or model state.
- Broader RL settings: Extensions to multi-modal or multi-agent RL scenarios where similar optimization pathologies exist in the presence of discrete reward signals.
- Synergy with other exploration methods: Integration with reward shaping, intrinsic motivation, and other variance-driven RL methods for further gains.
In summary, ReDit provides a simple, efficient mechanism to address foundational optimization challenges in reinforcement learning with discrete rewards, facilitating better, faster, and more robust policy optimization for LLMs. Its empirical and theoretical analysis underscores the value of revisiting the bias-variance paradigm in reward design for contemporary LLMs.