Reward Shaping to Mitigate Reward Hacking in RLHF

Published 26 Feb 2025 in cs.LG, cs.AI, and cs.CL | (2502.18770v3)

Abstract: Reinforcement Learning from Human Feedback (RLHF) is essential for aligning LLMs with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR. Work done during Jiayi Fu's internship at StepFun.

Authors (6)

Summary

The paper introduces the Preference As Reward (PAR) method that reshapes RL rewards by centering and bounding to effectively mitigate reward hacking.
It employs a sigmoid function on centered rewards to achieve rapid initial learning and stable convergence, validated on models like Gemma2-2B and Llama3-8B.
Experimental results show that PAR outperforms traditional reward shaping alternatives, ensuring robust alignment on benchmarks such as AlpacaEval 2.0.

Introduction

Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning large language models (LLMs) with human values, yet it is prone to reward hacking. Reward hacking arises when an agent exploits flaws in the reward function instead of learning the intended behavior, which degrades alignment. This paper studies the prevalent reward shaping methods for addressing this issue and introduces the Preference As Reward (PAR) approach, which uses the latent preferences embedded in the reward model as the reinforcement learning signal. Evaluations on Gemma2-2B and Llama3-8B show that PAR mitigates reward hacking and outperforms other reward shaping methods on benchmarks like AlpacaEval 2.0, while remaining robust and data efficient.

Design Principles and PAR Method

The analysis yields two key design principles for effective reward shaping: the RL reward should be bounded, and it should grow rapidly at first and then converge gradually. PAR additionally formulates the RL reward as a function of the centered reward, that is, the difference between the proxy reward and a reference reward. Guided by these principles, PAR applies a sigmoid function to the centered reward. The sigmoid is bounded, rises quickly near zero, and saturates for large inputs, which supports rapid learning followed by stable convergence. The resulting RL reward can be interpreted as a preference score: how strongly the reward model prefers the policy response over the reference response.
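The shaping step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `sigmoid` and `par_reward` are hypothetical, the scores are made up, and averaging over several reference rewards is an assumption (the abstract only states that a single reference reward suffices).

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, bounded between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def par_reward(proxy_reward: float, reference_rewards: list[float]) -> float:
    """Sigmoid of the centered reward (proxy minus reference).

    Assumption: when several reference rewards are given, the shaped
    values are averaged. With one reference this is just sigmoid(r - r_ref).
    """
    if not reference_rewards:
        raise ValueError("need at least one reference reward")
    shaped = [sigmoid(proxy_reward - ref) for ref in reference_rewards]
    return sum(shaped) / len(shaped)


# Hypothetical reward-model scores, for illustration only.
print(par_reward(1.0, [1.0]))    # policy ties with the reference: 0.5
print(par_reward(3.0, [1.0]))    # policy preferred: above 0.5
print(par_reward(50.0, [1.0]))   # a very large proxy reward still cannot exceed 1.0
```

Because the output never exceeds 1, a policy that inflates the proxy reward far beyond the reference gains almost nothing extra, which is the intuition behind bounding the RL reward.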

Figure 1: RLHF training pipeline with reward shaping. Responses from the policy model are evaluated by the reward model, producing proxy rewards. These rewards are then reshaped before being used to update the policy via RL.

Evaluation and Results

Experiments highlight PAR's superior effectiveness in reward shaping compared to alternatives like WARM and Minmax while proving robust against reward hacking over extended training periods.

Experimental Setup

Two base models, Gemma2-2B and Llama3-8B, were trained with PPO on two datasets, Ultrafeedback-Binarized and HH-RLHF. Training was tracked with proxy reward and win rate curves.

Principles Validation

The design principles are validated empirically by testing various sigmoid-like functions. The results establish rapid initial growth and bounded rewards as critical factors in successful RL training.
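One way to read "rapid initial growth followed by gradual convergence" for the sigmoid is through its slope, which is largest at a centered reward of zero and shrinks as the centered reward grows. The sketch below is illustrative only: the helper names are hypothetical, and treating the start of training as the point where the centered reward is near zero is an assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x: float) -> float:
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Slope at several hypothetical centered rewards (proxy minus reference).
for centered in (0.0, 1.0, 3.0, 6.0):
    print(f"centered={centered:3.1f}  "
          f"shaped={sigmoid(centered):.4f}  "
          f"slope={sigmoid_slope(centered):.4f}")
```

The slope peaks at 0.25 when the centered reward is zero and decays toward zero afterwards, so early improvements over the reference are rewarded strongly while later gains in proxy reward contribute less and less.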

Figure 2: Loss curves from PPO training show that PAR exhibits greater stability, particularly in critic loss, compared to Vanilla training. This stability is attributed to PAR's bounded RL reward.

Benchmark Performance

PAR consistently ranked among the best-performing methods on AlpacaEval 2.0 and MT-Bench, which supports the view that better reward shaping improves alignment without degrading peak performance.

Figure 3: PPO training curves over two epochs. ceil5.0 indicates that $r_{\text{RL}}$ is bounded. The improved training stability reinforces the need for bounded rewards.

Discussion

The paper notes that reward shaping approaches like PAR do not apply to DPO, which has no explicit reward model. For GRPO, linear transformations of the reward have no effect because they leave the advantage calculation unchanged. Non-linear methods such as PAR, however, remain effective.
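The GRPO point can be checked numerically. The sketch below assumes the commonly described group-normalized advantage, (reward minus group mean) divided by group standard deviation, in a simplified form. The function name `group_advantages`, the reward values, and the reference reward of 1.0 are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float]) -> list[float]:
    """Simplified GRPO-style advantage: standardize rewards within a group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sd for r in rewards]


def same(a: list[float], b: list[float]) -> bool:
    """True when two advantage lists agree up to floating-point error."""
    return all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-9) for x, y in zip(a, b))


# Hypothetical proxy rewards for one group of sampled responses.
rewards = [0.5, 1.0, 2.0, 4.0]

# Linear shaping with a positive scale: 3 * r + 2.
linear = [3.0 * r + 2.0 for r in rewards]

# Sigmoid shaping of the centered reward, with an assumed reference reward of 1.0.
nonlinear = [1.0 / (1.0 + math.exp(-(r - 1.0))) for r in rewards]

base = group_advantages(rewards)
print(same(base, group_advantages(linear)))     # True: shift and scale cancel out
print(same(base, group_advantages(nonlinear)))  # False: the sigmoid changes the advantages
```

Subtracting the group mean removes any constant shift, and dividing by the group standard deviation removes any positive scale, so only a non-linear transform can change the relative advantages within a group.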

Conclusion

The research shows that effective reward shaping in RLHF depends on bounded and centered rewards. Among the methods compared, PAR is the most effective at mitigating reward hacking.

Figure 4: Calibration between the hidden preference score given by the reward model and the win rate for different mitigation methods, illustrating PAR's alignment stability.

In sum, the paper offers a principled approach to reward shaping for RLHF, with implications for future work on AI alignment techniques.
