Post-hoc Reward Calibration: Addressing Length Bias in Reinforcement Learning from Human Feedback
The paper "Post-hoc Reward Calibration: A Case Study on Length Bias" explores the intricacies of mitigating biases present in Reward Models (RMs) employed in Reinforcement Learning from Human Feedback (RLHF). As LLMs increasingly align with human preferences, RMs are instrumental in translating qualitative feedback into quantitative training signals. However, RMs frequently develop biases, such as favoring longer outputs, which can result in misleading model performance evaluations and suboptimal LLM behavior.
Key Contributions
- Bias Identification and Correction: The paper introduces a methodology termed Post-hoc Reward Calibration that corrects RM biases without requiring additional data or model retraining. The primary focus is length bias, which has been shown to significantly degrade RM accuracy.
- Methodology:
- Decomposition and Estimation: The approach decomposes each observed reward into a true reward and a bias component. A Locally Weighted Regression (LWR) method estimates the bias component, which can then be subtracted from the raw reward before outputs are evaluated (see the sketch after this list).
- Implementation Across Multiple Settings: The effectiveness of the approach is validated through three distinct experimental settings: RM benchmark performance on the RewardBench dataset, RM evaluations as LLM judges, and RM use in LLM alignment.
- Empirical Results:
- Demonstrated an average performance improvement of 3.11 points across 33 RMs on the RewardBench benchmark.
- Enhanced alignment with GPT-4 and human evaluations on the AlpacaEval benchmark.
- Significant reductions in length bias and performance enhancements in LLM alignment tasks, as measured by the Length-Controlled win rate.
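The calibration idea described in the Methodology bullets can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's exact procedure: it assumes raw rewards and response lengths are available as NumPy arrays, uses statsmodels' LOWESS routine as the locally weighted regression, and the function name `calibrate_rewards` and the smoothing fraction `frac=0.3` are illustrative choices, not values taken from the paper.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def calibrate_rewards(rewards: np.ndarray, lengths: np.ndarray, frac: float = 0.3) -> np.ndarray:
    """Subtract a locally estimated length-bias component from raw RM rewards.

    rewards: raw scalar rewards the RM assigned to a batch of responses.
    lengths: corresponding response lengths (e.g., token counts).
    frac:    LOWESS smoothing fraction (illustrative hyperparameter).
    """
    # Fit a smooth curve of reward as a function of length; the fitted values
    # serve as the estimated bias component b_hat(length) for each response.
    bias_hat = lowess(rewards, lengths, frac=frac, return_sorted=False)
    # Calibrated reward = raw reward minus the estimated length bias.
    return rewards - bias_hat


if __name__ == "__main__":
    # Synthetic demonstration: inject an artificial length bias, then remove it.
    rng = np.random.default_rng(0)
    lengths = rng.integers(20, 400, size=500).astype(float)
    true_quality = rng.normal(size=500)          # latent "true" reward
    raw_rewards = true_quality + 0.01 * lengths  # reward contaminated by length
    calibrated = calibrate_rewards(raw_rewards, lengths)
    # Correlation with length should drop markedly after calibration.
    print(np.corrcoef(raw_rewards, lengths)[0, 1],
          np.corrcoef(calibrated, lengths)[0, 1])
```

The smoothing fraction controls how locally the bias curve tracks the data: too small a value risks absorbing genuine quality differences into the estimated bias, while too large a value under-fits the length trend.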
Implications
The practical implications of this research extend to improving bias correction within RLHF pipelines, enhancing both the reliability and robustness of LLMs. The calibration technique is computationally efficient, adaptable to different biases, and scalable across RM types and configurations. By mitigating reward hacking, it improves the quality of LLM outputs and yields evaluations that more accurately reflect true task performance.
Theoretical Insights
Theoretically, the paper offers a framework for understanding and addressing biases in reward signals that extends beyond length bias to other potential sources, such as stylistic preferences. Because LWR adapts to local variations in the data, the method is positioned to generalize to biases arising from the diverse characteristics of training data.
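As a hedged sketch of that framework, the decomposition and calibration can be written as follows; the notation (r, r*, b, ℓ) is chosen for this summary and may differ from the paper's own symbols.

```latex
% Observed reward decomposes into a true reward plus a length-dependent bias term
r(x, y) = r^{*}(x, y) + b\bigl(\ell(y)\bigr)

% Post-hoc calibration subtracts a locally weighted regression estimate of that bias
\hat{r}(x, y) = r(x, y) - \hat{b}\bigl(\ell(y)\bigr),
\qquad \hat{b} \approx \mathrm{LWR}\bigl(\ell \mapsto r\bigr)
```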
Future Directions
The paper paves the way for calibrating RMs against a broader range of biases and output characteristics. Future work may refine the calibration method to improve its adaptability and efficacy across datasets and model types, ultimately contributing to more nuanced and bias-resistant AI systems.
Overall, the research presents a robust and scalable solution for one of the pivotal challenges in aligning LLMs with human feedback, marking a significant advance in the ongoing efforts to optimize AI-human interactions.