Bayesian Reward Models for LLM Alignment: Mitigating Reward Overoptimization through Uncertainty Quantification
Introduction
Aligning LLMs with human preferences is a critical yet challenging objective in generative AI. The standard approach trains a reward model on human preference data and then uses it either to select the best response from a set of candidates or to fine-tune the LLM policy via reinforcement learning from human feedback (RLHF). Although effective, these strategies risk reward overoptimization: the LLM exploits imperfections in the reward model to produce responses that score highly under the proxy reward yet do not genuinely align with human preferences, especially as responses drift out of distribution (OOD). This paper addresses the problem by proposing Bayesian reward models, specifically Laplace-LoRA, which improve the reliability of reward estimates through uncertainty quantification.
The Issue with Overoptimization
Reward models are trained on finite human preference datasets and therefore contain inaccuracies that best-of-n (BoN) sampling or RLHF can exploit, a failure mode known as reward overoptimization or reward hacking. It typically manifests as the policy generating responses with artificially inflated proxy rewards that do not reflect true human preferences, and it is exacerbated in OOD regions where the reward model's training data is sparse. The paper stresses that overcoming this challenge is essential to prevent performance degradation and safety issues in practical deployments of LLMs.
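To make the failure mode concrete, the following minimal sketch shows BoN selection against a proxy reward model; generate_candidates and proxy_reward are hypothetical stand-ins for a policy sampler and a learned reward model, not components from the paper.

```python
# Minimal sketch of best-of-n (BoN) sampling against a proxy reward model.
# Both functions below are illustrative placeholders.

import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    # Placeholder: in practice, sample n completions from the policy LLM.
    return [f"{prompt} [completion {i}]" for i in range(n)]

def proxy_reward(response: str) -> float:
    # Placeholder: in practice, score with the learned reward model.
    # A noisy proxy can assign spuriously high scores to OOD responses.
    return random.gauss(0.0, 1.0)

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = generate_candidates(prompt, n)
    # The selected response maximizes the *proxy* reward, not the true one,
    # which is exactly where overoptimization creeps in as n grows.
    return max(candidates, key=proxy_reward)

print(best_of_n("Explain reward hacking in one sentence.", n=8))
```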
Bayesian Techniques for Uncertainty Estimation
The core contribution of this paper is the use of Bayesian deep learning to address overoptimization. The methodology builds on Bayesian Low-Rank Adaptation, or Laplace-LoRA, which applies a Laplace approximation to the LoRA adapter weights and thereby equips the reward model with uncertainty estimates over its predictions. This guards against overconfidence and mitigates overoptimization, particularly when the model encounters OOD inputs. As elucidated in Yang et al. (2024), Laplace-LoRA offers a scalable, parameter-efficient route to uncertainty quantification, enhancing LLM robustness and safety.
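As a rough illustration of the underlying idea (not the full Laplace-LoRA recipe, which places the Gaussian over the LoRA adapter weights of the LLM), the toy sketch below fits a Laplace posterior over a linear reward head on fixed features and shows that the predictive variance grows for inputs far from the training data; all names and numbers are illustrative.

```python
# Toy Laplace approximation over a linear reward head on fixed features phi(x),
# assuming a frozen backbone. Illustrative only.

import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200
Phi = rng.normal(size=(n, d))                 # features of training responses
w_true = rng.normal(size=d)
y = Phi @ w_true + 0.1 * rng.normal(size=n)   # scalar rewards

prior_prec, noise_prec = 1.0, 100.0
# MAP estimate and the Hessian of the negative log posterior (closed form here).
H = noise_prec * Phi.T @ Phi + prior_prec * np.eye(d)
w_map = noise_prec * np.linalg.solve(H, Phi.T @ y)
Sigma = np.linalg.inv(H)                      # Laplace posterior covariance

def reward_posterior(phi):
    """Gaussian predictive over the reward for a response with features phi."""
    mean = phi @ w_map
    var = phi @ Sigma @ phi + 1.0 / noise_prec
    return mean, var

# In-distribution vs. far-from-data features: predictive variance grows OOD.
phi_id = Phi[0]
phi_ood = 10.0 * rng.normal(size=d)
print(reward_posterior(phi_id), reward_posterior(phi_ood))
```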
Methodology
The paper integrates uncertainty quantification into reward modeling through Laplace-LoRA. The approach yields a Gaussian distribution over the reward output for each response, making the uncertainty of every prediction explicit. These uncertainty estimates are then folded back into the reward via either a standard-deviation-based or a variance-based penalty, so that predictions with high uncertainty are discounted accordingly. The result is a more conservative and reliable reward signal that curbs reward exploitation, particularly in OOD scenarios, as sketched in the example below.
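A minimal sketch of such an uncertainty-penalized reward follows, assuming the Bayesian reward model returns a mean and variance per response; the coefficient k and the function names are illustrative choices rather than the paper's exact interface.

```python
# Uncertainty-penalized reward: discount the predicted mean by a multiple of
# the standard deviation or the variance. Names and values are illustrative.

import math

def penalized_reward(mean: float, var: float, k: float = 1.0,
                     penalty: str = "std") -> float:
    if penalty == "std":
        return mean - k * math.sqrt(var)   # standard-deviation-based penalty
    if penalty == "var":
        return mean - k * var              # variance-based penalty
    raise ValueError(f"unknown penalty: {penalty}")

# BoN selection then maximizes the penalized reward, so a candidate with a
# high but uncertain proxy reward is no longer automatically preferred.
candidates = [(1.2, 0.05), (2.5, 4.0), (0.9, 0.01)]   # (mean, var) pairs
best = max(candidates, key=lambda mv: penalized_reward(*mv, k=1.0, penalty="std"))
print(best)
```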
Empirical Validation
Through a series of experiments comparing proxy and gold-standard reward models across varying levels of KL divergence, the paper empirically demonstrates the efficacy of incorporating uncertainty penalties into reward estimation. The results show that Laplace-LoRA significantly mitigates reward overoptimization in BoN sampling, underscoring the method's practical viability and effectiveness. Notably, the variance-based penalty performs slightly better at lower KL divergence.
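For context, prior work on BoN overoptimization commonly measures the KL divergence between the best-of-n policy and the base policy with the analytic expression KL(n) = log n - (n - 1)/n; the short snippet below simply tabulates it to show how slowly KL grows with n (this formula is drawn from that prior literature, not from the paper summarized here).

```python
# Analytic KL divergence of a best-of-n policy from the base policy,
# as commonly used in prior work on BoN overoptimization.

import math

def bon_kl(n: int) -> float:
    return math.log(n) - (n - 1) / n

for n in (1, 2, 4, 16, 64, 256):
    print(f"n={n:4d}  KL ~ {bon_kl(n):.3f} nats")
```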
Conclusion and Future Perspectives
The adoption of Bayesian reward models, epitomized by the Laplace-LoRA technique, marks a significant advance in the quest for aligning LLMs with human preferences while mitigating reward overoptimization. This paper not only elucidates a critical challenge in the field but also proposes a robust, theoretically underpinned, and empirically validated solution. Looking forward, it opens avenues for further exploration in enhancing the safety and reliability of LLMs, potentially influencing future developments in generative AI through a nuanced understanding and application of Bayesian methods for uncertainty quantification.