Bayesian Reward Models for LLM Alignment (2402.13210v2)

Published 20 Feb 2024 in cs.LG

Abstract: To ensure that LLM responses are helpful and non-toxic, a reward model trained on human preference data is usually used. LLM responses with high rewards are then selected through best-of-$n$ (BoN) sampling or the LLM is further optimized to produce responses with high rewards through reinforcement learning from human feedback (RLHF). However, these processes are susceptible to reward overoptimization or 'hacking', where responses receive high rewards due to imperfections in the reward model rather than true preference, particularly as prompts or responses deviate from the training data. To address these challenges, we propose to train a Bayesian reward model, which signals higher uncertainty further from the training data distribution. We trained Bayesian reward models using Laplace approximation on LoRA weights, and found that the resulting uncertainty estimates can effectively mitigate reward overoptimization in BoN sampling.

Bayesian Reward Models for LLM Alignment: Mitigating Reward Overoptimization through Uncertainty Quantification

Introduction

In generative AI, aligning LLMs with human preferences is a critical but challenging objective. The standard approach trains a reward model on human preference data and then uses it either to select the highest-reward response from a set of candidates via best-of-n (BoN) sampling or to fine-tune the LLM policy through reinforcement learning from human feedback (RLHF). Although effective, these strategies are vulnerable to reward overoptimization, in which the LLM exploits imperfections in the reward model to obtain high rewards for responses that do not genuinely reflect human preferences, especially as responses drift out of distribution (OOD) relative to the reward model's training data. This paper addresses the problem by training Bayesian reward models with Laplace-LoRA, making reward estimates more reliable through uncertainty quantification.

The Issue with Overoptimization

Reward models are trained on finite human preference datasets and therefore contain inaccuracies that BoN sampling or RLHF can exploit. The result is reward overoptimization, or hacking: the policy produces responses with artificially inflated rewards that do not match true human preferences, a failure mode that is most severe in OOD regions where the reward model's training data is sparse. The paper stresses that this challenge must be addressed to avoid performance degradation and safety issues when LLMs are deployed in practice.

Bayesian Techniques for Uncertainty Estimation

The core contribution of the paper is to use Bayesian deep learning to counter overoptimization. The method builds on Bayesian Low-Rank Adaptation, or Laplace-LoRA, which applies a Laplace approximation to the LoRA weights of a fine-tuned LLM and thereby equips the reward model with uncertainty estimates over its outputs. These estimates guard against overconfidence and help mitigate overoptimization, particularly when the model encounters OOD data. As described in Yang et al. (2024), Laplace-LoRA offers a scalable, parameter-efficient route to improving LLM robustness and safety.
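
To make the idea concrete, the sketch below shows how a Laplace approximation over a linearised reward head turns a point estimate into a Gaussian over rewards. This is an illustration under simplifying assumptions rather than the paper's implementation; the names `phi`, `w_map`, `H`, and `prior_precision` are introduced here only for the example.

```python
import numpy as np

def laplace_reward_predictive(phi, w_map, H, prior_precision=1.0):
    """Gaussian predictive for a linearised scalar reward head.

    Laplace approximation: the posterior over the (low-rank) weights is
    N(w_map, Sigma) with Sigma = (H + prior_precision * I)^{-1}, where H is
    the Hessian of the training loss at the MAP estimate w_map.

    For a linearised reward r(x) = phi(x)^T w, the predictive is Gaussian:
        mean = phi^T w_map,   var = phi^T Sigma phi.
    """
    precision = H + prior_precision * np.eye(len(w_map))
    sigma = np.linalg.inv(precision)   # approximate posterior covariance
    mean = float(phi @ w_map)          # MAP (point-estimate) reward
    var = float(phi @ sigma @ phi)     # predictive variance; grows off-distribution
    return mean, var
```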

Methodology

The paper integrates uncertainty quantification into reward modeling through Laplace-LoRA, which yields an approximately Gaussian distribution over the reward output for each prompt-response pair. The predictive uncertainty is then folded back into the reward via either a standard-deviation-based or a variance-based penalty, so that a response's reward is discounted in proportion to how uncertain the model is about it. This penalization directly counteracts the tendency to exploit spuriously high rewards in OOD regions, as sketched below.
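
A minimal sketch of the two penalty variants follows; the trade-off coefficient `k` and the function name are illustrative assumptions, not values or names taken from the paper.

```python
import numpy as np

def penalized_reward(mean, var, k=1.0, penalty="std"):
    """Discount a reward estimate by its predictive uncertainty.

    penalty="std": r = mean - k * sqrt(var)   (standard-deviation-based)
    penalty="var": r = mean - k * var         (variance-based)
    """
    if penalty == "std":
        return mean - k * np.sqrt(var)
    if penalty == "var":
        return mean - k * var
    raise ValueError(f"unknown penalty: {penalty!r}")
```

Either variant lowers the score of responses whose rewards the model is unsure about, which is precisely the regime in which overoptimization tends to occur.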

Empirical Validation

Through experiments comparing a proxy reward model against a gold-standard reward model across varying levels of KL divergence, the paper empirically demonstrates the benefit of adding uncertainty penalties to the reward. In particular, Laplace-LoRA substantially mitigates reward overoptimization in BoN sampling, supporting the method's practical viability. The results also show that the variance-based penalty performs slightly better at lower KL divergence.
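
For illustration, a BoN selection step that applies such an uncertainty-penalized proxy reward could look like the following sketch; all names are hypothetical and the penalty weight `k` would need to be tuned.

```python
import numpy as np

def best_of_n(candidates, reward_means, reward_vars, k=1.0):
    """Select the candidate with the highest uncertainty-penalized reward.

    candidates   : list of n responses sampled from the LLM for one prompt
    reward_means : n proxy-reward means from the Bayesian reward model
    reward_vars  : n predictive variances for those rewards
    """
    scores = np.asarray(reward_means) - k * np.sqrt(np.asarray(reward_vars))
    return candidates[int(np.argmax(scores))]
```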

Conclusion and Future Perspectives

Bayesian reward models built with Laplace-LoRA represent a meaningful advance toward aligning LLMs with human preferences while mitigating reward overoptimization. The paper identifies a critical challenge in the field and proposes a solution that is both theoretically grounded and empirically validated. It also opens avenues for further work on the safety and reliability of LLMs, and more broadly on applying Bayesian uncertainty quantification in generative AI.

References (36)
  1. Deep kernel processes. In International Conference on Machine Learning, pp. 130–140. PMLR, 2021.
  2. Adapting the linearised Laplace model evidence for modern deep learning. In ICML, 2022.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  4. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
  5. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. PMLR, 2015.
  6. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2019.
  7. ODIN: Disentangled reward mitigates hacking in RLHF. arXiv preprint arXiv:2402.07319, 2024.
  8. Reward model ensembles help mitigate overoptimization. In ICLR, 2024.
  9. Laplace redux - effortless Bayesian deep learning. NeurIPS, 2021.
  10. Accelerated linearized Laplace approximation for Bayesian deep learning. NeurIPS, 2022.
  11. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024.
  12. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244, 2023.
  13. 'In-between' uncertainty in Bayesian neural networks. In ICML Workshop on Uncertainty and Robustness in Deep Learning, 2019.
  14. Bayesian neural network priors revisited. arXiv preprint arXiv:2102.06571, 2021.
  15. Scaling laws for reward model overoptimization. In ICML, pp. 10835–10866, 2023.
  16. Uncertainty estimation for language reward models. arXiv preprint arXiv:2203.07472, 2022.
  17. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  18. Improving predictions of Bayesian neural nets via local linearization. In AISTATS, 2021.
  19. Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. In ICML, 2020.
  20. A sober look at LLMs for material discovery: Are they actually good for Bayesian optimization over molecules? arXiv preprint arXiv:2402.05015, 2024.
  21. Limitations of the empirical Fisher approximation for natural gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
  22. Speciality vs generality: An empirical study on catastrophic forgetting in fine-tuning foundation models. arXiv preprint arXiv:2309.06256, 2023.
  23. David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 1992.
  24. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  25. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  26. Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. In International Conference on Machine Learning, pp. 8248–8259. PMLR, 2021.
  27. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  28. WARM: On the benefits of weight averaged reward models. arXiv preprint arXiv:2401.12187, 2024.
  29. DePT: Decomposed prompt tuning for parameter-efficient fine-tuning. arXiv preprint arXiv:2309.05173, 2023.
  30. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  31. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  32. LoRA ensembles for large language model fine-tuning. arXiv preprint arXiv:2310.00035, 2023.
  33. Bayesian low-rank adaptation for large language models. In ICLR, 2024.
  34. Uncertainty-penalized reinforcement learning from human feedback with diverse reward LoRA ensembles. arXiv preprint arXiv:2401.00243, 2024.
  35. Cyclical stochastic gradient MCMC for Bayesian deep learning. arXiv preprint arXiv:1902.03932, 2019.
  36. Improving reinforcement learning from human feedback with efficient reward model ensemble. arXiv preprint arXiv:2401.16635, 2024.
Authors (7)
  1. Adam X. Yang (6 papers)
  2. Maxime Robeyns (6 papers)
  3. Thomas Coste (5 papers)
  4. Jun Wang (990 papers)
  5. Haitham Bou-Ammar (30 papers)
  6. Laurence Aitchison (66 papers)
  7. Zhengyan Shi (7 papers)
Citations (11)