Transforming and Combining Rewards for Aligning Large Language Models (2402.00742v2)
Abstract: A common approach for aligning LLMs to human preferences is to first learn a reward model from preference data, and then use this reward model to update the LLM. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is ``better'' than others? Second, we often wish to align LLMs to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice of transformation for (the common case of) rewards learned from Bradley-Terry preference models. The derived transformation is straightforward: we apply a log-sigmoid function to the centered rewards, a method we term the
``LSC-transformation'' (log-sigmoid-centered transformation). This transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is ``good'' in all measured properties, in a sense we make precise. Experiments aligning LLMs to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
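A minimal sketch of the idea described in the abstract, assuming a simple per-property centering value (how the reference point is chosen is not specified here and is an assumption): center each Bradley-Terry reward, apply a log-sigmoid, and sum the transformed rewards across properties to approximate a log-probability that the output is "good" on all of them.

```python
import numpy as np

def lsc_transform(rewards, reference_reward):
    """Log-sigmoid-centered (LSC) transformation, sketched.

    Centers rewards by a reference value (assumed here, e.g. the reward of a
    baseline output for the same prompt), then applies a log-sigmoid so the
    result behaves like log P(output is "good" for this property).
    """
    centered = np.asarray(rewards, dtype=float) - reference_reward
    # Numerically stable log-sigmoid: log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -centered)

def combine_transformed(transformed_rewards):
    """Summing transformed rewards acts like a logical AND across properties:
    it approximates the log-probability of being "good" on every property."""
    return np.sum(transformed_rewards, axis=0)

# Illustrative usage with hypothetical helpfulness/harmlessness rewards.
r_helpful = np.array([1.2, -0.3, 2.5])
r_harmless = np.array([0.4, 0.9, -1.1])
combined = combine_transformed([
    lsc_transform(r_helpful, reference_reward=0.5),   # hypothetical reference
    lsc_transform(r_harmless, reference_reward=0.0),  # hypothetical reference
])
print(combined)  # higher values = more likely "good" on both properties
```

Because the log-sigmoid saturates for large positive centered rewards, gains on already-good outputs contribute little, which is what shifts optimization pressure toward poorly-performing outputs as the abstract describes.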
- Zihao Wang (216 papers)
- Chirag Nagpal (25 papers)
- Jonathan Berant (107 papers)
- Jacob Eisenstein (73 papers)
- Alex D'Amour (5 papers)
- Sanmi Koyejo (111 papers)
- Victor Veitch (38 papers)