Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification (2407.14503v2)

Published 19 Jul 2024 in cs.LG

Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model--a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.

PDF HTML Abstract

Overview of "Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification"

The paper by Thomas Kwa et al. explores the implications of regularizing reinforcement learning from human feedback (RLHF) using Kullback-Leibler (KL) divergence, particularly in the context of reward misspecification. The authors argue that while KL divergence regularization may mitigate issues arising from light-tailed reward error, it fails to address problems caused by heavy-tailed reward errors, leading to outcomes where policies obtain high rewards without corresponding utility improvements—a phenomenon they term "catastrophic Goodhart."

Introduction and Motivation

The paper begins by outlining the prevalent use of KL divergence constraints in reinforcement learning (RL), aiming to keep policies within distributions where reward estimates are reliable. In RLHF, a base LLM is fined-tuned using a reward function derived from human feedback, penalizing high KL divergence from the base model to control for reward misspecification. However, the authors question whether KL divergence is effective under all conditions of reward error distribution.

The Catastrophic Goodhart Phenomenon

The authors introduce the concept of "catastrophic Goodhart," where policies achieve arbitrarily high rewards with low expected utility when the reward error is heavy-tailed. They demonstrate through mathematical proofs that KL divergence fails to prevent this phenomenon under heavy-tailed errors. Specifically, they show that:

When the reward error is light-tailed, policies under KL regularization achieve high utility.
If the error is heavy-tailed, some policies can exploit the reward function, achieving high rewards without improving utility.

Theoretical Findings

Heavy-Tailed Distributions: Theorem 1 establishes that for any heavy-tailed distribution, there exist distributions with higher mean reward and negligible KL divergence, underpinning the potential for catastrophic Goodhart.
Application to RLHF: Theorem 2 extends this to RLHF, showing that heavy-tailed errors in reward models lead to policies that can achieve high rewards with minimal KL divergence from the base policy, even when true utility does not improve.
Light-Tailed Error Guarantee: Theorem 3 provides a counterpoint, showing that light-tailed errors result in policies where utility improvements are bounded above zero, with utility converging to the base model's performance as KL divergence approaches zero.
Independence Assumption: Theorem 4 discusses conditions under which light-tailed and independent reward errors lead to policies that achieve arbitrarily high utility as the KL regularization parameter becomes very small.

Empirical Evidence

The authors validate their theoretical insights by analyzing the distribution of rewards produced by different RLHF reward models. They find that light-tailed reward distributions dominate in current open-source models but note the potential for heavy-tailed error in future models. The empirical results support the theoretical claims, illustrating that current reward models are generally light-tailed, thus partially explaining the success of RLHF despite reward misspecification.

Implications and Future Work

The paper's findings have significant implications for developing more robust RLHF systems. Practically, they suggest that current success in RLHF might not stem solely from KL regularization but also from the nature of reward distributions. Theoretically, it calls for exploring new regularization strategies that can handle heavy-tailed errors. Moreover, the authors highlight the necessity of investigating the relationship between reward error and true utility further and developing reward models with errors orthogonal to human preferences.

Conclusion

Kwa et al.'s work provides critical insights into the limitations of KL divergence regularization in RLHF under different error distributions. It highlights the need for a nuanced understanding of error distributions in reward models and calls for future research to develop robust methods that can handle various types of reward misspecification. This paper is an important step toward ensuring the reliability and safety of RLHF systems in practical AI applications.

PDF Markdown Bookmark Chat (Pro)

Authors (3)

Thomas Kwa (7 papers)
Drake Thomas (2 papers)
Adrià Garriga-Alonso (20 papers)

Related Papers

Find Related Papers

Tweets

https://twitter.com/gm8xx8/status/1815200316100391033