Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF (2401.16335v1)

Published 29 Jan 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns LLMs closely with human-centric values. The initial phase of RLHF involves learning human values using a reward model from ranking data. It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS). The core idea is that during each training epoch, we not only update the model with the data, but also update the data using the model, replacing hard labels with soft labels. Our empirical findings highlight the superior performance of this approach over traditional methods.

Introduction

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly prominent method for aligning LLMs with human-centric values and preferences. Despite significant empirical successes across various applications, the RLHF paradigm frequently encounters issues like reward overfitting and reward overoptimization. These phenomena not only impede the stability and reliability of LLM deployment but also raise concerns about the scalability of RLHF.

Understanding Reward Overfitting and Overoptimization

Recent work has provided insight into two major challenges within RLHF. Reward overfitting emerges when the reward model's evaluation performance deteriorates rapidly after only a single epoch of training. The degradation is partly due to the inadequacy of the cross-entropy loss on long-tailed preference datasets. Even simple 3-armed bandit problems demonstrate significant overfitting and overoptimization under these conditions. The root issue is that the empirical cross-entropy minimizer is poorly constrained on rarely compared items in the dataset, leading to extreme and unreliable reward estimates.
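
To make the failure mode concrete, here is a minimal sketch (not the paper's exact experiment) of fitting a Bradley-Terry reward model to long-tailed pairwise comparisons in a toy 3-armed bandit by plain cross-entropy minimization. The true rewards, comparison counts, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = np.array([0.0, 0.5, 1.0])        # arm 2 is truly best, arm 0 truly worst

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Long-tailed comparison data: the pair (1, 2) is compared 100 times,
# while arm 0 appears in a single comparison, which it loses to arm 2.
data = []
for _ in range(100):
    p2 = sigmoid(true_r[2] - true_r[1])    # Bradley-Terry P(arm 2 preferred)
    data.append((2, 1) if rng.random() < p2 else (1, 2))
data.append((2, 0))

# Maximum-likelihood fit: minimize the empirical cross-entropy by gradient descent.
r_hat = np.zeros(3)
lr = 1.0
for _ in range(2000):
    grad = np.zeros(3)
    for w, l in data:                      # (winner, loser)
        p = sigmoid(r_hat[w] - r_hat[l])
        grad[w] -= (1.0 - p)               # d(-log p)/d r_w
        grad[l] += (1.0 - p)
    r_hat -= lr * grad / len(data)

# r_hat[2] - r_hat[1] settles near the gap implied by the observed frequencies,
# but r_hat[0] keeps drifting downward: the cross-entropy minimizer for an arm
# that lost its only comparison is unbounded, i.e. an extreme reward estimate.
print(r_hat)
```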

The other challenge, reward overoptimization, occurs in the policy learning stage. Typically, when the policy model is trained to maximize the learned reward, the ground-truth reward may initially increase but subsequently decrease as training continues. This phenomenon is most pronounced once the policy has diverged significantly from its original state in terms of KL divergence, at which point further optimization inadvertently steers the policy away from the true objective it is supposed to maximize.
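
For concreteness, the quantities referred to here are the KL divergence of the trained policy from its initial reference policy and the KL-penalized proxy objective commonly maximized during the policy stage. The sketch below writes them down for categorical policies; the function names and penalty coefficient are illustrative, not taken from the paper.

```python
import numpy as np

def kl_divergence(pi, pi_ref):
    """KL(pi || pi_ref) for two categorical policies over the same action set."""
    pi, pi_ref = np.asarray(pi, float), np.asarray(pi_ref, float)
    mask = pi > 0
    return float(np.sum(pi[mask] * np.log(pi[mask] / pi_ref[mask])))

def kl_penalized_objective(pi, r_hat, pi_ref, beta=0.1):
    """E_pi[r_hat] - beta * KL(pi || pi_ref): a proxy objective whose value can
    keep rising even while the ground-truth reward of pi starts to fall."""
    pi = np.asarray(pi, float)
    return float(pi @ np.asarray(r_hat, float)) - beta * kl_divergence(pi, pi_ref)
```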

Iterative Data Smoothing as a Solution

To mitigate these concerns, a new algorithm dubbed Iterative Data Smoothing (IDS) is proposed, taking inspiration from the pessimism mechanism found in bandit learning. IDS alternates between updating the model with the data and updating the data through soft labels, effectively smoothing the influence of infrequent observations. This mechanism discourages over-emphasis on sporadically seen samples and concentrates learning on more commonly observed pairs. It combines the advantages of soft labels and iterative learning, with the data and model informing each other through successive training epochs.
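
A minimal sketch of this update scheme for a tabular (bandit) reward model is given below. The two-step structure, a cross-entropy model update with the current soft labels followed by smoothing the labels toward the model's predictions, follows the description above; the specific learning rate, mixing coefficient beta, and tabular parameterization are illustrative assumptions rather than the paper's exact schedule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ids_train(data, n_arms, epochs=500, lr=1.0, beta=0.05):
    """Sketch of Iterative Data Smoothing for a tabular (bandit) reward model.

    data: list of (winner, loser) arm index pairs from pairwise comparisons.
    Each comparison starts with a hard label y = 1 ("winner preferred"),
    which is progressively replaced by a soft label that mixes in the
    model's own predicted preference probability.
    """
    r = np.zeros(n_arms)            # reward estimates (the "model")
    y = np.ones(len(data))          # soft labels, initialised to the hard labels
    for _ in range(epochs):
        # 1) Model update: one cross-entropy gradient step with the current soft labels.
        grad = np.zeros(n_arms)
        for (w, l), y_k in zip(data, y):
            p = sigmoid(r[w] - r[l])      # model's P(winner preferred over loser)
            grad[w] -= (y_k - p)          # gradient of the soft-label cross-entropy
            grad[l] += (y_k - p)
        r -= lr * grad / len(data)
        # 2) Data update: smooth each label toward the model's current prediction.
        preds = np.array([sigmoid(r[w] - r[l]) for w, l in data])
        y = (1.0 - beta) * y + beta * preds
    return r
```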

Theoretical analysis reveals that IDS, unlike pessimism-based approaches such as lower-confidence-bound algorithms, effectively learns the ground-truth preference distribution for comparisons that receive sufficient observations while disregarding those seen only infrequently. Experimental evidence confirms the algorithm's efficacy in both bandit and neural network settings.
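
As an illustrative check of that behaviour, reusing the ids_train sketch and the toy bandit from above (names and numbers remain hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pair (1, 2) is compared 100 times with P(2 preferred) ~ 0.62 (the Bradley-Terry
# probability under the true rewards used earlier); arm 0 loses its single comparison.
data = [(2, 1) if rng.random() < 0.62 else (1, 2) for _ in range(100)]
data.append((2, 0))

r = ids_train(data, n_arms=3)
# Expected behaviour: r[2] - r[1] approaches the gap implied by the observed
# preference frequency (~0.5), while r[0] stays near its initial value rather
# than drifting to an extreme estimate as the hard-label minimizer does.
print(r, r[2] - r[1])
```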

Related Work and Future Directions

IDS builds upon an existing body of work in RLHF, Preference-based Reinforcement Learning, knowledge distillation, and ranking estimation from pairwise comparisons. Prior studies have highlighted similar challenges and proposed various solutions, yet few have offered a strategy that systematically addresses both overfitting and overoptimization in a single framework.

Going forward, it will be important to extend the IDS methodology to multi-armed bandits with more complex comparison scenarios and to integrate it into neural network-based reward models. Refining the algorithm and its practical implementation will deepen our understanding and broaden its application across RLHF tasks. Further theoretical analysis, particularly of the IDS algorithm's long-term convergence properties, is an essential step toward solidifying its place in the RLHF toolkit.

Authors (3)
  1. Banghua Zhu
  2. Michael I. Jordan
  3. Jiantao Jiao