Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF (2401.16335v1)

Published 29 Jan 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns LLMs closely with human-centric values. The initial phase of RLHF involves learning human values using a reward model from ranking data. It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS). The core idea is that during each training epoch, we not only update the model with the data, but also update the data using the model, replacing hard labels with soft labels. Our empirical findings highlight the superior performance of this approach over traditional methods.

Introduction

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly prominent method for aligning LLMs with human-centric values and preferences. Despite significant empirical successes across various applications, the RLHF paradigm frequently encounters issues like reward overfitting and reward overoptimization. These phenomena not only impede the stability and reliability of LLM deployment but also raise concerns about the scalability of RLHF.

Understanding Reward Overfitting and Overoptimization

Recent work has provided insight into two major challenges within RLHF. Reward overfitting emerges when the reward model's evaluation performance deteriorates rapidly after only a single epoch of training. The degradation is partly due to the inadequacy of the cross-entropy loss on long-tailed preference datasets. Even simple 3-armed bandit problems demonstrate significant overfitting and overoptimization under these conditions. The root issue is that the empirical cross-entropy minimizer is poorly constrained on rarely compared items in the dataset, leading to extreme and unreliable reward estimates.
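
To make the failure mode concrete, here is a minimal sketch (not the paper's exact experiment) of fitting a Bradley-Terry reward model to long-tailed pairwise comparisons in a toy 3-armed bandit by plain cross-entropy minimization. The true rewards, comparison counts, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = np.array([0.0, 0.5, 1.0])        # arm 2 is truly best, arm 0 truly worst

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Long-tailed comparison data: the pair (1, 2) is compared 100 times,
# while arm 0 appears in a single comparison, which it loses to arm 2.
data = []
for _ in range(100):
    p2 = sigmoid(true_r[2] - true_r[1])    # Bradley-Terry P(arm 2 preferred)
    data.append((2, 1) if rng.random() < p2 else (1, 2))
data.append((2, 0))

# Maximum-likelihood fit: minimize the empirical cross-entropy by gradient descent.
r_hat = np.zeros(3)
lr = 1.0
for _ in range(2000):
    grad = np.zeros(3)
    for w, l in data:                      # (winner, loser)
        p = sigmoid(r_hat[w] - r_hat[l])
        grad[w] -= (1.0 - p)               # d(-log p)/d r_w
        grad[l] += (1.0 - p)
    r_hat -= lr * grad / len(data)

# r_hat[2] - r_hat[1] settles near the gap implied by the observed frequencies,
# but r_hat[0] keeps drifting downward: the cross-entropy minimizer for an arm
# that lost its only comparison is unbounded, i.e. an extreme reward estimate.
print(r_hat)
```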

The other challenge, reward overoptimization, occurs in the policy learning stage. Typically, when the policy model is trained to maximize the learned reward, the ground-truth reward may initially increase but subsequently decrease as training continues. This phenomenon is most pronounced once the policy has diverged significantly from its original state in terms of KL divergence, at which point further optimization inadvertently steers the policy away from the true objective it is supposed to maximize.
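
For concreteness, the quantities referred to here are the KL divergence of the trained policy from its initial reference policy and the KL-penalized proxy objective commonly maximized during the policy stage. The sketch below writes them down for categorical policies; the function names and penalty coefficient are illustrative, not taken from the paper.

```python
import numpy as np

def kl_divergence(pi, pi_ref):
    """KL(pi || pi_ref) for two categorical policies over the same action set."""
    pi, pi_ref = np.asarray(pi, float), np.asarray(pi_ref, float)
    mask = pi > 0
    return float(np.sum(pi[mask] * np.log(pi[mask] / pi_ref[mask])))

def kl_penalized_objective(pi, r_hat, pi_ref, beta=0.1):
    """E_pi[r_hat] - beta * KL(pi || pi_ref): a proxy objective whose value can
    keep rising even while the ground-truth reward of pi starts to fall."""
    pi = np.asarray(pi, float)
    return float(pi @ np.asarray(r_hat, float)) - beta * kl_divergence(pi, pi_ref)
```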

Iterative Data Smoothing as a Solution

To mitigate these concerns, a new algorithm dubbed Iterative Data Smoothing (IDS) is proposed, taking inspiration from the pessimism mechanism found in bandit learning. IDS alternates between updating the model with the data and updating the data through soft labels, effectively smoothing the influence of infrequent observations. This mechanism discourages over-emphasis on sporadically seen samples and concentrates learning on more commonly observed pairs. It combines the advantages of soft labels and iterative learning, with the data and model informing each other through successive training epochs.
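
A minimal sketch of this update scheme for a tabular (bandit) reward model is given below. The two-step structure, a cross-entropy model update with the current soft labels followed by smoothing the labels toward the model's predictions, follows the description above; the specific learning rate, mixing coefficient beta, and tabular parameterization are illustrative assumptions rather than the paper's exact schedule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ids_train(data, n_arms, epochs=500, lr=1.0, beta=0.05):
    """Sketch of Iterative Data Smoothing for a tabular (bandit) reward model.

    data: list of (winner, loser) arm index pairs from pairwise comparisons.
    Each comparison starts with a hard label y = 1 ("winner preferred"),
    which is progressively replaced by a soft label that mixes in the
    model's own predicted preference probability.
    """
    r = np.zeros(n_arms)            # reward estimates (the "model")
    y = np.ones(len(data))          # soft labels, initialised to the hard labels
    for _ in range(epochs):
        # 1) Model update: one cross-entropy gradient step with the current soft labels.
        grad = np.zeros(n_arms)
        for (w, l), y_k in zip(data, y):
            p = sigmoid(r[w] - r[l])      # model's P(winner preferred over loser)
            grad[w] -= (y_k - p)          # gradient of the soft-label cross-entropy
            grad[l] += (y_k - p)
        r -= lr * grad / len(data)
        # 2) Data update: smooth each label toward the model's current prediction.
        preds = np.array([sigmoid(r[w] - r[l]) for w, l in data])
        y = (1.0 - beta) * y + beta * preds
    return r
```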

Theoretical analysis reveals that IDS, unlike pessimism-based approaches such as lower-confidence-bound algorithms, effectively learns the ground-truth preference distribution for comparisons that receive sufficient observations while disregarding those seen only infrequently. Experimental evidence confirms the algorithm's efficacy in both bandit and neural network settings.
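
As an illustrative check of that behaviour, reusing the ids_train sketch and the toy bandit from above (names and numbers remain hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pair (1, 2) is compared 100 times with P(2 preferred) ~ 0.62 (the Bradley-Terry
# probability under the true rewards used earlier); arm 0 loses its single comparison.
data = [(2, 1) if rng.random() < 0.62 else (1, 2) for _ in range(100)]
data.append((2, 0))

r = ids_train(data, n_arms=3)
# Expected behaviour: r[2] - r[1] approaches the gap implied by the observed
# preference frequency (~0.5), while r[0] stays near its initial value rather
# than drifting to an extreme estimate as the hard-label minimizer does.
print(r, r[2] - r[1])
```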

Related Work and Future Directions

IDS builds upon an existing body of work in RLHF, Preference-based Reinforcement Learning, knowledge distillation, and ranking estimation from pairwise comparisons. Prior studies have highlighted similar challenges and proposed various solutions, yet few have offered a strategy that systematically addresses both overfitting and overoptimization in a single framework.

Going forward, it will be important to extend the IDS methodology to multi-armed bandits with more complex comparison scenarios and to integrate it into neural network-based reward models. Refining the algorithm and its practical implementation will deepen our understanding and broaden its application across RLHF tasks. Further theoretical analysis, particularly of the IDS algorithm's long-term convergence properties, is an essential step toward solidifying its place in the RLHF toolkit.

Authors (3)
  1. Banghua Zhu
  2. Michael I. Jordan
  3. Jiantao Jiao