Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
The paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms," authored by Rafael Rafailov et al., conducts an empirical analysis of reward overoptimization in Direct Alignment Algorithms (DAAs). This work extends the understanding of reward hacking phenomena traditionally associated with Reinforcement Learning from Human Feedback (RLHF) to the newer class of DAAs. The authors provide a comprehensive investigation into how DAAs, despite bypassing the reward modeling phase, still exhibit overoptimization patterns similar to those found in classical RLHF methods.
Background and Motivation
Reinforcement Learning from Human Feedback (RLHF) has been integral to fine-tuning large language models (LLMs). This multi-step process trains a reward model on human preference data and then optimizes the LLM with an RL algorithm to maximize the expected reward, typically under a KL penalty that keeps the policy close to its supervised initialization. A critical challenge in this pipeline is reward overoptimization: the policy's performance on the proxy reward function keeps improving while its true quality degrades. This issue, often referred to as reward hacking, occurs because the proxy reward function is imperfect and can be gamed by the policy being optimized.
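A standard formulation of this objective (notation assumed here: r_φ is the learned reward model, π_ref the reference policy from supervised fine-tuning, and β the KL coefficient) is:

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

The KL term is what the later discussion of "KL budgets" refers to: the more the policy is allowed to diverge from π_ref, the more aggressively it can exploit the proxy reward.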
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), offer an alternative to the RLHF pipeline: they skip the explicit reward model and the RL step, training the LLM directly on preference data. This paper asks whether DAAs, despite their structural differences from RLHF, also suffer from overoptimization, and characterizes the implications of this phenomenon.
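DPO makes this direct approach concrete: it trains on preference pairs (a prompt x with a chosen response y_w and a rejected response y_l) using a classification-style loss on the policy's log-probability ratios against the reference model, with β playing the same role as the KL coefficient above:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= - \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

IPO and SLiC replace the log-sigmoid with other loss functions but share the same ratio-based structure, which is what the paper's unified framework captures.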
Key Contributions
- Unified Framework for DAAs: The paper provides a unifying view of recent methods under the DAA umbrella, including DPO, IPO, and SLiC, framing them as different loss functions applied to the same implicit reward margin (see the code sketch after this list). This framework makes it possible to analyze and compare overoptimization across DAA methods.
- Empirical Evidence of Overoptimization: Through extensive experiments, the authors demonstrate that DAAs exhibit overoptimization similar to RLHF methods. They show that as the Kullback-Leibler (KL) divergence budget increases, DAA performance initially improves but eventually deteriorates, indicating overoptimization.
- Scaling Laws for Overoptimization: The paper extends the scaling laws previously established for RLHF to DAAs. These laws model the relationship between KL divergence and performance, providing a quantitative understanding of overoptimization in DAAs.
- Behavioral Analysis on Toy MDPs: By employing a toy Markov Decision Process (MDP), the authors illustrate how DAAs can allocate substantial probability mass to out-of-distribution (OOD) sequences, underscoring the under-constrained nature of DAA training objectives.
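The unified framework (first bullet above) can be stated compactly in code. The sketch below is illustrative, not the authors' implementation; the exact IPO and SLiC scalings and margin targets are assumptions. It treats DPO, IPO, and SLiC as different convex losses applied to the same implicit reward margin:

```python
import torch
import torch.nn.functional as F

def daa_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, method="dpo"):
    """Unified DAA loss sketch.

    Inputs are per-sequence summed log-probabilities of the chosen (w) and
    rejected (l) responses under the trained policy and the frozen reference
    model. Each method applies a different convex function to the implicit
    reward margin beta * [log pi/pi_ref(y_w) - log pi/pi_ref(y_l)].
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if method == "dpo":
        # Logistic (negative log-sigmoid) loss, as in DPO.
        return -F.logsigmoid(margin).mean()
    if method == "ipo":
        # Squared loss regressing the margin toward 1/2 (IPO, up to scaling).
        return ((margin - 0.5) ** 2).mean()
    if method == "slic":
        # Hinge loss on the margin (SLiC-style; margin target of 1 assumed).
        return torch.clamp(1.0 - margin, min=0.0).mean()
    raise ValueError(f"unknown method: {method}")
```

Note that all three losses depend only on pairwise margins between responses in the dataset, which foreshadows the under-constrained behavior discussed below.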
Experimental Findings
The authors ran experiments on the Reddit TL;DR summarization dataset with Pythia models at three scales (1B, 2.8B, and 6.9B parameters). They evaluated three DAA methods (DPO, IPO, and SLiC) across a range of KL budgets, governed by the β regularization coefficient, to analyze overoptimization trends. The main findings are:
- Overoptimization Consistency: All DAA methods exhibited hump-shaped performance curves: increasing the KL budget initially improved performance but led to degradation beyond a certain point (the functional form used to model this trade-off is sketched after this list). The pattern was consistent across model scales and DAA methods.
- Model Size and Overoptimization: Larger models (6.9B parameters) were less prone to overoptimization than smaller ones (1B parameters), suggesting that higher-capacity models manage the trade-off better under the same KL constraints.
- Length Correlations and Extrapolation: The paper showed that DAAs tend to overfit simpler features, such as response length, more strongly in lower-capacity settings. This bias toward verbosity was especially pronounced in smaller models and at smaller KL budgets.
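The scaling-law fit follows prior work on RLHF overoptimization (Gao et al.), which the paper adapts to DAAs. A sketch of the parameterization, assuming the same convention, with d the square root of the policy-reference KL divergence and α, γ coefficients fitted per method and model scale (γ is used here to avoid clashing with the KL coefficient β):

```latex
R(d) = d \left( \alpha - \gamma \log d \right),
\qquad
d = \sqrt{\mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\big\|\, \pi_{\mathrm{ref}} \right]}
```

This curve rises from zero, peaks, and then declines as the -γ log d term dominates, reproducing the hump-shaped trade-off observed in the experiments.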
Theoretical Implications
The authors attribute the overoptimization observed in DAAs to the under-constrained nature of the optimization problem. DAAs fit an "implicit" reward model from preference data, and because that data only constrains the reward on in-distribution response pairs, many optima exist, including ones that place probability mass on OOD responses. This under-constrained fitting is a key driver of overoptimization, akin to but distinct from the proxy reward exploitation seen in classical RLHF.
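For DPO-style methods, the implicit reward implied by a trained policy is the scaled log-ratio against the reference model (up to a prompt-dependent constant):

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

Because the training loss only compares this quantity between pairs of responses from the dataset, any policy that preserves those pairwise margins achieves the same loss, including policies that shift probability mass onto sequences the preference data never covers.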
Practical Implications and Future Work
The findings highlight the need for more robust DAA training methodologies that mitigate overoptimization. Potential directions include incorporating more diverse and comprehensive preference data, leveraging techniques from robust and offline RL, and developing new objective functions that better constrain the optimization process.
On a theoretical level, further research could explore more precise characterizations of the implicit reward functions and their behavior in high-capacity settings. Additionally, integrating online feedback mechanisms might offer a way to dynamically adjust and correct for OOD biases during training.
Conclusion
This paper advances the understanding of reward overoptimization in DAAs, providing empirical evidence that these algorithms, like their RLHF counterparts, suffer from performance degradation due to overoptimization. The proposed scaling laws and unified framework offer valuable tools for analyzing and mitigating these issues, paving the way for more effective alignment strategies for LLMs.