Exploring Fine-Tuning Techniques for Optimizing Binary Preferences in LLMs
Overview of Study Focus and Methodology
This paper investigates the effectiveness of fine-tuning techniques designed to align LLMs with binary preference feedback, such as judgments collected from human evaluators or from other AI models. The core question is whether methods like on-policy reinforcement learning (RL), contrastive learning, and supervised learning can effectively steer model outputs toward preferred responses under practical constraints such as limited data coverage and computational budgets.
The authors develop a controlled experimental setup spanning didactic bandit problems, synthetic LLM scenarios, and real-world LLM applications built on datasets such as AlpacaFarm and UltraFeedback. This setup enables a detailed study of how different training choices, including on-policy sampling, sample reuse, and the use of negative gradients, affect fine-tuning performance.
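To make the setting concrete, each binary preference in such datasets pairs a prompt with a preferred and a dispreferred response. The record below is a schematic illustration only; the field names are hypothetical rather than the exact schema of AlpacaFarm or UltraFeedback.

```python
# One binary preference comparison: a prompt, the preferred ("chosen")
# response, and the dispreferred ("rejected") response. Field names are
# illustrative, not the exact schema of any particular dataset.
preference_example = {
    "prompt": "Summarize the main idea of reinforcement learning in one sentence.",
    "chosen": "Reinforcement learning trains an agent to choose actions that maximize cumulative reward through trial and error.",
    "rejected": "Reinforcement learning is when a model memorizes the right answers.",
}
```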
Key Insights on Fine-Tuning Dynamics
The analysis hinges on several key aspects of the fine-tuning process:
- On-Policy Sampling: The results suggest that on-policy methods, which train on samples generated by the current policy at each iteration, are particularly effective when the highest-reward responses are under-represented in the preference data or unlikely under the initial model. Because the sampling distribution tracks the policy as it improves, these methods discover and reinforce high-reward responses more efficiently (see the sketch after this list).
- Role of Negative Gradients: Training objectives that incorporate negative gradients, that is, terms that explicitly push down the likelihood of less preferred responses, tend to outperform those that do not, especially when preferred responses have low probability under the initial model distribution.
- Complementary Effects: The paper highlights the complementary benefits of combining on-policy sampling with methods that employ negative gradients. This combination tends to accelerate convergence and enables the model to more effectively focus on high-reward areas of the response space.
- Sample Reuse: On sample reuse, the findings indicate that taking additional gradient steps on previously generated on-policy samples can improve compute efficiency, although this makes training partially off-policy and carries a risk of overfitting if not managed carefully.
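To ground these points, the sketch below implements a minimal on-policy preference fine-tuning loop in the spirit of the paper's didactic bandit experiments: the policy is a categorical distribution over a handful of candidate responses, a fixed reward table stands in for a reward model, the loss is a DPO-style contrastive objective whose gradient explicitly pushes down the rejected response, and a small replay slice illustrates sample reuse. All names and hyperparameters here are assumptions for illustration, not the authors' actual code.

```python
import torch
import torch.nn.functional as F

# Didactic bandit version of on-policy preference fine-tuning.
# The "policy" is a categorical distribution over K discrete responses;
# a fixed reward table stands in for a learned reward model. All values
# below are illustrative assumptions.
torch.manual_seed(0)
K = 8                                    # number of candidate responses
rewards = torch.randn(K)                 # scalar reward for each response
policy_logits = torch.zeros(K, requires_grad=True)   # trainable policy
ref_logits = torch.zeros(K)              # frozen reference policy (uniform)
optimizer = torch.optim.Adam([policy_logits], lr=0.1)
beta = 0.1                               # strength of the implicit KL constraint
replay_buffer = []                       # stored (chosen, rejected) index pairs

def dpo_loss(chosen: int, rejected: int) -> torch.Tensor:
    """DPO-style contrastive loss: raises the log-probability of the chosen
    response and applies a negative gradient to the rejected one, both
    measured relative to the frozen reference policy."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    margin = beta * ((logp[chosen] - ref_logp[chosen])
                     - (logp[rejected] - ref_logp[rejected]))
    return -F.logsigmoid(margin)

for step in range(200):
    # On-policy sampling: draw two responses from the *current* policy and
    # label the preference with the reward table.
    with torch.no_grad():
        probs = F.softmax(policy_logits, dim=-1)
        a, b = torch.multinomial(probs, 2, replacement=False).tolist()
    chosen, rejected = (a, b) if rewards[a] > rewards[b] else (b, a)
    replay_buffer.append((chosen, rejected))

    # One fresh on-policy pair plus a small slice of reused pairs; reusing
    # too many stale pairs makes the update increasingly off-policy.
    batch = [(chosen, rejected)] + replay_buffer[-4:-1]
    loss = torch.stack([dpo_loss(c, r) for c, r in batch]).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("Highest-reward response index:", rewards.argmax().item())
print("Policy's most likely response :", policy_logits.argmax().item())
```

Dropping the replay slice makes the loop purely on-policy, while replacing the freshly sampled pair with pairs drawn from a fixed offline dataset recovers a purely offline contrastive method, which mirrors the on-policy versus offline comparison discussed above.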
Theoretical Insights and Implications
From a theoretical perspective, the paper discusses how the divergence a training objective effectively minimizes (reverse KL divergence versus forward KL divergence) determines whether the model behaves in a "mode-seeking" or "mode-covering" manner. Mode-seeking behavior, associated with on-policy RL and negative-gradient approaches, concentrates probability mass on a few high-reward responses and is shown to be more effective for quick adaptation in sparse-reward settings, whereas mode-covering behavior, characteristic of maximum-likelihood supervised objectives, spreads mass over the full data distribution.
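For reference, the two divergences can be written as below, comparing the policy being fine-tuned against a target (e.g., reward-tilted) distribution; this is the standard formulation rather than notation taken from the paper.

```latex
% Reverse KL (mode-seeking): the expectation is under the model \pi_\theta,
% so the model is only penalized where it places mass and may ignore
% low-reward modes of the target p^*.
D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*)
    = \mathbb{E}_{y \sim \pi_\theta}\left[ \log \frac{\pi_\theta(y)}{p^*(y)} \right]

% Forward KL (mode-covering): the expectation is under the target p^*,
% so the model must place mass everywhere the target does, as in maximum
% likelihood / supervised fine-tuning.
D_{\mathrm{KL}}(p^* \,\|\, \pi_\theta)
    = \mathbb{E}_{y \sim p^*}\left[ \log \frac{p^*(y)}{\pi_\theta(y)} \right]
```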
Future Directions
The implications of this research are far-reaching for the development of more adaptable and efficient LLMs in various applications, from automated customer service to content generation. Future investigations could further refine these insights by quantifying the impact of reward model accuracy, exploring alternative forms of contrastive and minimax formulations, and examining the interplay between pre-training coverage and fine-tuning effectiveness.
This paper provides a foundational framework that both clarifies the complex dynamics of LLM fine-tuning and guides practical implementations, promising to inform a broad spectrum of future research and application developments in generative AI.