Analyzing "Understanding Preference Fine-Tuning Through the Lens of Coverage"
The paper "Understanding Preference Fine-Tuning Through the Lens of Coverage" systematically explores the dichotomy between online and offline methodologies for preference fine-tuning of LLMs, particularly in the context of reinforcement learning from human feedback (RLHF) and contrastive methods. The authors aim to address why online RL methods generally perform better than offline contrastive approaches when human preference data is limited or non-diverse.
Theoretical Insights on Coverage
The paper argues that the coverage of the preference dataset plays a crucial role in the efficacy of fine-tuning methods. Drawing on the notion of data coverage from reinforcement learning, the authors show that for offline contrastive methods like Direct Preference Optimization (DPO) to converge to the optimal policy, a global coverage condition is necessary: the offline data distribution must adequately cover the responses of every candidate policy. In contrast, online RL methods like Proximal Policy Optimization (PPO) succeed under a weaker local coverage condition, in which the data need only cover policies in a neighborhood of the reference policy, a requirement that on-policy sampling helps satisfy. This separation theoretically explains the observed empirical advantage of RLHF over offline methods.
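As a rough sketch (the notation below is illustrative and not lifted verbatim from the paper), coverage conditions in this literature are typically stated as bounds on density ratios between a candidate policy π and the data-generating distribution μ:

```latex
% Global coverage (the kind of condition offline contrastive methods need):
% the offline data must dominate every candidate policy in the class.
\max_{\pi \in \Pi} \; \sup_{x,\, y} \; \frac{\pi(y \mid x)}{\mu(y \mid x)} \;\le\; C_{\mathrm{glob}} < \infty

% Local coverage (the weaker condition sufficient for KL-regularized online RL):
% the bound is only required for policies close to the reference policy,
% e.g. within a KL ball of radius \epsilon around \pi_{\mathrm{ref}}.
\max_{\pi :\; \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) \le \epsilon} \; \sup_{x,\, y} \; \frac{\pi(y \mid x)}{\mu(y \mid x)} \;\le\; C_{\mathrm{loc}} < \infty
```

The intuition is that on-policy sampling plus KL regularization keeps the learned policy inside that neighborhood, so only the local ratio ever needs to be controlled.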
Hybrid Preference Optimization (HyPO)
Motivated by these theoretical insights, the paper introduces Hybrid Preference Optimization (HyPO). HyPO combines offline and online data: it optimizes an offline contrastive (DPO-style) objective on the preference pairs while using online rollouts to regularize the KL divergence from the reference policy. The hybrid algorithm aims to keep the computational simplicity of offline methods while recovering the benefits of on-policy sampling.
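To make that structure concrete, here is a minimal PyTorch-style sketch of one HyPO-like training step. The `policy.log_probs` and `policy.generate` helpers, the loss weights, and the batching are hypothetical stand-ins, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def hypo_step(policy, ref_policy, batch, beta=0.1, lam=0.1):
    """One HyPO-like update: a DPO loss on offline preference pairs plus a
    KL penalty estimated from on-policy (online) rollouts.

    `batch` is assumed to contain offline fields (prompts, chosen, rejected);
    helper names and hyperparameters are illustrative only.
    """
    # ---- Offline part: standard DPO loss on the preference pairs ----
    pi_chosen = policy.log_probs(batch["prompts"], batch["chosen"])
    pi_reject = policy.log_probs(batch["prompts"], batch["rejected"])
    with torch.no_grad():
        ref_chosen = ref_policy.log_probs(batch["prompts"], batch["chosen"])
        ref_reject = ref_policy.log_probs(batch["prompts"], batch["rejected"])
    margin = beta * ((pi_chosen - ref_chosen) - (pi_reject - ref_reject))
    dpo_loss = -F.logsigmoid(margin).mean()

    # ---- Online part: reverse-KL regularization from on-policy rollouts ----
    with torch.no_grad():
        rollouts = policy.generate(batch["prompts"])  # fresh on-policy samples
    pi_roll = policy.log_probs(batch["prompts"], rollouts)
    with torch.no_grad():
        ref_roll = ref_policy.log_probs(batch["prompts"], rollouts)
    # Monte Carlo estimate of KL(pi || pi_ref) under the current policy.
    kl_estimate = (pi_roll - ref_roll).mean()

    return dpo_loss + lam * kl_estimate
```

The key design point is that the preference term never requires generating new samples, while the KL term does; the online term is what gives the method local-coverage-style control over how far the policy drifts from the reference.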
Empirical Validation
Empirical evaluations on the TL;DR summarization task demonstrate that HyPO outperforms the standard DPO baseline, achieving higher GPT-4 win rates and lower reverse KL divergence from the reference policy. While HyPO trails PPO in performance, it retains the simplicity and computational efficiency of offline methods, since it requires no separate reward or critic model.
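For reference, the reverse KL reported here is the divergence of the learned policy from the reference policy, typically estimated from the policy's own samples:

```latex
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
  \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
```

Lower values indicate the fine-tuned policy stays closer to the reference model while still improving win rate.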
Function Approximation and Extrapolation
An intriguing theoretical observation concerns extrapolation during fine-tuning. The authors examine how, under function approximation, methods like DPO can increase the likelihood of responses that never appear in the training dataset, underlining how the choice of function-approximation regime determines whether a method generalizes beyond its training data.
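For context, the standard DPO objective (reproduced here in its common notation) only constrains the gap between log-probability ratios of the observed chosen/rejected pairs:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Because the loss depends only on these relative ratios, probability mass is free to move onto responses that never appear in the preference dataset; whether that extrapolation helps or hurts depends on the function class, which is the regime the authors analyze.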
Implications and Future Directions
This paper provides substantial insight into both the practical and theoretical aspects of LLM fine-tuning. It underscores the importance of data coverage and online sampling for reaching optimal policies, and it charts a path toward more efficient fine-tuning techniques that blend offline computational efficiency with robust online regularization. Future research could refine the treatment of reward models and investigate further hybrid strategies, extending these ideas to more complex, real-world tasks.
Conclusion
The paper clearly delineates the structural differences between the two predominant fine-tuning methodologies. By integrating offline and online elements into a cohesive framework, the authors propose a viable path forward that harnesses the strengths of each approach. As AI and LLMs continue to evolve, such integrative strategies will be crucial for both performance and computational sustainability.