Disentangling Best Practices for Learning from Preference Feedback in LLMs
In their paper, "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback," Ivison et al. conduct a comprehensive study of preference-based learning with the goal of understanding how much each component contributes to model performance. This investigation matters because preference-based learning is widely used to enhance large language models (LLMs) across numerous domains, yet the approaches in use vary considerably. The paper primarily contrasts two well-known methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), and further examines the effects of different preference datasets, reward model scales, and policy training prompts.
Key Findings
- Impact of Preference Data: Preference data had the largest effect on downstream performance. Synthetically generated datasets with fine-grained annotations, such as UltraFeedback, outperformed human-annotated and web-scraped alternatives, improving instruction following and truthfulness by up to 8%. The paper highlights that the quality of the preference data, defined by the chosen/rejected pairs, is paramount.
- Algorithm Comparison (PPO vs. DPO): PPO outperforms DPO across the evaluations, with an average improvement of 0.7 points. The primary gains appear in reasoning, coding, and safety, where PPO's structured, online sampling lets the policy learn from its own fresh generations. Conversely, DPO's simplicity and efficiency come at the cost of somewhat lower performance, since it trains only on static, pre-generated pairs (see the sketch after this list).
- Influence of Reward Models: Larger reward models, particularly when trained on extensive, high-quality mixed data, scored markedly better on direct reward-model evaluations. However, those gains translated only marginally into downstream policy performance, with mathematical problem solving (GSM) as the main exception. This discrepancy suggests that while better reward models capture nuanced preferences effectively, propagating those gains into a general policy model remains difficult.
- Role of Policy Training Prompts: Tailoring policy training prompts to a specific domain (e.g., math prompts for GSM evaluations) produced marked improvements on the corresponding specialized tasks. For broad, general-purpose gains, however, shifting the prompt distribution away from balanced, diverse sources such as UltraFeedback yielded minimal or even negative effects.
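To make the algorithmic contrast concrete, the sketch below shows the DPO objective operating directly on precomputed log-probabilities of static chosen/rejected pairs; PPO, by contrast, samples fresh completions during training and scores them with a separate reward model. This is a minimal PyTorch illustration under our own naming, with a placeholder `beta` of 0.1, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities, one
    entry per (prompt, completion) pair; `beta` controls how strongly the
    policy is kept close to the reference (SFT) model.
    """
    # How much more (or less) likely each completion is under the policy
    # than under the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The loss rewards a growing margin between chosen and rejected
    # log-ratios; no sampling and no reward model are involved.
    margin = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margin).mean()

# Illustrative log-probabilities for two static preference pairs
# (placeholder numbers, not real model outputs).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -9.8]),
    policy_rejected_logps=torch.tensor([-11.1, -10.4]),
    ref_chosen_logps=torch.tensor([-12.0, -10.0]),
    ref_rejected_logps=torch.tensor([-11.5, -10.1]),
)
print(f"DPO loss: {loss.item():.4f}")
```

Because everything DPO needs can be precomputed from a fixed preference dataset, there is no generation step and no reward model in the training loop, which is exactly the simplicity-versus-performance trade-off the paper measures.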
Implications and Future Directions
The implications of these findings are substantial for both theoretical advancements and practical implementations in AI development:
- Theoretical Insights: The paper elucidates the subtle balance required in preference-based learning between data quality, learning algorithms, and the optimization of reward models. It underscores the importance of synthetic datasets in capturing nuanced feedback and raises questions about the limitations of current reward model integration within policy training.
- Practical Applications: From an engineering standpoint, practitioners are given a clear "recipe": use synthetic preference data, train the policy with PPO against a large-scale reward model, and choose policy training prompts deliberately, broad by default and domain-targeted only when a specific application demands it. This recipe is particularly valuable for developers aiming to fine-tune LLMs for specific applications (a configuration sketch follows this list).
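Read as a configuration, that recipe might look like the sketch below. The dataclass and all of its field names and default values are illustrative stand-ins chosen for this summary, not an interface from the paper or its codebase.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceTuningRecipe:
    """Hypothetical summary of the paper's suggested setup (names are ours)."""
    # Synthetic, fine-grained preference data performed best in the study.
    preference_dataset: str = "ultrafeedback"
    # PPO beat DPO on average, at a higher implementation and compute cost.
    algorithm: str = "ppo"
    # Larger reward models trained on mixed, high-quality data helped most
    # on math-style evaluations; gains elsewhere were modest.
    reward_model_size: str = "large"
    reward_model_data: list[str] = field(default_factory=lambda: ["mixed_high_quality"])
    # Keep policy prompts broad and diverse unless targeting a single domain.
    policy_prompt_distribution: str = "diverse"
    # Set only when specializing, e.g. "math" for GSM-style targets.
    target_domain: str | None = None

print(PreferenceTuningRecipe())
```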
Speculations on Future Developments
Looking forward, a few potential areas for improvement and exploration emerge:
- Enhanced Reward Model Integration: Bridging the gap between improved reward model performance and downstream policy gains is critical. Future work might explore dynamic reward modeling that adapts continuously during policy training to better align policy updates with nuanced reward signals.
- Dataset Composition and Diversity: Further research could dissect the components of synthetic datasets that drive their superior performance. Understanding the intricate balances within these datasets can aid in crafting even more effective preference data.
- Algorithmic Innovations: While PPO has shown superior results, hybrid approaches that combine DPO's efficiency with PPO's exploration benefits might perform even better. This could involve semi-online methods or dynamic data-sampling strategies that adapt as the model improves (an illustrative skeleton follows this list).
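To illustrate what a semi-online hybrid could look like, the skeleton below periodically regenerates preference pairs from the current policy, ranks them with a reward model, and applies a DPO-style update to the fresh pairs. It is speculative: `policy_sample`, `reward_score`, and `dpo_update` are hypothetical callables supplied by the caller, not components described in the paper.

```python
import random

def semi_online_preference_loop(policy_sample, reward_score, dpo_update,
                                prompts, rounds=3, pairs_per_round=64):
    """Speculative hybrid loop: online pair generation, offline-style updates.

    `policy_sample(prompt)` draws one completion from the current policy,
    `reward_score(prompt, completion)` returns a scalar preference score,
    and `dpo_update(pairs)` performs one DPO-style optimization step.
    """
    for _ in range(rounds):
        batch = random.sample(prompts, k=min(pairs_per_round, len(prompts)))
        pairs = []
        for prompt in batch:
            # Online step: two fresh completions from the *current* policy.
            a, b = policy_sample(prompt), policy_sample(prompt)
            # The reward model decides which completion counts as "chosen".
            if reward_score(prompt, a) >= reward_score(prompt, b):
                pairs.append((prompt, a, b))
            else:
                pairs.append((prompt, b, a))
        # Offline step: cheap DPO-style update on the freshly labeled pairs.
        dpo_update(pairs)

# Toy run with dummy stand-ins, just to show the control flow.
semi_online_preference_loop(
    policy_sample=lambda p: f"{p} [draft {random.randint(0, 99)}]",
    reward_score=lambda p, c: len(c),              # placeholder scoring rule
    dpo_update=lambda pairs: print(f"DPO update on {len(pairs)} pairs"),
    prompts=["Explain PPO.", "Explain DPO.", "Summarize the trade-offs."],
)
```

The appeal of such a loop is that it reuses DPO's lightweight update while recovering some of the on-policy freshness that appears to drive PPO's advantage.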
Conclusion
The paper by Ivison et al. provides a meticulous dissection of preference-based learning in LLMs. Their findings offer a robust framework for current AI practitioners and open intriguing avenues for future academic inquiry, ensuring that advancements in preference learning continue to refine and elevate the capabilities of LLMs across diverse applications.