Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (2404.10719v3)

Published 16 Apr 2024 in cs.CL

Abstract: Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align LLMs with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.

Introduction

Aligning LLMs with human preferences is a central problem in AI research, and Reinforcement Learning from Human Feedback (RLHF) is its most widely used solution. This paper compares Direct Preference Optimization (DPO), a reward-free method, with Proximal Policy Optimization (PPO), a reward-based method, to evaluate their efficacy in aligning LLMs. Despite DPO's strong showing on academic benchmarks, we scrutinize its theoretical and empirical limitations and conduct a thorough analysis of PPO, uncovering the key factors behind its best performance in RLHF. Our empirical benchmarks across diverse RLHF testbeds, including dialogue and code generation tasks, provide new insights into the comparative advantages of PPO over DPO and other alignment methods.

Theoretical and Empirical Insights into DPO's Limitations

Our analysis reveals a fundamental theoretical limitation of DPO: it can converge to biased solutions that exploit out-of-distribution (OOD) responses, i.e., responses not covered by the preference dataset. This susceptibility poses a basic challenge for aligning models with human preferences whenever there is a distribution shift between the model's outputs and the preference data. Empirical analyses further show that DPO's performance degradation can be attributed to such distribution shifts, underscoring the need to mitigate them to improve alignment efficacy.
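
For context, recall the DPO objective from the original DPO paper (reproduced here as background; the notation is the standard one rather than taken from this study). Given a preference dataset D of prompts x with preferred and dispreferred responses (y_w, y_l), DPO minimizes

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The loss only constrains the relative log-probability gap between the paired responses that actually appear in the dataset; nothing in it directly penalizes probability mass shifted onto responses outside the dataset's support, which is consistent with the OOD exploitation the authors analyze.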

Unveiling Key Factors for PPO's Efficacy in RLHF

An examination of PPO's algorithmic components identifies three factors that are key to its performance in LLM alignment: advantage normalization, a large batch size, and an exponential moving average (EMA) update of the reference model. Comprehensive ablation studies show that these factors substantially improve PPO's robustness and effectiveness. Large-batch training in particular proves pivotal for avoiding performance degradation, underpinning PPO's strong results in demanding RLHF applications such as code generation.
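
The paper does not give pseudocode at this point, so the snippet below is only a minimal PyTorch-style sketch of two of these components, advantage normalization and the EMA reference-model update. The function names, the decay constant, and the surrounding training loop are illustrative assumptions, not the authors' implementation (their code is in the linked ReaLHF repository). Large-batch training is a data-pipeline setting (more prompts and rollouts per PPO iteration) rather than extra code, so it is noted only in a comment.

```python
import torch


def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Advantage normalization: rescale per-batch advantages to zero mean, unit variance."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


@torch.no_grad()
def ema_update_reference(ref_model: torch.nn.Module,
                         policy: torch.nn.Module,
                         decay: float = 0.995) -> None:
    """EMA update: slowly move the KL reference model toward the current policy.

    `decay` is an illustrative value, not one reported in the paper.
    """
    for ref_param, param in zip(ref_model.parameters(), policy.parameters()):
        ref_param.mul_(decay).add_(param, alpha=1.0 - decay)


# Large-batch training is the third factor: it amounts to collecting many more
# prompts and rollouts per PPO iteration before each policy update, which is a
# configuration choice in the training loop rather than additional code here.
```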

Benchmarking DPO and PPO Across RLHF Testbeds

Our experiments across a range of RLHF testbeds show that PPO outperforms the other alignment methods in every case, notably achieving state-of-the-art results on challenging code competition benchmarks. DPO's effectiveness, by contrast, is limited in practice by the theoretical and empirical constraints identified above, particularly on demanding tasks that push the limits of model alignment. These findings call into question the purported superiority of DPO for LLM alignment and prompt a reevaluation of alignment strategies within the research community.

Implications and Future Directions

The comprehensive scrutiny of DPO and PPO within this paper not only challenges prevailing notions regarding LLM alignment methods but also opens new avenues for future research. The insights into DPO's limitations and the delineation of critical factors for enhancing PPO's performance offer a foundation for developing more robust and effective alignment strategies. As the AI field continues to progress, the lessons from this paper could guide the refinement of RLHF methodologies, ensuring that LLMs are more finely tuned to human preferences and societal values.

The evolving landscape of LLM alignment necessitates ongoing theoretical and empirical investigations to iteratively refine and develop methodologies that ensure models serve the broader interests of humanity. This paper represents a step forward in this journey, offering a critical evaluation of existing approaches and paving the way for future advancements in AI alignment research.

Authors (9)
  1. Shusheng Xu (11 papers)
  2. Wei Fu (59 papers)
  3. Jiaxuan Gao (14 papers)
  4. Wenjie Ye (8 papers)
  5. Weilin Liu (6 papers)
  6. Zhiyu Mei (6 papers)
  7. Guangju Wang (5 papers)
  8. Chao Yu (116 papers)
  9. Yi Wu (171 papers)