Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (2404.10719v3)

Published 16 Apr 2024 in cs.CL

Abstract: Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align LLMs with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.

Introduction

Aligning LLMs with human preferences is a central problem in AI research, and Reinforcement Learning from Human Feedback (RLHF) is its most widely used solution. This paper compares Direct Preference Optimization (DPO), a reward-free method, with Proximal Policy Optimization (PPO), a reward-based method, to evaluate their efficacy in aligning LLMs. Despite DPO's strong showing on academic benchmarks, we scrutinize its theoretical and empirical limitations and conduct a thorough analysis of PPO, uncovering the key factors behind its best performance in RLHF. Our empirical benchmarks across diverse RLHF testbeds, including dialogue and code generation tasks, provide new insights into the comparative advantages of PPO over DPO and other alignment methods.

Theoretical and Empirical Insights into DPO's Limitations

Our analysis reveals a fundamental theoretical limitation of DPO: it can converge to biased solutions that exploit out-of-distribution (OOD) responses, i.e., responses not covered by the preference dataset. This susceptibility poses a basic challenge for aligning models with human preferences whenever there is a distribution shift between the model's outputs and the preference data. Empirical analyses further show that DPO's performance degradation can be attributed to such distribution shifts, underscoring the need to mitigate them to improve alignment efficacy.
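
For context, recall the DPO objective from the original DPO paper (reproduced here as background; the notation is the standard one rather than taken from this study). Given a preference dataset D of prompts x with preferred and dispreferred responses (y_w, y_l), DPO minimizes

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The loss only constrains the relative log-probability gap between the paired responses that actually appear in the dataset; nothing in it directly penalizes probability mass shifted onto responses outside the dataset's support, which is consistent with the OOD exploitation the authors analyze.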

Unveiling Key Factors for PPO's Efficacy in RLHF

An examination of PPO's algorithmic components identifies three factors that are key to its performance in LLM alignment: advantage normalization, a large batch size, and an exponential moving average (EMA) update of the reference model. Comprehensive ablation studies show that these factors substantially improve PPO's robustness and effectiveness. Large-batch training in particular proves pivotal for avoiding performance degradation, underpinning PPO's strong results in demanding RLHF applications such as code generation.
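
The paper does not give pseudocode at this point, so the snippet below is only a minimal PyTorch-style sketch of two of these components, advantage normalization and the EMA reference-model update. The function names, the decay constant, and the surrounding training loop are illustrative assumptions, not the authors' implementation (their code is in the linked ReaLHF repository). Large-batch training is a data-pipeline setting (more prompts and rollouts per PPO iteration) rather than extra code, so it is noted only in a comment.

```python
import torch


def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Advantage normalization: rescale per-batch advantages to zero mean, unit variance."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


@torch.no_grad()
def ema_update_reference(ref_model: torch.nn.Module,
                         policy: torch.nn.Module,
                         decay: float = 0.995) -> None:
    """EMA update: slowly move the KL reference model toward the current policy.

    `decay` is an illustrative value, not one reported in the paper.
    """
    for ref_param, param in zip(ref_model.parameters(), policy.parameters()):
        ref_param.mul_(decay).add_(param, alpha=1.0 - decay)


# Large-batch training is the third factor: it amounts to collecting many more
# prompts and rollouts per PPO iteration before each policy update, which is a
# configuration choice in the training loop rather than additional code here.
```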

Benchmarking DPO and PPO Across RLHF Testbeds

Our experiments across a range of RLHF testbeds show that PPO outperforms the other alignment methods in every case, notably achieving state-of-the-art results on challenging code competition benchmarks. DPO's effectiveness, by contrast, is limited in practice by the theoretical and empirical constraints identified above, particularly on demanding tasks that push the limits of model alignment. These findings call into question the purported superiority of DPO for LLM alignment and prompt a reevaluation of alignment strategies within the research community.

Implications and Future Directions

The comprehensive scrutiny of DPO and PPO within this paper not only challenges prevailing notions regarding LLM alignment methods but also opens new avenues for future research. The insights into DPO's limitations and the delineation of critical factors for enhancing PPO's performance offer a foundation for developing more robust and effective alignment strategies. As the AI field continues to progress, the lessons from this paper could guide the refinement of RLHF methodologies, ensuring that LLMs are more finely tuned to human preferences and societal values.

The evolving landscape of LLM alignment necessitates ongoing theoretical and empirical investigations to iteratively refine and develop methodologies that ensure models serve the broader interests of humanity. This paper represents a step forward in this journey, offering a critical evaluation of existing approaches and paving the way for future advancements in AI alignment research.

Authors (9)
  1. Shusheng Xu (11 papers)
  2. Wei Fu (59 papers)
  3. Jiaxuan Gao (14 papers)
  4. Wenjie Ye (8 papers)
  5. Weilin Liu (6 papers)
  6. Zhiyu Mei (6 papers)
  7. Guangju Wang (5 papers)
  8. Chao Yu (116 papers)
  9. Yi Wu (171 papers)