
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive (2402.13228v2)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of LLMs on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.

Enhancing Preference Optimization in LLMs with DPO-Positive

Introduction

The evolution of LLMs has underscored the importance of aligning models with human preferences to ensure fluency and effectiveness across tasks. Direct Preference Optimization (DPO) has emerged as a key technique for this, using pairs of preferred and dispreferred completions to model the relative probability of one response over another. However, the paper identifies a notable limitation of standard DPO: the model's likelihood of the preferred examples can actually decrease during training, a problem that is particularly evident in datasets with small edit distances between pairs of completions. To address this, the authors introduce DPO-Positive (DPOP), a new loss function and training procedure that avoids this failure mode and improves on DPO across diverse datasets and tasks.

Background and Related Work

The development of LLMs has been significantly aided by methods that use human-written or human-preferred completions to fine-tune models for better downstream performance. Among these, reinforcement learning from human feedback (RLHF) and DPO are prominent. DPO in particular has gained traction because it optimizes preferences directly, without learning an explicit reward model, by increasing the log-probability of preferred completions relative to dispreferred ones, measured against a frozen reference model.
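As a rough illustration (not the authors' code), the DPO objective can be written as a function of per-sequence log-probabilities under the trained policy and a frozen reference model. The function name, argument names, and the default beta below are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch (assumed names, not the paper's code).

    Each argument is a tensor of summed token log-probabilities for the
    preferred (w) and dispreferred (l) completions under the trained policy
    and the frozen reference model. `beta` scales the implicit reward.
    """
    # Implicit rewards are log-ratios against the reference model.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # DPO maximises the margin between the two implicit rewards.
    return -F.logsigmoid(reward_w - reward_l).mean()
```

Note that the loss depends only on the difference of the two log-ratios, which is what opens the door to the failure mode discussed next.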

Failure Mode of DPO

A closer analysis of the DPO loss reveals a critical failure mode: because the loss only requires the margin between preferred and dispreferred completions to grow, it can decrease even while the probability of the preferred completion falls, so long as the dispreferred completion's probability falls faster. The paper shows theoretically and empirically that this happens in practice when the edit distance between preference pairs is small, since the two completions then share most of their tokens, and the result is degraded performance on the very examples the model is meant to prefer. The toy calculation below makes the mechanism concrete.
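The numbers in this sketch are purely illustrative (they are not taken from the paper); they show how both completions can lose probability mass while the DPO loss still improves.

```python
import math

beta = 0.1
# Hypothetical summed log-probs before and after a DPO update.
# The reference model's log-probs are held fixed at the "before" values.
ref_w, ref_l = -10.0, -10.0
before_w, before_l = -10.0, -10.0
after_w, after_l = -15.0, -25.0   # preferred completion got *less* likely

margin_before = beta * ((before_w - ref_w) - (before_l - ref_l))  # 0.0
margin_after = beta * ((after_w - ref_w) - (after_l - ref_l))     # 1.0

loss_before = -math.log(1 / (1 + math.exp(-margin_before)))  # ~0.69
loss_after = -math.log(1 / (1 + math.exp(-margin_after)))    # ~0.31
print(loss_before, loss_after)  # loss drops even though P(preferred) fell
```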

Introducing DPO-Positive (DPOP)

To counteract this failure mode, DPOP adds a penalty term to the loss that discourages any drop in the model's log-likelihood of the preferred completion relative to the reference model. This keeps training from sacrificing the preferred completions while still widening the preference margin, and in practice DPOP outperforms DPO across a broad range of datasets, including those with large edit distances between completion pairs. These gains carry over to the Smaug class of models trained with DPOP, which achieve state-of-the-art open-source performance.
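The sketch below shows one way such a penalty can be wired into the DPO loss, following the paper's description of penalising any reduction of the preferred completion's likelihood below the reference model's. The exact placement of the penalty and the default values of `beta` and `lam` are my reading of the method and should be checked against the paper.

```python
import torch
import torch.nn.functional as F

def dpop_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
              beta=0.1, lam=50.0):
    """Sketch of DPO-Positive: standard DPO plus a penalty that fires
    whenever the policy assigns *less* probability to the preferred
    completion than the reference model does (assumed names/defaults)."""
    reward_w = policy_logp_w - ref_logp_w
    reward_l = policy_logp_l - ref_logp_l
    # Zero while the preferred completion is at least as likely as under
    # the reference model; positive (and penalised) otherwise.
    penalty = torch.clamp(ref_logp_w - policy_logp_w, min=0.0)
    return -F.logsigmoid(beta * (reward_w - reward_l - lam * penalty)).mean()
```

Because the penalty is clamped at zero, the objective reduces to ordinary DPO whenever the preferred completion has not lost probability mass, and only intervenes when the failure mode would otherwise occur.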

Contribution and Results

The paper's contributions are threefold: a theoretical and empirical analysis of the DPO failure mode, the formulation of DPOP as a more robust alternative, and the Smaug class of models built with it, which push the boundaries of open-source LLM performance. In particular, Smaug-72B becomes the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.

Conclusion and Future Directions

While DPOP marks a significant step toward more reliable preference optimization in LLMs, the work also acknowledges limitations in the scale and linguistic focus of the tested datasets. It points to further exploration of preference-based fine-tuning on a more diverse range of datasets, including non-English languages. The findings contribute to the development of more accurate and better-aligned LLMs and underline the need to keep evaluating and adapting existing methodologies as new failure modes emerge.

Limitations and Impact

The authors acknowledge that advanced fine-tuning techniques and strong open models can be misused to generate harmful content. However, the focus on mathematical and reasoning tasks, together with a better understanding of preference optimization, points toward a positive societal impact. The Smaug models are released after weighing their capabilities against those of existing open-source models, with safety and responsible use in mind.

This work stands as a testament to the dynamic nature of AI research, where the detection of methodological weaknesses becomes the foundation for innovation, driving the field towards the development of LLMs that are not only powerful but also closely aligned with human values and preferences.

Authors (6)
  1. Arka Pal
  2. Deep Karkhanis
  3. Samuel Dooley
  4. Manley Roberts
  5. Siddartha Naidu
  6. Colin White
Citations (93)