Direct Preference Optimization with an Offset (2402.10571v2)

Published 16 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning LLMs with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes an LLM to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference against that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihoods of the preferred and dispreferred responses to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over the other. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning LLMs, especially when the number of preference pairs is limited.
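The abstract describes ODPO as a margin-style generalization of DPO: the model's implicit reward for the preferred response must exceed that of the dispreferred response by a per-pair offset, rather than by any positive amount. Below is a minimal PyTorch sketch of that idea, assuming per-response log-probabilities under the policy and reference models have already been summed; the function name, `beta`, and the example offsets are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def odpo_loss(policy_logp_w, policy_logp_l,
              ref_logp_w, ref_logp_l,
              offset, beta=0.1):
    """Sketch of an offset (margin) variant of the DPO loss.

    policy_logp_* / ref_logp_*: summed log-probabilities of the preferred (w)
    and dispreferred (l) responses under the policy and reference models.
    offset: per-pair margin reflecting how strongly w is preferred over l;
            offset = 0 recovers the standard DPO loss.
    """
    # Implicit reward of each response, as in DPO.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Require the reward gap to exceed the offset, not just zero.
    return -F.logsigmoid(reward_w - reward_l - offset).mean()

# Toy usage: the second pair carries a larger (hypothetical) offset, e.g.,
# because the dispreferred response was rated as toxic.
policy_w = torch.tensor([-12.0, -15.0])
policy_l = torch.tensor([-14.0, -13.0])
ref_w    = torch.tensor([-13.0, -14.0])
ref_l    = torch.tensor([-13.5, -13.2])
offsets  = torch.tensor([0.5, 3.0])
print(odpo_loss(policy_w, policy_l, ref_w, ref_l, offsets))
```

Per the abstract, the offset grows with how strongly one response is preferred over the other, so pairs with a large quality gap push the model harder than near-ties do.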

Authors (3)
  1. Afra Amini (16 papers)
  2. Tim Vieira (29 papers)
  3. Ryan Cotterell (226 papers)
Citations (38)