
New Desiderata for Direct Preference Optimization (2407.09072v1)

Published 12 Jul 2024 in cs.CL

Abstract: LLMs in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses.

Overview

The paper "New Desiderata for Direct Preference Optimization" addresses significant gaps in current methodologies for fine-tuning LLMs to align better with human preferences. Traditionally, these methods have relied on Reinforcement Learning with Human Feedback (RLHF), which involves training a reward model that reflects human inclinations and subsequently fine-tuning the policy to balance reward maximization with proximity to a pre-trained reference model. However, inherent instabilities and complexities in RLHF have led to the emergence of Direct Preference Optimization (DPO) techniques, which sidestep the need for a separate reward model by minimizing a single closed-form training objective.

Key Contributions

The paper provides a thorough examination of the limitations of existing DPO methods and of how they might be improved. The authors introduce new evaluation criteria designed to expose enduring weaknesses in these methods, including issues with interpolation between a pre-trained reference model and empirical human preferences, and challenges in balancing the regularization of low- and high-quality responses.

  1. Evaluation Criteria and Shortcomings:
    • The authors introduce new evaluation criteria that elucidate the limitations of current DPO methods. For instance, they show that most existing DPO methods fail to adequately interpolate between a reference model and human preferences, and in particular cannot selectively preserve performance in regions where the reference model already excels.
    • These shortcomings are linked to the uniform regularization effects of commonly used DPO objectives, which do not account for varying performance across different regions of the input space.
  2. Constraints and Reparameterizations:
    • The paper proves that once learning constraints (e.g., early-stopping, weight decay) are introduced, the core reparameterizations underlying certain DPO models no longer hold. This observation drives the need for alternative justifications based solely on the properties of the final loss functions without relying on constraint-dependent reparameterizations.
  3. New Preference Optimization Loss:
    • Motivated by the shortcomings of existing models, the authors propose a new loss function, $\ell_{\text{TYPO}}$, designed to satisfy their evaluation desiderata while avoiding dependency on reparameterizations affected by constraints.
    • This new loss aims to balance proximity to a pre-trained reference policy against human preferences more effectively, providing a smoother and more nuanced interpolation between these objectives (a generic DPO-style loss computation is sketched right after this list for comparison).
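
As a concrete point of comparison (this is the standard DPO loss, not the paper's $\ell_{\text{TYPO}}$ loss, whose exact form is not reproduced in this summary), here is a minimal PyTorch-style sketch of how such a closed-form preference objective is computed from sequence-level log-probabilities; the function name and batching conventions are illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss computed from per-example sequence log-probabilities."""
    # Log-ratios of the trainable policy to the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry style margin, scaled by the regularization strength beta
    logits = beta * (chosen_logratios - rejected_logratios)
    # Negative log-sigmoid of the margin, averaged over the batch
    return -F.logsigmoid(logits).mean()


# Example usage with dummy log-probabilities for a batch of 4 preference pairs
if __name__ == "__main__":
    fake_logps = lambda: -10.0 * torch.rand(4)  # placeholder sequence log-probs
    print(dpo_loss(fake_logps(), fake_logps(), fake_logps(), fake_logps()).item())
```

Note how the reference log-probabilities enter only through per-example log-ratios; the paper ties several of the identified shortcomings to this uniform treatment across the input space.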

Theoretical and Practical Implications

On the theoretical side, the paper offers substantial insight into the mechanics of DPO methods, explaining why current objectives are too inflexible to selectively preserve strong performance in regions where the reference model is already optimal.
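
To make the constraint argument from item 2 above concrete (notation as in the earlier display, with $Z(x)$ a normalizing partition function introduced here), recall that the DPO derivation rests on the closed-form optimum of the KL-regularized objective:

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left( \frac{r(x, y)}{\beta} \right), \qquad Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left( \frac{r(x, y)}{\beta} \right)$$

Inverting this relationship to write $r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$ is what allows the separate reward model to be eliminated. The paper's point is that once early stopping, weight decay, or similar constraints keep the learned policy away from this unconstrained optimum, the substitution is no longer exact, so the resulting loss must be justified on its own terms.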

The practical implications are broad and significant for the future of AI and LLM development:

  • Enhanced Model Alignment: By addressing critical shortcomings in preference optimization methods, this research offers a pathway to develop LLMs that better meet human expectations, thus making interactions with AI systems more intuitive and satisfactory.
  • Constraint Integration: The insights into how learning constraints affect preference optimization models provide valuable guidelines for designing robust training procedures that maintain model efficacy even under practical constraints such as limited computational resources or stringent regularization requirements.

Future Developments

Looking ahead, the proposed $\ell_{\text{TYPO}}$ loss function could serve as a foundation for more advanced DPO frameworks, potentially sparking new lines of research focused on refining preference optimization through adaptive mechanisms that account for data variability and usage constraints.

Additionally, the methods and insights discussed in the paper could extend beyond text-based LLMs to other domains such as image and speech processing, where alignment with human preferences is equally critical. The emphasis on empirical validation and theoretical soundness could lead to more generalizable models and frameworks, facilitating the broader adoption of preference-aware optimization in various AI applications.

In conclusion, this paper contributes significantly to the ongoing development of LLMs by addressing existing gaps in preference optimization methodologies. It offers a well-rounded perspective that combines theoretical rigor with practical considerations, paving the way for more nuanced and human-aligned AI systems.

Authors (3)
  1. Xiangkun Hu (19 papers)
  2. Tong He (124 papers)
  3. David Wipf (59 papers)
Citations (2)