New Desiderata for Direct Preference Optimization
Abstract: Large language models (LLMs) have typically relied on some form of reinforcement learning from human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, fine-tuning directly on human preferences is accomplished by minimizing a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and how constraints are handled. Our insights motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results corroborate notable aspects of our analyses.
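For context, the "single closed-form training objective" referenced above is, in the original DPO formulation of Rafailov et al., a logistic loss applied to the difference of policy-versus-reference log-ratios between preferred and dispreferred responses. Below is a minimal PyTorch sketch of that standard baseline objective; the function name, the beta value, and the assumption of precomputed sequence log-probabilities are illustrative, and this is not the alternative loss proposed in this paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the trainable policy against the frozen reference model,
    # evaluated on the preferred (chosen) and dispreferred (rejected) responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO minimizes -log sigmoid(beta * (chosen - rejected) log-ratio gap).
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random sequence log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```

The temperature-like parameter beta controls how strongly the policy is regularized toward the reference model, which is precisely the interpolation behavior the paper's new evaluation criteria scrutinize.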