$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization (2410.21662v2)
Abstract: Preference optimization has made significant progress recently, with numerous methods developed to align LLMs with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art LLMs using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of LLM alignment. Code is available at https://github.com/MinkaiXu/fPO.
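For orientation, here is a minimal sketch of the objective the abstract describes, using only the textbook definition of an $f$-divergence (a convex $f$ with $f(1)=0$); the argument order and the KL special cases below are illustrative assumptions, not necessarily the paper's exact formulation:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Sketch of the f-PO objective from the abstract: minimize an f-divergence between
% the optimized policy pi_theta and the target (optimal) policy pi^*.
% The argument order below is assumed for illustration only.
\[
  \min_{\theta}\; D_f\!\left(\pi^{*} \,\middle\|\, \pi_{\theta}\right),
  \qquad
  D_f\!\left(\pi^{*} \,\middle\|\, \pi_{\theta}\right)
  = \mathbb{E}_{y \sim \pi_{\theta}(\cdot\mid x)}
    \left[ f\!\left(\frac{\pi^{*}(y\mid x)}{\pi_{\theta}(y\mid x)}\right) \right],
  \quad f \text{ convex},\ f(1)=0.
\]
% Standard identities: f(t) = t log t gives KL(pi^* || pi_theta) (forward KL), while
% f(t) = -log t gives KL(pi_theta || pi^*) (reverse KL). Sweeping over f in this way
% is what yields the family of alignment objectives (e.g., DPO- and EXO-style losses)
% that the abstract says f-PO unifies.
\end{document}
```

Different choices of $f$ penalize mass-covering versus mode-seeking behaviour differently, which is one way to read the regularization-versus-performance trade-off the abstract's ablations refer to.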
- A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142.
- A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036.
- Training a helpful and harmless assistant with reinforcement learning from human feedback.
- On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.
- Pythia: A suite for analyzing large language models across training and scaling.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.
- Deep reinforcement learning from human preferences. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R., editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.
- Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528.
- UltraFeedback: Boosting language models with high-quality feedback.
- Enhancing chat language models by scaling high-quality instructional conversations. In EMNLP.
- The Llama 3 herd of models.
- Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
- ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.
- Towards efficient and exact optimization of language model alignment. arXiv preprint arXiv:2402.00856.
- Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
- Mistral 7B.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Scalable agent alignment via reward modeling: A research direction. arXiv preprint arXiv:1811.07871.
- From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412.
- LiPO: Listwise preference optimization through learning-to-rank.
- Statistical rejection sampling improves preference optimization.
- SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734.
- On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13.
- f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159.
- Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Preference ranking optimization for human alignment.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749.
- TL;DR: Mining Reddit to learn automatic summarization. In Wang, L., Cheung, J. C. K., Carenini, G., and Liu, F., editors, Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational Linguistics.
- Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240.
- Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417.
- RRHF: Rank responses to align language models with human feedback. In NeurIPS.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS Datasets and Benchmarks Track.