$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization (2410.21662v2)

Published 29 Oct 2024 in cs.CL and cs.LG

Abstract: Preference optimization has made significant progress recently, with numerous methods developed to align LLMs with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art LLMs using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of LLM alignment. Code is available at https://github.com/MinkaiXu/fPO.

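To make the abstract's framing concrete, below is a minimal, hypothetical sketch of how a pairwise $f$-PO-style loss could be organized. It assumes the common preference-optimization setup in which implicit rewards are $\beta$-scaled log-probability ratios against a reference model, and the loss is an $f$-divergence between a (label-smoothed) preference target and the model's two-way softmax over the chosen and rejected responses. The function name `f_po_pairwise_loss`, the argument names, and the specific divergence formulas are illustrative assumptions, not the authors' implementation; the official code is at https://github.com/MinkaiXu/fPO.

```python
# Hypothetical sketch (not the authors' reference implementation): a pairwise
# preference loss parameterized by a choice of f-divergence, illustrating how
# DPO-style and EXO-style objectives can arise as special cases.
import torch
import torch.nn.functional as F


def f_po_pairwise_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 0.1,
    divergence: str = "forward_kl",       # "forward_kl", "reverse_kl", "js", "alpha"
    label_smoothing: float = 1e-3,        # smoothed target keeps reverse KL / JS finite
    alpha: float = 0.5,
) -> torch.Tensor:
    """Minimize an f-divergence between a smoothed preference target and the
    model's implicit preference distribution over the (chosen, rejected) pair."""
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Model's distribution over which response is preferred (two-way softmax).
    logits = torch.stack([chosen_rewards, rejected_rewards], dim=-1)  # (B, 2)
    log_p_model = F.log_softmax(logits, dim=-1)
    p_model = log_p_model.exp()

    # Target distribution: (nearly) all mass on the human-preferred response.
    target = torch.tensor(
        [1.0 - label_smoothing, label_smoothing],
        device=logits.device, dtype=logits.dtype,
    ).expand_as(logits)

    if divergence == "forward_kl":
        # KL(target || model): with a one-hot target this is ordinary
        # cross-entropy, i.e. the DPO loss -log sigmoid(beta * margin).
        loss = torch.sum(target * (torch.log(target) - log_p_model), dim=-1)
    elif divergence == "reverse_kl":
        # KL(model || target): mode-seeking; an EXO-style objective.
        loss = torch.sum(p_model * (log_p_model - torch.log(target)), dim=-1)
    elif divergence == "js":
        # Jensen-Shannon divergence between target and model distributions.
        m = 0.5 * (p_model + target)
        loss = 0.5 * torch.sum(p_model * (log_p_model - torch.log(m)), dim=-1) \
             + 0.5 * torch.sum(target * (torch.log(target) - torch.log(m)), dim=-1)
    elif divergence == "alpha":
        # alpha-divergence family interpolating between forward and reverse KL.
        loss = (1.0 - torch.sum(target.pow(alpha) * p_model.pow(1.0 - alpha), dim=-1)) \
               / (alpha * (1.0 - alpha))
    else:
        raise ValueError(f"unknown divergence: {divergence}")
    return loss.mean()
```

In this sketch, the forward-KL choice with a one-hot target reduces to the familiar DPO sigmoid loss, while the reverse-KL and Jensen-Shannon choices behave more conservatively, which is one way to picture the regularization-versus-performance trade-off the paper's ablations explore across different $f$-divergences.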
