
MallowsPO: Fine-Tune Your LLM with Preference Dispersions (2405.14953v3)

Published 23 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop in this paper a new approach, the MallowsPO. A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preference to prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, thus unified with MallowsPO. More importantly, we demonstrate (empirically) how to use this dispersion index to enhance the performance of DPO in a broad array of benchmark tasks, from synthetic bandit selection to controllable generations and dialogues, while maintaining great generalization capabilities. MallowsPO is also compatible with other SOTA offline preference optimization methods, boosting nearly 2% extra LC win rate when used as a plugin for fine-tuning Llama3-Instruct.
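As a quick illustration of the abstract's key idea, the dispersion index can be thought of as a prompt-dependent weight applied to the DPO preference margin. The sketch below is a minimal PyTorch illustration under that assumption only; it is not the paper's exact formulation, and the names used here (dpo_style_loss, dispersion, the *_logps_* tensors) are hypothetical. Setting the dispersion weight to 1 recovers the standard DPO logistic loss, consistent with the claim that existing DPO models reduce to special cases.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_logps_w, policy_logps_l,
                   ref_logps_w, ref_logps_l,
                   dispersion, beta=0.1):
    """DPO-style preference loss with a hypothetical per-prompt dispersion weight.

    Each *_logps_* argument is the summed log-probability of the chosen (w)
    or rejected (l) response under the policy or the frozen reference model,
    shape [batch]. `dispersion` (shape [batch]) stands in for a dispersion
    index; dispersion = 1 gives the standard DPO objective.
    """
    # Implicit rewards: scaled log-ratios of policy vs. reference
    chosen_reward = beta * (policy_logps_w - ref_logps_w)
    rejected_reward = beta * (policy_logps_l - ref_logps_l)
    # Weight the preference margin by the per-prompt dispersion
    margin = dispersion * (chosen_reward - rejected_reward)
    # Bradley-Terry-style logistic loss on the weighted margin
    return -F.logsigmoid(margin).mean()

# Example usage with random log-probabilities
if __name__ == "__main__":
    b = 4
    loss = dpo_style_loss(torch.randn(b), torch.randn(b),
                          torch.randn(b), torch.randn(b),
                          dispersion=torch.ones(b))
    print(loss.item())
```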

Authors (5)
  1. Haoxian Chen (15 papers)
  2. Hanyang Zhao (12 papers)
  3. Henry Lam (91 papers)
  4. David Yao (2 papers)
  5. Wenpin Tang (58 papers)
Citations (2)