VPO: Leveraging the Number of Votes in Preference Optimization (2410.22891v1)

Published 30 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Direct Preference Optimization (DPO) trains an LLM using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
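For intuition, here is a minimal sketch of how vote counts could supply a soft probability target for a DPO-style objective. This is not the authors' implementation: the symmetric Beta(alpha, alpha) prior, the soft-label binary cross-entropy form, and the helper names `vote_target` and `vdpo_style_loss` are assumptions made purely for illustration.

```python
# Hedged sketch of a vote-aware DPO-style loss (not the paper's exact formulation).
import torch
import torch.nn.functional as F


def vote_target(votes_w: torch.Tensor, votes_l: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Posterior-mean (MMSE) estimate of P(winner preferred to loser),
    # assuming a symmetric Beta(alpha, alpha) prior over that probability.
    return (votes_w + alpha) / (votes_w + votes_l + 2.0 * alpha)


def vdpo_style_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                    votes_w, votes_l, beta: float = 0.1, alpha: float = 1.0):
    # DPO-style implicit reward margin: beta times the difference of
    # policy/reference log-ratios for the winning and losing responses.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Soft target derived from the vote counts instead of a hard label of 1.
    p_hat = vote_target(votes_w, votes_l, alpha)
    # Cross-entropy against the soft target; approaches standard DPO as p_hat -> 1.
    return F.binary_cross_entropy_with_logits(margin, p_hat)


# A clear-cut pair (9 vs. 1 votes) yields a target near 1, while a controversial
# pair (5 vs. 5) yields 0.5, softening the update for that pair.
votes_w = torch.tensor([9.0, 5.0])
votes_l = torch.tensor([1.0, 5.0])
print(vote_target(votes_w, votes_l))  # tensor([0.8333, 0.5000]) with alpha = 1
```

The point of the sketch is the target: clear-cut pairs push the estimated preference probability toward 1 (behaving like vanilla DPO), while evenly split votes pull it toward 0.5, so controversial pairs contribute weaker gradients.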

