Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization (2405.16681v1)

Published 26 May 2024 in cs.CL

Abstract: LLMs perform well across diverse tasks, but aligning them with human demonstrations is challenging. Recently, Reinforcement Learning (RL)-free methods like Direct Preference Optimization (DPO) have emerged, offering improved stability and scalability while retaining competitive performance relative to RL-based methods. However, while RL-free methods deliver satisfactory performance, they require significant data to develop a robust Supervised Fine-Tuned (SFT) model and an additional step to fine-tune this model on a preference dataset, which constrains their utility and scalability. In this paper, we introduce Triple Preference Optimization (TPO), a new preference learning method designed to align an LLM with three preferences without requiring a separate SFT step and using considerably less data. Through a combination of practical experiments and theoretical analysis, we show the efficacy of TPO as a single-step alignment strategy. Specifically, we fine-tuned the Phi-2 (2.7B) and Mistral (7B) models using TPO directly on the UltraFeedback dataset, achieving superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO. Moreover, the performance of TPO without the SFT component led to notable improvements in the MT-Bench score, with increases of +1.27 and +0.63 over SFT and DPO, respectively. Additionally, TPO showed higher average accuracy, surpassing DPO and SFT by 4.2% and 4.97% on the Open LLM Leaderboard benchmarks. Our code is publicly available at https://github.com/sahsaeedi/triple-preference-optimization .
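
For intuition, here is a minimal PyTorch sketch of what a TPO-style objective could look like, assuming it pairs an SFT-style likelihood term on a designated gold response with a DPO-style contrast between the remaining preferred and rejected responses. The function name `tpo_loss`, the weights `alpha` and `beta`, and the per-sequence log-probability inputs are illustrative assumptions, not the authors' reference implementation (that is available at the GitHub repository linked above).

```python
# Minimal sketch of a TPO-style loss, assuming it combines:
#   (1) an SFT-style negative log-likelihood on a "gold" response, and
#   (2) a DPO-style preference term over a chosen/rejected pair.
# All names and default hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F

def tpo_loss(
    policy_logp_gold: torch.Tensor,      # log pi_theta(y_gold | x), shape (batch,)
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    alpha: float = 1.0,                  # weight on the gold-response likelihood term
    beta: float = 0.1,                   # temperature of the preference term
) -> torch.Tensor:
    # SFT-style term: maximize likelihood of the gold response,
    # standing in for the separate SFT stage that TPO removes.
    sft_term = -policy_logp_gold

    # DPO-style term: widen the margin between chosen and rejected responses,
    # measured relative to a frozen reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    pref_term = -F.logsigmoid(chosen_reward - rejected_reward)

    return (alpha * sft_term + pref_term).mean()

if __name__ == "__main__":
    # Dummy per-sequence log-probabilities for a batch of 4 prompts.
    b = torch.randn(4)
    print(tpo_loss(b, b - 0.5, b - 1.0, b - 0.6, b - 0.9))
```

In this reading, a single optimization pass over triples (gold, preferred, rejected) plays the role of both the SFT stage and the preference-alignment stage, which is what lets TPO skip the separate SFT step described in the abstract.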

References (45)
  1. PaLM 2 technical report.
  2. A general theoretical paradigm to understand learning from human preferences.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback.
  4. BIG-bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  5. Heejong Bong and Alessandro Rinaldo. 2022. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model.
  6. Language models are few-shot learners.
  7. Sparks of artificial general intelligence: Early experiments with GPT-4.
  8. Getting it right: Improving spatial consistency in text-to-image models. arXiv preprint arXiv:2404.01197.
  9. Deep reinforcement learning from human preferences.
  10. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.
  11. UltraFeedback: Boosting language models with high-quality feedback.
  12. Enhancing chat language models by scaling high-quality instructional conversations.
  13. Human-aware loss functions (HALOs). Technical report, Contextual AI.
  14. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
  15. Contrastive preference learning: Learning from human feedback without RL. arXiv preprint arXiv:2310.13639.
  16. Measuring massive multitask language understanding.
  17. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
  18. LoRA: Low-rank adaptation of large language models.
  19. Phi-2: The surprising power of small language models. Microsoft Research Blog.
  20. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning.
  21. TruthfulQA: Measuring how models mimic human falsehoods.
  22. Statistical rejection sampling improves preference optimization.
  23. Alexander V. Lotov and Kaisa Miettinen. 2008. Visualizing the Pareto Frontier, pages 213–243. Springer Berlin Heidelberg, Berlin, Heidelberg.
  24. Kaisa Miettinen. 1999. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media.
  25. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.
  26. Efficient large-scale language model training on GPU clusters using Megatron-LM.
  27. Training language models to follow instructions with human feedback.
  28. Red teaming language models with language models.
  29. Direct preference optimization: Your language model is secretly a reward model.
  30. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization.
  31. Insights into alignment: Evaluating DPO and its variants across multiple tasks. arXiv preprint arXiv:2404.14723.
  32. WinoGrande: An adversarial Winograd Schema Challenge at scale.
  33. Multitask prompted training enables zero-shot task generalization.
  34. Proximal policy optimization algorithms.
  35. Learning to summarize from human feedback.
  36. LLaMA: Open and efficient foundation language models.
  37. Zephyr: Direct distillation of LM alignment.
  38. Amos Tversky and Daniel Kahneman. 1992. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323.
  39. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl.
  40. Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment.
  41. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417.
  42. RRHF: Rank responses to align language models with human feedback without tears.
  43. HellaSwag: Can a machine really finish your sentence?
  44. SLiC-HF: Sequence likelihood calibration with human feedback.
  45. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Authors (4)
  1. Amir Saeidi (8 papers)
  2. Shivanshu Verma (2 papers)
  3. Aswin RRV (5 papers)
  4. Chitta Baral (152 papers)
Citations (4)