
LiPO: Listwise Preference Optimization through Learning-to-Rank (2402.01878v2)

Published 2 Feb 2024 in cs.CL and cs.LG

Abstract: Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, a thorough study of directly fitting the policy to a list of responses is lacking. In this work, we formulate LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.

Introduction

LLMs such as GPT-4 and Gemini have shown their prowess across a breadth of tasks, from casual conversational roles to complex coding problems. To employ these models viably in everyday applications, however, one must align them with human values and preferences—a process termed 'LM alignment'. Traditional reinforcement learning techniques for this task are notoriously complex and resource-intensive. The paper "LiPO: Listwise Preference Optimization through Learning-to-Rank" proposes an alternative that treats LM alignment as a Learning-to-Rank (LTR) problem, aiming to leverage the efficiency of ranking-based methods over traditional ones in optimizing LLMs according to human feedback.

The LiPO Framework

The paper observes that prevalent preference optimization methods rarely go beyond pairwise comparisons, which may be inadequate given that human feedback often takes the form of a ranked list. In response, the authors devise the Listwise Preference Optimization (LiPO) framework, which poses LM alignment as a listwise ranking problem. This framework not only generalizes existing methods but also opens the door to richer listwise objectives.

Under LiPO, previous alignment methods can be understood as special cases of ranking objectives. For instance, DPO and SLiC reduce to pairwise ranking losses, whereas LiPO admits listwise objectives that better capture the structure of human rankings. Particularly noteworthy is the introduction of LiPO-λ, a new method built on a theoretically grounded listwise ranking objective that shows improved performance over its counterparts across evaluation tasks.
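
To make the connection to Learning-to-Rank concrete, here is a minimal sketch of the general idea, not the paper's exact formulation: each response in a list is scored by its scaled policy-to-reference log-probability ratio, and a pairwise ranking loss on the score margins is summed over every ordered pair. With a list of two responses this collapses to a single DPO-style (logistic) or SLiC-style (hinge) term. The function names and constants below are illustrative assumptions.

```python
import numpy as np

def implicit_scores(policy_logps, ref_logps, beta=0.1):
    # Score each response by its scaled policy/reference log-probability ratio.
    return beta * (np.asarray(policy_logps) - np.asarray(ref_logps))

def logistic_pair_loss(s_w, s_l):
    # Pairwise logistic (RankNet-style) loss on the score margin; with a
    # two-response list this is a DPO-style term.
    return np.log1p(np.exp(-(s_w - s_l)))

def hinge_pair_loss(s_w, s_l, margin=1.0):
    # Pairwise hinge loss on the score margin; with a two-response list this
    # is a SLiC-style term.
    return np.maximum(0.0, margin - (s_w - s_l))

def listwise_loss(scores, labels, pair_loss=logistic_pair_loss):
    # Sum the chosen pairwise loss over every pair whose labels are strictly
    # ordered; a ranked list of K responses contributes up to K*(K-1)/2 pairs.
    labels = np.asarray(labels)
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += pair_loss(scores[i], scores[j])
    return total

# Example: a list of three responses ranked 2 > 1 > 0 by preference.
s = implicit_scores(policy_logps=[-12.0, -14.0, -15.0],
                    ref_logps=[-13.0, -13.5, -14.0])
print(listwise_loss(s, labels=[2, 1, 0]))
```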

Advantages of Listwise Ranking

LiPO's advantage lies in its listwise perspective. Where traditional methods consider response pairs in isolation, LiPO-λ learns from entire lists of responses, arguably a more holistic approach. Additionally, LiPO-λ incorporates label values into its optimization, a crucial detail that earlier methods ignore. By accounting for the graded spectrum of response quality, it makes more informed alignment decisions. Empirically, across experiments on the Reddit TL;DR and AnthropicHH datasets, LiPO-λ outperformed existing methods such as DPO and SLiC by clear margins, and its benefits grew as the size of the response lists increased.
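
To illustrate how label values can inform pair weighting, the sketch below computes LambdaLoss-style pair weights: a pair's weight grows with the gain gap between its graded labels and the discount gap between the ranks the current scores induce. This is a simplified rendering for intuition, not the paper's exact Δ weighting; the function name and numbers are illustrative.

```python
import numpy as np

def lambda_weights(labels, scores):
    # Weight for pair (i, j): |gain_i - gain_j| * |discount_i - discount_j|,
    # where gains come from graded labels and discounts from the ranks induced
    # by the current scores. A simplified stand-in for the paper's Delta.
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                   # best-scored response first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based rank of each response
    gains = 2.0 ** labels - 1.0                   # graded labels, not just win/lose
    discounts = 1.0 / np.log2(ranks + 1.0)
    return (np.abs(gains[:, None] - gains[None, :])
            * np.abs(discounts[:, None] - discounts[None, :]))

# Four responses with graded labels 3 > 2 > 1 > 0 and current model scores.
W = lambda_weights(labels=[3.0, 2.0, 1.0, 0.0], scores=[0.4, 0.1, 0.3, -0.2])
print(W[0, 3], W[1, 2])  # the best-vs-worst pair far outweighs the near-tie pair
```

Pairs separating a clearly good response from a clearly bad one receive far larger weights than near-ties, which is how graded labels shape the optimization.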

Evaluation and Applications

The evaluations, which use three distinct approaches (proxy reward model, AutoSxS, and human evaluation), converge in affirming LiPO-λ's strengths. The proxy reward model, for instance, rated LiPO-λ's generated responses more favorably relative to the SFT targets than those of the other methods it was pitted against. Moreover, its scalability to larger LM policies suggests wider applicability across natural language processing tasks.
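
As a rough illustration of the proxy-reward protocol, the helper below computes the fraction of prompts on which a policy's response is scored above the SFT target; the reward_model callable is a hypothetical stand-in, not an API from the paper.

```python
from typing import Callable, Sequence

def proxy_win_rate(prompts: Sequence[str],
                   policy_responses: Sequence[str],
                   sft_targets: Sequence[str],
                   reward_model: Callable[[str, str], float]) -> float:
    # Count prompts where the proxy reward model scores the policy response
    # above the SFT target, then normalize to a win rate in [0, 1].
    wins = sum(
        reward_model(p, y) > reward_model(p, t)
        for p, y, t in zip(prompts, policy_responses, sft_targets)
    )
    return wins / len(prompts)
```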

Concluding Remarks

Listwise Preference Optimization (LiPO) brings forth a nuanced approach to aligning LMs with human preferences. Its incorporation of Learning-to-Rank techniques both simplifies and strengthens the alignment process. The superior results of LiPO-λ substantiate its potential as a powerful tool for refining LLMs for real-world deployment, ushering in a more efficient phase for model alignment techniques. Future work offers numerous possibilities, from deeper theoretical analysis of LambdaLoss's effectiveness in LM alignment to online learning strategies that further reduce distribution shift.

Authors (12)
  1. Tianqi Liu (49 papers)
  2. Zhen Qin (105 papers)
  3. Junru Wu (23 papers)
  4. Jiaming Shen (56 papers)
  5. Misha Khalman (9 papers)
  6. Rishabh Joshi (23 papers)
  7. Yao Zhao (272 papers)
  8. Mohammad Saleh (19 papers)
  9. Simon Baumgartner (10 papers)
  10. Jialu Liu (21 papers)
  11. Peter J. Liu (30 papers)
  12. Xuanhui Wang (36 papers)
Citations (38)