
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences (2402.08925v1)

Published 14 Feb 2024 in cs.CL, cs.AI, cs.LG, and cs.RO

Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns LLMs to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale LLMs (with Tulu2-7B) and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings in this work are not only limited to LLMs but also extend to reinforcement learning in general.
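
For concreteness, the Egalitarian (MaxMin) alignment objective described above can be sketched as follows. This is an illustrative formulation based only on the abstract, assuming K group-level reward models \(r_1,\dots,r_K\) recovered by the EM step over the preference data, a reference policy \(\pi_{\mathrm{ref}}\), and a KL coefficient \(\beta\) as in standard RLHF; the paper's exact notation may differ:

% Hedged sketch of the MaxMin alignment objective; r_1,...,r_K, beta, and
% pi_ref are assumed notation, not quoted from the paper.
\[
  \max_{\pi}\; \min_{k \in \{1,\dots,K\}}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[ r_k(x, y) \bigr]
  \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\bigl[ \mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr) \bigr]
\]

Maximizing the worst-case group reward, rather than the average, is what underlies the reported gains for minority preference groups without degrading majority-group performance.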

Authors (8)
  1. Souradip Chakraborty
  2. Jiahao Qiu
  3. Hui Yuan
  4. Alec Koppel
  5. Furong Huang
  6. Dinesh Manocha
  7. Amrit Singh Bedi
  8. Mengdi Wang
Citations (48)