Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF (2312.08358v2)

Published 13 Dec 2023 in cs.LG, cs.AI, and stat.ML

Abstract: In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and the combination of data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that differ markedly from those of other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context
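The abstract's central theoretical claim, that standard preference learning implicitly aggregates over hidden context via Borda count rather than expected utility, can be made concrete with a small self-contained sketch. The example below is not taken from the paper or its codebase; the annotator groups, their weights, and their utilities are invented solely to show how the two aggregation rules can rank alternatives differently.

```python
# Toy illustration (not from the paper's codebase): why preference learning
# with hidden context can aggregate like Borda count rather than expected
# utility. The groups, weights, and utilities below are invented.

alternatives = ["A", "B", "C"]

# Hidden context: which annotator group produced a comparison. Weights sum to 1.
groups = {
    "majority": {"weight": 0.7, "utility": {"A": 1.0, "B": 0.8, "C": 0.0}},
    "minority": {"weight": 0.3, "utility": {"A": 0.0, "B": 0.2, "C": 10.0}},
}

def expected_utility(x):
    """Average utility of alternative x over the hidden context."""
    return sum(g["weight"] * g["utility"][x] for g in groups.values())

def win_prob(x, y):
    """P(annotator prefers x over y) when the annotator's group is unobserved."""
    return sum(g["weight"] * float(g["utility"][x] > g["utility"][y])
               for g in groups.values())

def borda_style_score(x):
    """Average probability that x beats a uniformly random other alternative.
    Per the paper's result, standard preference learning (e.g. Bradley-Terry
    reward modeling in RLHF) ends up ranking alternatives by a quantity of
    this form rather than by expected utility."""
    others = [y for y in alternatives if y != x]
    return sum(win_prob(x, y) for y in others) / len(others)

for x in alternatives:
    print(f"{x}: expected utility = {expected_utility(x):.2f}, "
          f"Borda-style score = {borda_style_score(x):.2f}")

# Expected utility ranks C first (the minority values it enormously), while
# the Borda-style aggregation ranks C last, because C loses 70% of its
# pairwise comparisons. This is the kind of counter-intuitive divergence the
# abstract describes.
```

With these made-up numbers the two rules disagree on which alternative is best. The distributional preference learning methods proposed in the paper address this by estimating a distribution of possible score values for each alternative rather than a single point estimate, so that disagreement arising from hidden context appears as spread in the predicted scores instead of being averaged away.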
