
Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback (2404.10271v2)

Published 16 Apr 2024 in cs.LG, cs.AI, cs.CL, cs.CY, and cs.GT

Abstract: Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, such as helping to commit crimes or producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.


Summary

  • The paper introduces RLCHF, a framework that uses social choice theory to aggregate diverse individual feedback into collective signals for AI alignment.
  • It critically evaluates RLHF and Constitutional AI, highlighting the limitations of these methods in capturing diverse human preferences.
  • It argues that computational social choice can improve the fairness, accuracy, and inclusivity of feedback collection and aggregation when training large language models.

Social Choice for AI Alignment: A Framework for Incorporating Diverse Human Feedback

Introduction to the Challenge

Fine-tuning LLMs with human feedback has surfaced significant challenges, especially around the diversity and potential divergence of that feedback. This paper discusses the relevance and application of social choice theory as a structured approach to these problems. Specifically, it scrutinizes the difficulties inherent in reinforcement learning from human feedback (RLHF) and argues for more principled methods of aligning LLMs with collective human values and preferences.

Value Alignment and RLHF: Current State

Value alignment in AI systems focuses on ensuring that AI behaves in ways consistent with human values. Reinforcement learning from human feedback (RLHF) has been central to aligning pretrained LLMs with these values. However, RLHF faces significant limitations, including unrepresentative data, oversimplified models of human decision-making, and a lack of consideration for human diversity. The paper critically evaluates RLHF and Constitutional AI as the prevailing methodologies, arguing that they fall short of capturing and reflecting collective human preferences in model behavior.
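
As a concrete reference point for how such feedback is typically consumed, the sketch below shows a Bradley-Terry-style pairwise preference loss of the kind commonly used when training RLHF reward models. It is a minimal illustration with made-up outputs and reward scores, not code from the paper.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the 'chosen' output is preferred,
    under a Bradley-Terry model over scalar reward scores."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy preference dataset: each record is one annotator's comparison of two outputs.
comparisons = [
    {"chosen": "output_a", "rejected": "output_b"},
    {"chosen": "output_a", "rejected": "output_c"},
    {"chosen": "output_b", "rejected": "output_c"},
]

# Hypothetical scores that a learned reward model might assign to each output.
reward = {"output_a": 1.2, "output_b": 0.4, "output_c": -0.3}

total = sum(bradley_terry_loss(reward[c["chosen"]], reward[c["rejected"]])
            for c in comparisons)
print(f"mean preference loss: {total / len(comparisons):.4f}")
```

Note that the loss treats all comparisons identically, regardless of which annotator produced them, which is precisely where questions about whose preferences count arise.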

Constitutional AI and the Promise of Social Choice Theory

The paper contrasts RLHF with Constitutional AI (CAI), in which AI training is guided by high-level, human-written principles. It argues that both methods inadequately address how diverse human input should be aggregated into a coherent set of guidelines for AI behavior, a gap that social choice theory is well positioned to bridge. By leveraging social choice, the paper claims, we can avoid the pitfalls of naïve aggregation, such as cyclical preferences or inconsistencies, and ensure that AI systems better represent collective human judgments.
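
The cyclical-preferences pitfall is easy to reproduce. The snippet below (an illustrative example, not taken from the paper) aggregates three internally consistent annotator rankings by pairwise majority and arrives at a Condorcet cycle.

```python
from itertools import combinations

# Three annotators rank three candidate model responses, best to worst.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of rankings place x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y, rankings):
        print(f"majority prefers {x} over {y}")
    elif majority_prefers(y, x, rankings):
        print(f"majority prefers {y} over {x}")

# Prints A over B, C over A, and B over C: a cycle, even though every
# individual ranking is perfectly consistent.
```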

The Role of Computational Social Choice in AI Alignment

Computational social choice offers a rich toolkit for aggregating individual preferences, judgments, or principles into collective decisions. The paper argues for applying it to key questions in AI alignment: identifying the relevant stakeholders to solicit feedback from, deciding what form that feedback should take, and aggregating it into collective decisions about model behavior. Through computational social choice, concerns about fairness, accuracy, and inclusivity in feedback collection and aggregation can be addressed systematically.
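
As one small illustration of that toolkit, the sketch below applies the Borda count, one rule among many studied in the field, to hypothetical annotator rankings of candidate model responses; the candidate names and rankings are invented for the example.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Borda count: a candidate in position p of a ranking over m
    candidates receives m - 1 - p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return dict(scores)

# Hypothetical rankings of four candidate responses by five annotators.
rankings = [
    ["safe_refusal", "hedged_answer", "direct_answer", "joke"],
    ["hedged_answer", "safe_refusal", "direct_answer", "joke"],
    ["direct_answer", "hedged_answer", "safe_refusal", "joke"],
    ["hedged_answer", "direct_answer", "joke", "safe_refusal"],
    ["safe_refusal", "hedged_answer", "joke", "direct_answer"],
]

scores = borda_scores(rankings)
print(scores)
print("winner:", max(scores, key=scores.get))
# -> 'hedged_answer' wins, despite being ranked first by only two of five annotators.
```

Which rule to use is itself a substantive choice: different rules satisfy different fairness and consistency properties, which is exactly the kind of trade-off social choice theory makes explicit.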

Novel Frameworks: From RLHF to RLCHF and Beyond

The paper introduces Reinforcement Learning from Collective Human Feedback (RLCHF) as a novel framework that integrates social choice directly into the RLHF process. This integration allows for the aggregation of individual judgments into a collective feedback mechanism before fine-tuning AI models, potentially resulting in fairer and more representative AI systems. Furthermore, it explores the potential of Supervised Learning from Simulated Collective Decisions (SLSCD) as an approach that utilizes social choice theory not just for preference aggregation but for simulating collective decisions to guide AI behavior directly.
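
The sketch below illustrates one possible reading of that "aggregate first, then fine-tune" ordering: per-annotator pairwise judgments are collapsed into collective labels by simple majority before they would enter a standard reward-modeling step. The dataset and the choice of majority rule are assumptions made for illustration; the paper surveys many aggregation rules rather than prescribing one.

```python
from collections import Counter, defaultdict

# Hypothetical per-annotator pairwise judgments: (prompt_id, chosen, rejected).
raw_feedback = [
    ("p1", "A", "B"), ("p1", "A", "B"), ("p1", "B", "A"),
    ("p2", "C", "D"), ("p2", "D", "C"), ("p2", "D", "C"),
]

def majority_aggregate(feedback):
    """Collapse individual judgments into one collective (chosen, rejected)
    label per prompt and pair, using a simple majority vote."""
    votes = defaultdict(Counter)
    for prompt, chosen, rejected in feedback:
        votes[(prompt, frozenset((chosen, rejected)))][chosen] += 1
    collective = []
    for (prompt, pair), counts in votes.items():
        winner, _ = counts.most_common(1)[0]
        loser = next(o for o in pair if o != winner)
        collective.append((prompt, winner, loser))
    return collective

print(majority_aggregate(raw_feedback))
# -> [('p1', 'A', 'B'), ('p2', 'D', 'C')]; these collective labels would then
#    feed the usual reward-model training and fine-tuning pipeline.
```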

Key Concepts in Social Choice Relevant to AI

Highlighting the relevance of concepts such as independence of clones, strategic voting, anonymity, and principles as voters, the paper discusses how these traditional social choice considerations can meaningfully inform the design and implementation of AI systems aligned with collective human values. It also hints at exploring cooperative AI to manage the potential interaction between multiple AIs trained with differing collective inputs.
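
To see why a property such as independence of clones matters here, the toy example below (with hypothetical candidate policies, not drawn from the paper) shows plurality voting flipping its winner once a near-duplicate of the leading option is added.

```python
from collections import Counter

def plurality_winner(ballots):
    """Plurality: only each voter's top choice counts; most votes wins."""
    return Counter(ballot[0] for ballot in ballots).most_common(1)[0][0]

# Seven voters choose between two candidate model policies.
before = [["strict", "lenient"]] * 4 + [["lenient", "strict"]] * 3
print(plurality_winner(before))  # -> 'strict' (4 votes to 3)

# Add a near-duplicate ("clone") of the winning policy; its supporters split.
after = (
    [["strict", "strict_v2", "lenient"]] * 2
    + [["strict_v2", "strict", "lenient"]] * 2
    + [["lenient", "strict", "strict_v2"]] * 3
)
print(plurality_winner(after))   # -> 'lenient', although no voter's preference
                                 #    between the original two options changed
```

Rules such as Tideman's ranked pairs are designed to be clone-independent, which is one reason the choice of aggregation rule matters when near-duplicate model behaviors or principles appear as options.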

Addressing Behavioral and Multi-Agent Considerations

The paper acknowledges the complexity introduced by human behavioral factors in preference elicitation and the potential for strategic manipulation of feedback. It suggests further research to understand and mitigate these effects. Additionally, it contemplates the scenario of navigating interactions between multiple AIs aligned to different collective preferences, emphasizing the need for cooperation and conflict avoidance.

Conclusion and Path Forward

In closing, the paper urges a multidisciplinary effort spanning AI ethics, AI safety research, and social choice theory to develop principled and practical methods for incorporating diverse human preferences into AI systems. By systematically applying insights from social choice, the authors argue, the field can make significant strides toward AI systems that genuinely reflect the broad spectrum of human values and preferences.