Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback (2404.10271v2)
Abstract: Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, such as helping to commit crimes or producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.
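To make the aggregation question concrete, here is a minimal sketch (not taken from the paper) of one classical social-choice answer: several annotators rank the same candidate model outputs, and their diverging rankings are combined into a single "collective" ranking with the Borda rule. All names and data below (e.g. `annotator_rankings`, outputs A–D) are hypothetical illustrations, not part of the paper's method.

```python
# Hypothetical illustration: aggregating diverging annotator rankings
# over model outputs with the Borda rule, a classic voting rule from
# social choice theory.
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine individual rankings (best-first lists over the same outputs)
    into one collective ranking via Borda scores."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, output in enumerate(ranking):
            scores[output] += n - 1 - position  # top choice earns n-1 points
    return sorted(scores, key=scores.get, reverse=True)

# Three annotators disagree about four candidate model responses.
annotator_rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["B", "C", "A", "D"],
]

print(borda_aggregate(annotator_rankings))  # -> ['B', 'A', 'C', 'D']
```

Any other voting rule (e.g. Copeland or a Condorcet-consistent method) could be swapped in; which rule is appropriate, and with what guarantees, is exactly the kind of question the paper argues social choice theory should inform.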