
A Roadmap to Pluralistic Alignment (2402.05070v3)

Published 7 Feb 2024 in cs.AI, cs.CL, and cs.IR

Abstract: With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using LLMs as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.


Summary

  • The paper introduces pluralistic alignment strategies by outlining Overton, steerable, and distributional approaches to capture diverse human values.
  • The paper demonstrates that standard alignment methods like RLHF can narrow response diversity, highlighting the need for more varied benchmarks.
  • The paper advocates for refined evaluations and pluralistic frameworks to develop AI models that accurately reflect a spectrum of human perspectives.

The Concept of Pluralism in AI Systems

Introduction to Pluralism

Aligning AI systems with human values has taken center stage in current research, given the complexity and diversity of human perspectives. Attempts to tailor AI responses to an averaged preference often overlook this diversity, creating an urgent need for AI models that accommodate a broad spectrum of values, in other words, pluralistic systems. A pluralistic AI system can present an array of reasonable responses, adjust its outputs to reflect specific perspectives, and accurately represent the distribution of views in a given population.

Defining Pluralism in AI

Efforts to formalize pluralism in AI models have proposed three main approaches: Overton pluralism, in which a model presents a spectrum of reasonable responses; steerable pluralism, in which a model can be steered to reflect particular attributes or perspectives; and distributional pluralism, in which model outputs are calibrated to a given population's distribution of views. Evaluation can likewise be made pluralistic through three classes of benchmarks: multi-objective benchmarks, trade-off steerable benchmarks that measure a model's ability to steer among arbitrary trade-offs, and jury-pluralistic benchmarks that explicitly model a diverse range of human ratings.
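
As a concrete illustration of the distributional notion, one way it might be quantified is as the divergence between a model's answer distribution on a survey-style question and a reference population's distribution. The sketch below is not from the paper; the option set and both distributions are hypothetical.

```python
# Minimal sketch (hypothetical data): measuring distributional pluralism as the
# Jensen-Shannon distance between a model's answer distribution and a reference
# population's answer distribution on a single multiple-choice question.
import numpy as np
from scipy.spatial.distance import jensenshannon

options = ["agree", "neutral", "disagree"]

# Hypothetical population distribution (e.g., from a representative survey).
population = np.array([0.45, 0.20, 0.35])

# Hypothetical model distribution (e.g., estimated from repeated sampling or
# from the model's option-level probabilities).
model = np.array([0.80, 0.15, 0.05])

# With base=2 the distance is bounded in [0, 1]:
# 0.0 = perfectly calibrated to the population, 1.0 = maximally mismatched.
js_distance = jensenshannon(population, model, base=2)
print(f"Jensen-Shannon distance: {js_distance:.3f}")
```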

Empirical evidence suggests that existing alignment strategies may inadvertently diminish distributional pluralism: models trained with standard procedures such as Reinforcement Learning from Human Feedback (RLHF) tend to concentrate probability on a narrower set of answers, deviating from the broader distribution of human responses.
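
This kind of narrowing could be probed, for instance, by comparing the spread of sampled answers from a base model and from its aligned counterpart on the same subjective question. The sketch below uses hypothetical sample counts, not results from the paper.

```python
# Minimal sketch (hypothetical counts): normalized entropy of sampled answers
# from a base model vs. an RLHF-tuned model on one subjective question.
# Lower entropy after tuning is consistent with reduced distributional pluralism.
from collections import Counter
from math import log2

def normalized_entropy(samples):
    counts = Counter(samples)
    total = len(samples)
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / log2(len(counts)) if len(counts) > 1 else 0.0

base_samples = ["A"] * 40 + ["B"] * 35 + ["C"] * 25   # hypothetical base model
rlhf_samples = ["A"] * 88 + ["B"] * 10 + ["C"] * 2    # hypothetical RLHF model

print(f"base model entropy: {normalized_entropy(base_samples):.2f}")
print(f"RLHF model entropy: {normalized_entropy(rlhf_samples):.2f}")
```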

The Relationship Between Alignment Techniques and Pluralism

Current alignment practices such as RLHF optimize models to maximize preferences aggregated from limited pools of annotators, and in doing so often flatten the variance among human judgments. Even so, current alignment techniques show some capacity for Overton pluralism, at least to the degree that the underlying preference data allows. LLMs also exhibit a form of steerable pluralism, though further assessment is needed to evaluate this property comprehensively. In particular, the methodologies for constructing pluralistic benchmarks, and the degree of pluralism current LLMs actually achieve on them, require deeper investigation.
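
One simple way steerability could be probed is to prepend different perspective instructions to the same question and check whether the model's answer distribution shifts accordingly. The sketch below assumes a hypothetical `generate` callable standing in for a model API; the question and perspective framings are illustrative only.

```python
# Minimal sketch of a steerability probe, assuming a hypothetical `generate`
# callable that returns one sampled answer string per prompt.
from collections import Counter

def generate(prompt: str) -> str:
    raise NotImplementedError  # plug in your model call here

QUESTION = "Should a city prioritize bike lanes over street parking? Answer yes or no."
PERSPECTIVES = {
    "urbanist": "Answer as someone who prioritizes public transit and cycling.",
    "driver": "Answer as someone who commutes daily by car.",
}

def steered_distribution(perspective_instruction: str, n: int = 20) -> Counter:
    # Sample n answers under one perspective framing and tally them.
    prompt = f"{perspective_instruction}\n{QUESTION}"
    return Counter(generate(prompt).strip().lower() for _ in range(n))

# A steerably pluralistic model should shift its answer distribution when the
# perspective instruction changes:
# for name, instruction in PERSPECTIVES.items():
#     print(name, steered_distribution(instruction))
```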

Future Research Directions

The roadmap toward pluralistic AI emphasizes the need for more refined evaluations and for pluralistic benchmark frameworks that give a comprehensive picture of a model's behavior. It also foregrounds the importance of normative discussion about what and whom we align AI systems to, and what bounds of customization are acceptable. Further research is needed on alignment techniques that can effectively produce more pluralistically aligned models.

In essence, this paper stands as a pivotal effort in charting the course for the creation and measurement of AI systems that genuinely resonate with, and respect, the diverse values and perspectives that shape human societies.