Strong and weak alignment of large language models with human values (2408.04655v2)
Abstract: Minimizing the negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires that they be able to align with human values. However, most current work addresses this issue only from a technical point of view, e.g., improving methods based on reinforcement learning from human feedback, and neglects what alignment means and what it requires. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (whether human-like or not) such as understanding and reasoning about agents' intentions and about their capacity to causally produce desired effects. We argue that this is required for AI systems such as LLMs to recognize situations in which human values risk being flouted. To illustrate this distinction, we present a series of prompts showing the failures of ChatGPT, Gemini and Copilot to recognize some of these situations. We further analyze word embeddings to show that the nearest neighbors of some human-value terms in LLMs differ from humans' semantic representations. We then propose a new thought experiment, which we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal. Finally, we point to promising current research directions towards weak alignment, which could produce statistically satisfying answers in a number of common situations, though so far without any guarantee of truth.
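To illustrate the kind of word-embedding analysis mentioned in the abstract, the following minimal Python sketch ranks a vocabulary by cosine similarity to a value term and returns its nearest neighbors. This is only an illustrative sketch, not the authors' actual pipeline: the nearest_neighbors helper, the toy random embedding table, and the value term "dignity" are hypothetical placeholders; a real analysis would load an LLM's input embeddings or a pretrained word-embedding model instead.

    import numpy as np

    def nearest_neighbors(query, embeddings, k=5):
        # Rank all other words by cosine similarity to the query word's vector.
        q = embeddings[query] / np.linalg.norm(embeddings[query])
        scores = {
            w: float(np.dot(q, v / np.linalg.norm(v)))
            for w, v in embeddings.items()
            if w != query
        }
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # Toy embedding table with random vectors (assumption, for illustration only);
    # a real analysis would use embeddings learned by a language model.
    rng = np.random.default_rng(0)
    vocab = ["dignity", "respect", "honor", "freedom", "table", "chair"]
    toy_embeddings = {w: rng.normal(size=50) for w in vocab}

    print(nearest_neighbors("dignity", toy_embeddings, k=3))

Comparing such neighbor lists across models, and against human semantic norms, is one simple way to probe whether a model's representation of a value term matches human usage.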
- Global catastrophic risks (Oxford University Press, USA, 2011).
- Rahwan, I. et al. Machine behaviour. Nature 568, 477–486 (2019).
- Klein, N. AI machines aren’t ‘hallucinating’. But their makers are. The Guardian (2023).
- Dennett, D. The problem with counterfeit people. The Atlantic (2023).
- Ji, J. et al. AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852 (2023).
- Christiano, P. F. et al. Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017).
- Evaluating the moral beliefs encoded in LLMs. Advances in Neural Information Processing Systems 36 (2024).
- Schwartz, S. H. Are there universal aspects in the structure and contents of human values? Journal of social issues 50, 19–45 (1994).
- Petit traité des valeurs [A short treatise on values] (2018).
- Moral molecules: Morality as a combinatorial system. Review of Philosophy and Psychology 13, 1039–1058 (2022).
- Basic human values and moral foundations theory in ValueNet ontology (2022).
- What are human values, and how do we align AI to them? arXiv preprint arXiv:2404.10636 (2024).
- Floridi, L. AI as agency without intelligence: on ChatGPT, large language models, and other generative models. Philosophy & Technology 36, 15 (2023).
- Large language models: The need for nuance in current debates and a pragmatic perspective on understanding. arXiv preprint arXiv:2310.19671 (2023).
- On the dangers of stochastic parrots: Can language models be too big? (2021).
- Harnad, S. The symbol grounding problem. Physica D: Nonlinear Phenomena 42, 335–346 (1990).
- Generating meaning: active inference and the scope and limits of passive AI. Trends in Cognitive Sciences (2023).
- FFAB—the form function attribution bias in human–robot interaction. IEEE Transactions on Cognitive and Developmental Systems 10, 843–851 (2018).
- Anthropomorphism in AI. AJOB Neuroscience 11, 88–95 (2020).
- Korteling, J. H. Human- versus artificial intelligence. Frontiers in artificial intelligence 4, 622364 (2021).
- Araujo, T. Living up to the chatbot hype: The influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions. Computers in human behavior 85, 183–189 (2018).
- Do we collaborate with what we design? Topics in Cognitive Science (2023).
- Accountability and automation bias. International Journal of Human-Computer Studies 52, 701–717 (2000).
- Cummings, M. L. Automation bias in intelligent time critical decision support systems (2017).
- Sourdin, T. Judge v robot?: Artificial intelligence and judicial decision-making. University of New South Wales Law Journal 41, 1114–1133 (2018).
- Hellman, D. Measuring algorithmic fairness. Virginia Law Review 106, 811–866 (2020).
- Machine bias (2022).
- Christian, B. The alignment problem: How can machines learn human values? (Atlantic Books, 2021).
- Chen, Z. Ethics and discrimination in artificial intelligence-enabled recruitment practices. Humanities and Social Sciences Communications 10, 1–12 (2023).
- King, M. R. & ChatGPT. A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cellular and molecular bioengineering 16, 1–2 (2023).
- Searle, J. R. Minds, brains, and programs. Behavioral and brain sciences 3, 417–424 (1980).
- Gabriel, I. Artificial intelligence, values, and alignment. Minds and Machines 30, 411–437 (2020). URL https://api.semanticscholar.org/CorpusID:210920551.
- Russell, S. Human Compatible: Artificial Intelligence and the Problem of Control (Penguin Publishing Group, 2019). URL https://books.google.fr/books?id=M1eFDwAAQBAJ.
- The book of why: the new science of cause and effect (Basic Books, 2018).
- The effects of reward misspecification: Mapping and mitigating misaligned models. ArXiv abs/2201.03544 (2022). URL https://api.semanticscholar.org/CorpusID:245837268.
- Lindell, N. B. The dignity canon. Cornell JL & Public Policy 27, 415 (2017).
- Building machines that learn and think like people. Behavioral and brain sciences 40, e253 (2017).
- Chatila, R. et al. Toward self-aware robots. Frontiers in Robotics and AI 5, 88 (2018).
- LeCun, Y. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review 62 (2022).
- L’action [Action]. La cognition : du neurone à la société [Cognition: from neuron to society] (2018).
- Steward, H. A metaphysics for freedom (Oxford University Press, 2012).
- Artificial agency and large language models. Intellectica 81 (2024).
- Walsh, D. M. Organisms, agency, and evolution (Cambridge University Press, 2015).
- A stochastic process model for free agency under indeterminism. dialectica 72, 219–252 (2018).
- Swanepoel, D. Does artificial intelligence have agency? The mind-technology problem: Investigating minds, selves and 21st century artefacts 83–104 (2021).
- Deep learning for AI. Communications of the ACM 64, 58–65 (2021).
- Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences 120, e2218523120 (2023).
- Evers, K. Can we be epigenetically proactive? (Johannes Gutenberg-Universität Mainz Frankfurt am Main, 2016).
- Gandhi, M. K. & Desai, M. H. An autobiography, or, The story of my experiments with truth (Navajivan Publishing House, 1927).
- Word meaning in minds and machines. Psychological review 130, 401 (2023).
- Kapoor, I. Celebrity humanitarianism: The ideology of global charity (Routledge, 2012).
- Le paradoxe de Simpson illustré par des données de vaccination contre le Covid-19 [Simpson's paradox illustrated with COVID-19 vaccination data]. The Conversation (2021). URL https://theconversation.com/le-paradoxe-de-simpson-illustre-par-des-donnees-de-vaccination-contre-le-covid-19-170159.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019).
- Bian, N. et al. ChatGPT is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. arXiv preprint arXiv:2303.16421 (2023).
- Momennejad, I. et al. Evaluating cognitive maps and planning in large language models with CogEval. Advances in Neural Information Processing Systems 36 (2024).
- Liu, H. et al. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv preprint arXiv:2304.03439 (2023).
- Word embeddings: A survey. ArXiv abs/1901.09069 (2019). URL https://api.semanticscholar.org/CorpusID:59316955.
- Arguments, more than confidence, explain the good performance of reasoning groups. Journal of Experimental Psychology: General 143, 1958 (2014).
- The enigma of reason (Harvard University Press, 2017).
- Kahneman, D. Thinking, fast and slow (Macmillan, 2011).
- Beyond dichotomies in reinforcement learning. Nature Reviews Neuroscience 21, 576–586 (2020).
- Inhibitory control as a core process of creative problem solving and idea generation from childhood to adulthood. New directions for child and adolescent development 2016, 61–72 (2016).
- Khamassi, M. et al. Meta-learning, cognitive control, and physiological interactions between medial and lateral prefrontal cortex. In Mars, R., Sallet, J., Rushworth, M. & Yeung, N. (eds) Neural Bases of Motivational and Cognitive Control (2011).
- Caluwaerts, K. et al. A biologically inspired meta-control navigation system for the Psikharpax rat robot. Bioinspiration & biomimetics 7, 025009 (2012).
- Motivational control of goal-directed action. Animal learning & behavior 22, 1–18 (1994).
- Baldassarre, G. et al. Purpose for open-ended learning robots: A computational taxonomy, definition, and operationalisation. arXiv preprint arXiv:2403.02514 (2024).
- Gopnik, A. et al. A theory of causal learning in children: causal maps and Bayes nets. Psychological review 111, 3 (2004).
- Infants infer social relationships between individuals who engage in imitative social interactions. Open Mind 8, 202–216 (2024).
- Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Huneman, P. D’une connaissance qui serait du semblant : grands modèles de langage et hypothèse Replika [On a knowledge that would be mere semblance: large language models and the Replika hypothesis]. Intellectica 81 (2024). In press.
- Becker, J. D. The phrasal lexicon (1975).
- Peters, A. M. The units of language acquisition Vol. 1 (CUP Archive, 1983).
- The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees. Neuron 88, 2–19 (2015).
- Arrieta, A. B. et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information fusion 58, 82–115 (2020).
- Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in cognitive sciences 17, 585–593 (2013).
- Friston, K. et al. Active inference and epistemic value. Cognitive neuroscience 6, 187–214 (2015).
- Stick to your role! stability of personal values expressed in large language models. arXiv preprint arXiv:2402.14846 (2024).
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
- Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022).
- Lethal autonomous weapon systems [ethical, legal, and societal issues]. IEEE Robotics & Automation Magazine 25, 123–126 (2018).
- Cummings, M. L. Artificial intelligence and the future of warfare (Chatham House for the Royal Institute of International Affairs, London, 2017).
- Ben-Elia, E. An exploratory real-world wayfinding experiment: A comparison of drivers’ spatial learning with a paper map vs. turn-by-turn audiovisual route guidance. Transportation Research Interdisciplinary Perspectives 9, 100280 (2021).
- Heersmink, R. Use of large language models might affect our cognitive skills. Nature Human Behaviour 1–2 (2024).