Language Models Represent Beliefs of Self and Others (2402.18496v3)

Published 28 Feb 2024 in cs.AI and cs.CL

Abstract: Understanding and attributing mental states, known as Theory of Mind (ToM), emerges as a fundamental capability for human social reasoning. While LLMs appear to possess certain ToM abilities, the mechanisms underlying these capabilities remain elusive. In this study, we discover that it is possible to linearly decode the belief status from the perspectives of various agents through neural activations of LLMs, indicating the existence of internal representations of self and others' beliefs. By manipulating these representations, we observe dramatic changes in the models' ToM performance, underscoring their pivotal role in the social reasoning process. Additionally, our findings extend to diverse social reasoning tasks that involve different causal inference patterns, suggesting the potential generalizability of these representations.
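The core technique described is a linear probe: a classifier trained on a model's intermediate activations to predict the belief status of a given agent, whose learned weight vector can then be reused as a steering direction for the intervention experiments. Below is a minimal sketch of that recipe using scikit-learn; the activations, labels, dimensions, and intervention strength are all placeholders for illustration, not the paper's actual data or hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the paper's setup: one activation vector
# per ToM story (taken from some transformer layer) and a binary label for
# whether the target agent's belief is true (1) or false (0).
rng = np.random.default_rng(0)
activations = rng.normal(size=(512, 4096))    # (n_stories, hidden_dim)
belief_labels = rng.integers(0, 2, size=512)  # hypothetical annotations

X_train, X_test, y_train, y_test = train_test_split(
    activations, belief_labels, test_size=0.2, random_state=0
)

# Linear probe: if belief status is linearly decodable, a plain logistic
# regression on raw activations should beat chance on held-out stories.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")

# The probe's weight vector doubles as a candidate "belief direction":
# adding (or subtracting) it from the activations is one simple way to
# manipulate the representation, in the spirit of the intervention results.
belief_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
alpha = 5.0  # intervention strength, a hypothetical knob
steered = activations + alpha * belief_direction  # broadcasts over rows
```

Decoding beliefs "from the perspectives of various agents" then amounts to training separate probes on the same activations with per-agent labels (e.g., the protagonist's belief versus an omniscient observer's); that both are recoverable from one set of activations is what suggests distinct internal representations of self and others' beliefs.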

Authors (3)
  1. Wentao Zhu (73 papers)
  2. Zhining Zhang (2 papers)
  3. Yizhou Wang (162 papers)