
LLMs achieve adult human performance on higher-order theory of mind tasks (2405.18870v2)

Published 29 May 2024 in cs.AI, cs.CL, and cs.HC

Abstract: This paper examines the extent to which LLMs have developed higher-order theory of mind (ToM); the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.

Assessment of Higher-Order Theory of Mind in LLMs Using Multi-Order Benchmarks

The paper investigates higher-order Theory of Mind (ToM) capabilities in LLMs through a new benchmark suite, the Multi-Order Theory of Mind Question & Answer (MoToMQA), and evaluates five LLMs (LaMDA, PaLM, Flan-PaLM, GPT-3.5, and GPT-4) against adult human participants. Notably, GPT-4 and Flan-PaLM perform at or near adult human levels, with GPT-4 surpassing human performance on sixth-order ToM inferences.

Introduction to Theory of Mind and Literature Context

ToM is a critical aspect of human social intelligence, enabling individuals to reason about others' mental states and predict their behaviour. While previous research has reported ToM competencies in LLMs, most studies stop at second-order ToM. This paper extends the assessment up to sixth-order ToM through a novel benchmark, MoToMQA, comprising true/false statements about short-form stories.
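To make the notion of ToM "order" concrete: an order-n statement nests n mental-state attributions recursively. A minimal illustrative sketch of this recursion follows; the agent names, verbs, and embedded fact are hypothetical examples, not drawn from the MoToMQA stories.

```python
def nth_order_statement(agents, belief_verbs, fact, order):
    """Build an order-n nested mental-state statement.

    Order 1: "Anna thinks that <fact>"
    Order 2: "Anna thinks that Ben believes that <fact>", and so on,
    cycling through the given agents and verbs.
    """
    clause = fact
    # Wrap the fact in mental-state attributions, innermost first.
    for i in reversed(range(order)):
        agent = agents[i % len(agents)]
        verb = belief_verbs[i % len(belief_verbs)]
        clause = f"{agent} {verb} that {clause}"
    return clause

print(nth_order_statement(["Anna", "Ben"], ["thinks", "believes"],
                          "the meeting was moved", 3))
# → Anna thinks that Ben believes that Anna thinks that the meeting was moved
```

Each additional order adds one more embedded attribution, which is what makes sixth-order inferences substantially harder than second-order ones.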

Methodology and Benchmark Design

MoToMQA is based on a validated ToM test for adults, encompassing seven stories with 20 statements each. These statements are categorized into orders and levels to distinguish between mental inferences and factual recall. The benchmark features two prompt conditions for LLMs (human and simplified) and controls for memory in human participants by presenting stories with and without visibility during testing.

Results and Analysis

ToM Task Performance

A comparison of LLM and human performances revealed significant variations:

  • Best Performing Models: GPT-4 and Flan-PaLM demonstrated the strongest performance, with no significant difference between them; both significantly outperformed GPT-3.5, PaLM, and LaMDA.
  • Humans vs. Models: Humans outperformed Flan-PaLM but showed no significant difference from GPT-4, underscoring GPT-4's capability at higher-order ToM.
  • Order-Specific Analysis:
    • GPT-4 and Flan-PaLM maintained high accuracy across all orders, with performance generally declining as order increased.
    • Human performance was significantly higher at fifth order than at fourth order, suggesting a possible shift in cognitive strategy at higher orders.
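The paper's exact statistical procedures are not reproduced in this summary; as a rough illustration of how two accuracy proportions (say, a model versus the human benchmark on one order) can be compared for significance, a simple two-proportion z-test looks like this:

```python
import math

def two_proportion_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int):
    """Two-sided two-proportion z-test (normal approximation).

    Compares accuracy hits_a/n_a against hits_b/n_b under a pooled
    null proportion. Assumes 0 < pooled proportion < 1.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For instance, 80/100 versus 60/100 correct gives z ≈ 3.09 and p < 0.01, i.e. a significant difference, while identical proportions give z = 0 and p = 1.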

Factual Task Performance

Both GPT-4 and Flan-PaLM excelled at factual recall, significantly outperforming the other models while not differing significantly from human performance. This indicates that these models' capabilities extend beyond handling syntactic complexity to comprehension of factual content.

Comparative Analysis

Notably, both humans and LLMs performed better on factual tasks than on ToM tasks, consistent with prior studies. This suggests that ToM reasoning poses the greater challenge for models, likely because it requires generalising social knowledge from pretraining data rather than recalling stated facts.

Implications and Future Research

The research elucidates the potential of larger, fine-tuned models like GPT-4 and Flan-PaLM in achieving advanced ToM capabilities. This aligns with the notion of scaling laws, where model size and fine-tuning play pivotal roles in realizing ToM competencies. However, ethical considerations arise from such capabilities, including the potential for manipulation and the need for technical guardrails.

Continued investigations should:

  • Expand benchmarks to include multi-lingual and culturally diverse scenarios.
  • Extend the testing beyond sixth-order ToM to explore the limits of both human and model capacities.
  • Incorporate multimodal signals to better align LLM ToM assessments with real-world human interactions.

In conclusion, the paper presents compelling evidence of advanced ToM capabilities in state-of-the-art LLMs, particularly GPT-4, and outlines directions for future research to further understand and ethically leverage these capabilities.

Authors (10)
  1. Winnie Street
  2. John Oliver Siy
  3. Geoff Keeling
  4. Adrien Baranes
  5. Benjamin Barnett
  6. Michael McKibben
  7. Tatenda Kanyere
  8. Alison Lentz
  9. Blaise Aguera y Arcas
  10. Robin I. M. Dunbar