
ToMBench: Benchmarking Theory of Mind in Large Language Models (2402.15052v2)

Published 23 Feb 2024 in cs.CL and cs.AI

Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not yet achieved a human-level theory of mind. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.

Authors (11)
  1. Zhuang Chen (13 papers)
  2. Jincenzi Wu (5 papers)
  3. Jinfeng Zhou (15 papers)
  4. Bosi Wen (8 papers)
  5. Guanqun Bi (11 papers)
  6. Gongyao Jiang (4 papers)
  7. Yaru Cao (4 papers)
  8. Mengting Hu (20 papers)
  9. Yunghwei Lai (3 papers)
  10. Zexuan Xiong (2 papers)
  11. Minlie Huang (225 papers)