How FaR Are Large Language Models From Agents with Theory-of-Mind? (2310.03051v1)

Published 4 Oct 2023 in cs.CL and cs.AI

Abstract: "Thinking is for Doing." Humans can infer other people's mental states from observations--an ability called Theory-of-Mind (ToM)--and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models questions to make inferences about the beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for LLMs: Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but they struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying, without being explicitly asked as in ToMi, the implicit inferences about mental states that lead to choosing the correct action in T4D. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4's performance from 50% to 71% on T4D, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods including few-shot in-context learning.

Evaluating Theory-of-Mind in LLMs: The Thinking for Doing Paradigm

The evaluation of Theory-of-Mind (ToM) capabilities in LLMs is a crucial area of AI research, as ToM underlies a fundamental aspect of human social interaction. The paper "How FaR Are Large Language Models From Agents with Theory-of-Mind?" investigates this domain by introducing the "Thinking for Doing" (T4D) paradigm, a new evaluation framework for assessing LLMs' ability to connect mental-state reasoning with strategic action. The paper identifies and addresses a critical gap in current evaluation methods, which predominantly focus on ToM inference tasks without assessing models' capacity to act on inferred mental states.

Introduction to T4D

The authors propose T4D as a more comprehensive evaluation approach that requires LLMs to turn mental-state inferences into action decisions, capturing a realistic aspect of cognitive processing in interactive contexts. Unlike traditional benchmarks that typically end at inference, T4D compels models to decide on actions based on observational inputs: rather than answering an explicit inference question, the model must select the action that best serves the characters involved. This marks a transition from merely understanding a hypothetical social scenario to actively participating in it, as the contrast sketched below illustrates.
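To make the task format concrete, here is a minimal, paraphrased contrast between a ToMi-style inference item and a T4D-style action item. The story follows the classic Sally-Anne false-belief structure used by ToMi derivatives; the field names, wording, and Python representation are illustrative assumptions, not the paper's exact data format.

```python
# Illustrative contrast between a ToMi-style inference item and a
# T4D-style action item. The Sally-Anne story structure is standard;
# the dictionary layout and wording here are hypothetical.

story = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble to the box. "
    "Sally is about to return."
)

# ToMi-style: the question names the inference target explicitly.
tomi_item = {
    "story": story,
    "question": "Where will Sally look for the marble?",
    "choices": ["basket", "box"],
    "answer": "basket",  # Sally holds a false belief
}

# T4D-style: the needed false-belief inference is never asked for
# directly; the model must identify it on its own to act correctly.
t4d_item = {
    "story": story,
    "question": ("As an observer who wants to help, whom should you "
                 "inform about the marble's current location?"),
    "choices": ["Sally", "Anne"],
    "answer": "Sally",  # Anne already knows; Sally's belief is false
}
```

Both items hinge on the same false-belief inference, but only the ToMi item asks for it explicitly; in T4D, identifying that inference is left to the model, which is precisely where the paper finds LLMs struggle.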

Key Challenges Identified

The T4D framework reveals significant challenges for LLMs. The paper demonstrates that, although models like GPT-4 and PaLM 2 perform well on standard ToM inference tests such as the False Belief Test, they largely fail to translate these inferences into correct actions. The core difficulty lies in autonomously pinpointing the relevant mental-state inferences without explicit guidance, underscoring the complexity of the unconstrained decision-making that characterizes real-world social interaction.

Enhancing LLM Performance with FaR

To address these challenges, the authors introduce a novel zero-shot prompting framework, "Foresee and Reflect" (FaR), which structures the reasoning process of LLMs by prompting them to anticipate potential future scenarios and then reflect on appropriate actions. The empirical results indicate that FaR significantly boosts performance precisely where traditional methods fall short: GPT-4, for instance, improves from 50% to 71% accuracy on T4D when using FaR, outperforming other techniques such as Chain-of-Thought (CoT) and Self-Ask. A sketch of the prompt structure follows.
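As a rough illustration, the sketch below assembles a FaR-style prompt in the two stages described in the paper: foresee likely future events and each character's challenges, then reflect on which available action addresses them. The template wording, the helper function, and the commented-out `complete` call are assumptions for illustration, not the paper's verbatim prompt.

```python
# A minimal sketch of a Foresee-and-Reflect (FaR) style prompt,
# assembled from the paper's high-level description. The template
# wording and the `complete` stub are hypothetical.

FAR_TEMPLATE = """{story}

Question: {question}
Choices: {choices}

Foresee:
1. For each character, infer what they currently believe or know.
2. Predict the likely future events and the challenge each
   character may face given those beliefs.

Reflect:
3. For each choice, reason about whether taking that action would
   help a character overcome their predicted challenge.
4. Pick the single best action.

Answer with exactly one choice."""


def far_prompt(story: str, question: str, choices: list[str]) -> str:
    """Embed a T4D-style item in the FaR reasoning structure."""
    return FAR_TEMPLATE.format(
        story=story,
        question=question,
        choices=", ".join(choices),
    )


# Usage with the illustrative t4d_item above; `complete` stands in
# for any chat-completion API and is not defined here:
# answer = complete(far_prompt(t4d_item["story"],
#                              t4d_item["question"],
#                              t4d_item["choices"]))
```

Note that FaR is zero-shot: it supplies a reasoning scaffold rather than worked examples, which is what distinguishes it from the few-shot in-context learning baselines it outperforms.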

Generalization and Robustness

The robustness of FaR is further tested across diverse story structures and scenarios beyond the typical False Belief Test derivatives. Notably, FaR maintains its effectiveness on out-of-distribution tasks and novel ToM challenges (e.g., Faux Pas scenarios), demonstrating an ability to generalize beyond templated datasets. These findings suggest a promising direction for helping LLMs form coherent action strategies from social reasoning, an essential capability for applications that demand social awareness, such as virtual assistants and autonomous agents.

Conclusion and Future Directions

This work has significant implications for both theoretical and practical advances in AI. It illustrates the potential of structured prompting frameworks to improve decision-making in LLMs, setting the stage for future studies of structured reasoning processes in AI systems. A crucial next step is to examine the mechanisms that allow FaR-guided models to approximate human-like reasoning and decision-making, further bridging the gap between human cognitive faculties and artificial intelligence. The T4D framework and FaR hold promise for improving our understanding and construction of AI that can both think and act, elevating the role of ToM in artificial agents.

Authors (12)
  1. Pei Zhou
  2. Aman Madaan
  3. Srividya Pranavi Potharaju
  4. Aditya Gupta
  5. Kevin R. McKee
  6. Ari Holtzman
  7. Jay Pujara
  8. Xiang Ren
  9. Swaroop Mishra
  10. Aida Nematzadeh
  11. Shyam Upadhyay
  12. Manaal Faruqui