How FaR Are Large Language Models From Agents with Theory-of-Mind? (2310.03051v1)

Published 4 Oct 2023 in cs.CL and cs.AI

Abstract: "Thinking is for Doing." Humans can infer other people's mental states from observations--an ability called Theory-of-Mind (ToM)--and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models questions to make inferences about the beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for LLMs: Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but they struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying, without being explicitly asked as in ToMi, the implicit inferences about mental states that lead to choosing the correct action in T4D. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4's performance from 50% to 71% on T4D, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods including few-shot in-context learning.

Evaluating Theory-of-Mind in LLMs: The Thinking for Doing Paradigm

The evaluation of Theory-of-Mind (ToM) capabilities in LLMs is a crucial area of AI research, as ToM underlies a fundamental aspect of human social interaction. The paper "How FaR Are Large Language Models From Agents with Theory-of-Mind?" investigates this domain by introducing the "Thinking for Doing" (T4D) paradigm, a new evaluation framework for assessing LLMs' ability to connect mental-state reasoning with strategic action. The paper identifies and addresses a critical gap in current evaluation methods, which predominantly focus on ToM inference tasks without assessing models' capacity to act on inferred mental states.

Introduction to T4D

The authors propose T4D as a more comprehensive evaluation approach that requires LLMs to turn mental-state inferences into action decisions, capturing a realistic aspect of cognitive processing in interactive contexts. Unlike traditional benchmarks that typically end at inference, T4D compels models to decide on actions based on observational inputs: rather than answering an explicit inference question, the model must select the action that best serves the characters involved. This marks a transition from merely understanding a hypothetical social scenario to actively participating in it, as the contrast sketched below illustrates.
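To make the task format concrete, here is a minimal, paraphrased contrast between a ToMi-style inference item and a T4D-style action item. The story follows the classic Sally-Anne false-belief structure used by ToMi derivatives; the field names, wording, and Python representation are illustrative assumptions, not the paper's exact data format.

```python
# Illustrative contrast between a ToMi-style inference item and a
# T4D-style action item. The Sally-Anne story structure is standard;
# the dictionary layout and wording here are hypothetical.

story = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble to the box. "
    "Sally is about to return."
)

# ToMi-style: the question names the inference target explicitly.
tomi_item = {
    "story": story,
    "question": "Where will Sally look for the marble?",
    "choices": ["basket", "box"],
    "answer": "basket",  # Sally holds a false belief
}

# T4D-style: the needed false-belief inference is never asked for
# directly; the model must identify it on its own to act correctly.
t4d_item = {
    "story": story,
    "question": ("As an observer who wants to help, whom should you "
                 "inform about the marble's current location?"),
    "choices": ["Sally", "Anne"],
    "answer": "Sally",  # Anne already knows; Sally's belief is false
}
```

Both items hinge on the same false-belief inference, but only the ToMi item asks for it explicitly; in T4D, identifying that inference is left to the model, which is precisely where the paper finds LLMs struggle.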

Key Challenges Identified

The T4D framework reveals significant challenges for LLMs. The paper demonstrates that, although models like GPT-4 and PaLM 2 perform well on standard ToM inference tests such as the False Belief Test, they largely fail to translate these inferences into correct actions. The core difficulty lies in autonomously pinpointing the relevant mental-state inferences without explicit guidance, underscoring the complexity of the unconstrained decision-making that characterizes real-world social interaction.

Enhancing LLM Performance with FaR

To address these challenges, the authors introduce a novel zero-shot prompting framework, "Foresee and Reflect" (FaR), which structures the reasoning process of LLMs by prompting them to anticipate potential future scenarios and then reflect on appropriate actions. The empirical results indicate that FaR significantly boosts performance precisely where traditional methods fall short: GPT-4, for instance, improves from 50% to 71% accuracy on T4D when using FaR, outperforming other techniques such as Chain-of-Thought (CoT) and Self-Ask. A sketch of the prompt structure follows.
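As a rough illustration, the sketch below assembles a FaR-style prompt in the two stages described in the paper: foresee likely future events and each character's challenges, then reflect on which available action addresses them. The template wording, the helper function, and the commented-out `complete` call are assumptions for illustration, not the paper's verbatim prompt.

```python
# A minimal sketch of a Foresee-and-Reflect (FaR) style prompt,
# assembled from the paper's high-level description. The template
# wording and the `complete` stub are hypothetical.

FAR_TEMPLATE = """{story}

Question: {question}
Choices: {choices}

Foresee:
1. For each character, infer what they currently believe or know.
2. Predict the likely future events and the challenge each
   character may face given those beliefs.

Reflect:
3. For each choice, reason about whether taking that action would
   help a character overcome their predicted challenge.
4. Pick the single best action.

Answer with exactly one choice."""


def far_prompt(story: str, question: str, choices: list[str]) -> str:
    """Embed a T4D-style item in the FaR reasoning structure."""
    return FAR_TEMPLATE.format(
        story=story,
        question=question,
        choices=", ".join(choices),
    )


# Usage with the illustrative t4d_item above; `complete` stands in
# for any chat-completion API and is not defined here:
# answer = complete(far_prompt(t4d_item["story"],
#                              t4d_item["question"],
#                              t4d_item["choices"]))
```

Note that FaR is zero-shot: it supplies a reasoning scaffold rather than worked examples, which is what distinguishes it from the few-shot in-context learning baselines it outperforms.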

Generalization and Robustness

The robustness of FaR is further tested across diverse story structures and scenarios beyond the typical False Belief Test derivatives. Notably, FaR maintains its effectiveness on out-of-distribution tasks and novel ToM challenges (e.g., Faux Pas scenarios), demonstrating an ability to generalize beyond templated datasets. These findings suggest a promising direction for helping LLMs form coherent action strategies from social reasoning, an essential capability for applications that demand social awareness, such as virtual assistants and autonomous agents.

Conclusion and Future Directions

This work has significant implications for both theoretical and practical advances in AI. It illustrates the potential of structured prompting frameworks to improve decision-making in LLMs, setting the stage for future studies of structured reasoning processes in AI systems. A crucial next step is to examine the mechanisms that allow FaR-guided models to approximate human-like reasoning and decision-making, further bridging the gap between human cognitive faculties and artificial intelligence. The T4D framework and FaR hold promise for improving our understanding and construction of AI that can both think and act, elevating the role of ToM in artificial agents.

Authors (12)
  1. Pei Zhou
  2. Aman Madaan
  3. Srividya Pranavi Potharaju
  4. Aditya Gupta
  5. Kevin R. McKee
  6. Ari Holtzman
  7. Jay Pujara
  8. Xiang Ren
  9. Swaroop Mishra
  10. Aida Nematzadeh
  11. Shyam Upadhyay
  12. Manaal Faruqui