SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs (2410.13648v1)

Published 17 Oct 2024 in cs.CL and cs.AI

Abstract: While prior work has explored whether LLMs possess a "theory of mind" (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. We create a new dataset, SimpleToM, containing concise, diverse stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state ("Is Mary aware of the mold?"), (b) behavior ("Will Mary pay for the chips or report the mold?"), and (c) judgment ("Mary paid for the chips. Was that reasonable?"). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: While most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable (c), despite correctly inferring the protagonist's mental state, an awareness that should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% for GPT-4o). While this shows that models can be coaxed to perform well, it requires task-specific interventions, and natural model performance remains low, a cautionary tale for LLM deployment.

Insights into SimpleToM: Evaluating Theory of Mind Capabilities in LLMs

The paper "SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs" presents a novel approach to evaluating Theory of Mind (ToM) reasoning in LLMs. The work introduces a dataset, SimpleToM, designed to assess the models' ability to infer and apply knowledge of mental states through concise stories. This paper provides a comprehensive examination of explicit and applied ToM in LLMs, revealing significant insights into current model capabilities.

Dataset Design and Methodology

SimpleToM comprises 1147 stories accompanied by questions testing three levels of ToM reasoning: mental state inference, behavior prediction, and judgment of behavior. The dataset aims to extend beyond traditional ToM evaluations like the Sally-Anne task by encompassing a diverse range of scenarios where information asymmetry arises naturally. Stories are formatted to encourage models to infer mental states without explicit cues, thus testing the models' implicit reasoning abilities.
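
To make the three-tier structure concrete, a single SimpleToM item could be represented roughly as below, using the example from the abstract. This is an illustrative sketch only; the field names are assumptions, not the dataset's published schema.

```python
# Hypothetical representation of one SimpleToM item (field names are
# illustrative assumptions, not the dataset's actual schema).
item = {
    "story": (
        "The can of Pringles has moldy chips in it. Mary picks up the "
        "can in the supermarket and walks to the cashier."
    ),
    "questions": {
        # (a) explicit ToM: infer the protagonist's mental state
        "mental_state": "Is Mary aware of the mold?",
        # (b) applied ToM: predict behavior given that mental state
        "behavior": "Will Mary pay for the chips or report the mold?",
        # (c) applied ToM: judge whether an observed behavior is reasonable
        "judgment": "Mary paid for the chips. Was that reasonable?",
    },
}
```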

The stories were generated using multiple LLMs and rigorously filtered by human annotators to ensure quality. Each story is paired with three questions targeting, in turn, mental-state awareness, behavior prediction, and judgment of an observed behavior, spanning both explicit and applied ToM.

Key Findings

Performance Discrepancies

The evaluation of ten frontier LLMs on SimpleToM highlights a notable discrepancy between the models' capabilities in explicit and applied ToM tasks. While most models proficiently inferred mental states, evidenced by high accuracies in mental state questions, their performance significantly declined in predicting behavior and judging behavior appropriateness. For example, models like GPT-4o achieved over 95% accuracy in mental state inference but dropped to 49.5% in behavior prediction.
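
The per-tier gap can be quantified with a simple accuracy breakdown, as in the sketch below. The record format is an assumption made for illustration, not the paper's actual evaluation harness.

```python
from collections import defaultdict

def tiered_accuracy(results):
    """Compute accuracy separately for each question tier.

    `results` is a list of records such as
    {"tier": "behavior", "pred": "pay", "gold": "pay"} -- an assumed
    format for illustration, not the paper's actual harness.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["tier"]] += 1
        correct[r["tier"]] += int(r["pred"] == r["gold"])
    # A model can score well on "mental_state" while scoring far lower
    # on "behavior" and "judgment" -- the gap SimpleToM exposes.
    return {tier: correct[tier] / total[tier] for tier in total}
```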

Effects of Interventions

The paper explores interventions like mental state reminders and chain-of-thought (CoT) prompting to improve model performance on applied ToM tasks. While these interventions substantially boosted scores (e.g., GPT-4o's behavior prediction accuracy increased to 82.8% with intervention), the need for such aids underscores a gap in models' natural reasoning capabilities.
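
As a rough illustration of the reminder-style intervention, the sketch below first elicits the explicit mental-state answer and then restates it before the behavior question. The `ask` callable is a placeholder for any LLM call, and the prompt wording is an assumption rather than the paper's exact template.

```python
def remind_then_predict(story, mental_q, behavior_q, ask):
    """Mental-state reminder intervention (illustrative sketch).

    `ask` is a placeholder callable mapping a prompt string to a model
    completion; the prompt phrasing here is assumed, not the paper's.
    """
    # Step 1: the explicit ToM question, which frontier models answer
    # reliably (e.g., "Is Mary aware of the mold?").
    mental_answer = ask(f"{story}\n\nQuestion: {mental_q}\nAnswer:")

    # Step 2: restate that inference before asking about behavior, so
    # the model applies the mental state it already identified.
    reminder_prompt = (
        f"{story}\n\n"
        f"Reminder of your earlier answer about the protagonist's "
        f"mental state: {mental_answer}\n\n"
        f"Question: {behavior_q}\nAnswer:"
    )
    return ask(reminder_prompt)
```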

Scenario-Specific Performance

Performance varied across scenarios, indicating that certain contexts inherently challenge models more than others. For instance, in "provider info healthcare" scenarios, even lower-performing models achieved comparatively strong results, suggesting that training on safety-related topics may shape these capabilities.

Implications and Future Directions

The research suggests that while LLMs display competent explicit ToM reasoning, there is a critical need for models that can autonomously apply ToM insights without intervention. This highlights an essential area of focus for AI development, particularly for applications requiring intuitive social reasoning.

The introduction of SimpleToM opens pathways for further exploration into the complexities of ToM in AI. It suggests potential improvements in model architecture or training that account for nuanced reasoning tasks. Future research may leverage the dataset to explore interactions between scenario types, levels of reasoning, and model training methods.

Overall, the paper provides a detailed picture of the current limitations and potential directions for developing more socially aware AI systems. The insights from SimpleToM are vital for understanding the broader implications of deploying LLMs in environments that demand nuanced human-like reasoning.

Authors (7)
  1. Yuling Gu (16 papers)
  2. Oyvind Tafjord (49 papers)
  3. Hyunwoo Kim (52 papers)
  4. Jared Moore (12 papers)
  5. Ronan Le Bras (56 papers)
  6. Peter Clark (108 papers)
  7. Yejin Choi (287 papers)