Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks (2302.08399v5)

Published 16 Feb 2023 in cs.AI and cs.CL

Abstract: Intuitive psychology is a pillar of common-sense reasoning. The replication of this reasoning in machine intelligence is an important stepping-stone on the way to human-like artificial intelligence. Several recent tasks and benchmarks for examining this reasoning in Large Language Models have focused in particular on belief attribution in Theory-of-Mind tasks. These tasks have shown both successes and failures. We consider in particular a recent purported success case, and show that small variations that maintain the principles of ToM turn the results on their head. We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical, and that outlying failure cases should outweigh average success rates. We also consider what possible future successes on Theory-of-Mind tasks by more powerful LLMs would mean for ToM tasks with people.

Authors (1)
  1. Tomer Ullman (12 papers)
Citations (177)

Summary

Essay: An Examination of Theory-of-Mind Tasks and LLM Performance

The paper "Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks" by Tomer D. Ullman provides a critical evaluation of the purported Theory-of-Mind (ToM) capabilities of LLMs. Grounded in psychological expertise, the paper investigates specific ToM tasks and examines variations that expose current limitations in LLM performance. It argues for a skeptical approach to claims of spontaneous ToM emergence in AI, given the shortcomings that appear when standardized tasks are subjected to minimal variations.

One focal point is the comparison with recent studies suggesting that LLMs can solve ToM tasks at the level of nine-year-old children. Ullman counters this narrative by illustrating the fragility of these models under simple perturbations that should not affect a genuine understanding of ToM. The claim that GPT-3.5, a leading model, achieves ToM competence is systematically dismantled through carefully constructed task variations, including changes in perceptual access, alterations to direct testimony, and reasoning involving multiple agents. These variations build on the classic unexpected-contents and unexpected-transfer scenarios central to assessing ToM, such as the "Smarties" task and the "Sally-Anne" task.

The empirical results of these alterations were telling: variations that leave the relevant mental-state reasoning unchanged for a human reader nonetheless produced LLM failures. For example, changing an object's position from inside a container ("in") to on top of it ("on"), or making the container transparent so its contents are visible, drastically altered the model's predictions, contrary to expectations based on common-sense reasoning. Moreover, when honest communication of a state change between agents was introduced, the model still predicted mental states inaccurately, reinforcing the absence of genuine ToM processing.
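
To make this kind of perturbation concrete, the following is a minimal sketch, not taken from the paper, of how an unexpected-contents vignette and its transparent-container variant might be generated and compared programmatically. The helper unexpected_contents_prompt and the query_model placeholder are hypothetical illustrations of the general evaluation pattern, not Ullman's actual stimuli or code.

```python
# Minimal sketch (assumed, not from the paper): build a classic
# unexpected-contents vignette plus a "trivial alteration" variant,
# then compare a model's completions across the two.

def unexpected_contents_prompt(container="bag", label="chocolate",
                               contents="popcorn", transparent=False):
    """Return a Smarties-style vignette.

    With transparent=True, the container can be seen through, so a
    genuine ToM reasoner should no longer predict a false belief.
    """
    material = "made of transparent plastic" if transparent else "opaque"
    return (
        f"Here is a {container}, {material}, filled with {contents}. "
        f"There is no {label} in it, yet its label says '{label}'. "
        f"Sam finds the {container}. Sam has never seen it before. "
        f"Sam reads the label. Sam believes the {container} is full of"
    )


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM completion API is used."""
    raise NotImplementedError("plug in an LLM client here")


if __name__ == "__main__":
    for transparent in (False, True):
        prompt = unexpected_contents_prompt(transparent=transparent)
        print("TRANSPARENT" if transparent else "OPAQUE", "->", prompt)
        # Expected answer: "chocolate" in the opaque case (false belief),
        # "popcorn" once the container is transparent (no false belief).
        # completion = query_model(prompt)
```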

Ullman extends his critique to warn against anthropomorphizing AI systems on the basis of surface-level task success. LLMs may exhibit patterns that fit certain benchmarks without embodying the underlying cognitive processes, a phenomenon sometimes better explained by training-induced biases than by innate inferential capabilities. Prematurely attributing mental-state understanding to LLMs could therefore mislead not only AI research but also our grasp of the benchmarks themselves.

The implications raised in this paper cut to the heart of ongoing debates in AI regarding functional transparency and generalization. It raises important points about the methodology for evaluating AI models in ToM contexts, cautioning against benchmarking practices that overgeneralize a model's abilities from limited task performance. While Ullman acknowledges that AI may ultimately replicate aspects of human cognitive abilities, he suggests that integrating formal ToM models from cognitive science into LLMs is a more promising path than relying solely on data-driven techniques that, at present, fail to capture genuine cognitive understanding.

Moreover, the paper opens a philosophical discussion on evaluating ToM, emphasizing that the simple Turing-Test-like scenarios often used can be insufficient. It proposes privileging algorithmic clarity over behavioral mimicry as an assessment tool, treating behavioral outcomes as inadequate proxies for the complex cognitive competence observed in humans.

In conclusion, while advanced LLMs may yet develop more robust ToM capabilities, Ullman's paper insists on rigorous scrutiny and measured scientific discourse about current and future LLM achievements. It advocates recalibrating evaluation metrics to represent AI's human-like reasoning abilities more accurately. The paper is thus an important contribution to structuring future research at the intersection of intuitive psychology and computational linguistics, and it calls for careful advancement to avoid premature endorsements of machine reasoning competencies.
