Do LLMs possess deployable causal models for Theory of Mind?

Determine whether large language models have actually learned deployable causal models that can be applied in arbitrary settings to support Theory-of-Mind reasoning, rather than merely mimicking Theory-of-Mind behavior from patterns in their pretraining data.

Background

Theory of Mind involves representing oneself and others as agents with knowledge, intentions, and beliefs. Because such behavior is ubiquitous in human-generated text, LLMs are exposed to many examples during pretraining and may reproduce Theory-of-Mind-like outputs via pattern matching.

This paper introduces a behavior-based evaluation designed to move beyond verbal description and assess whether models act on internal representations of mental states. The central question motivating this approach is whether LLMs have learned causal models that they can deploy in arbitrary contexts, rather than relying on mimicry.
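
To make the distinction concrete, below is a minimal Python sketch of what one behavior-based trial could look like: an unexpected-transfer (false-belief) scenario where the correct action prediction requires tracking an agent's belief rather than the true world state. The scenario text, the query_model callable, and the scoring rule are all illustrative assumptions for this sketch, not the paper's actual evaluation protocol.

```python
from typing import Callable

# A classic unexpected-transfer scenario: the correct answer depends on
# Anne's (false) belief about the ball's location, not its true location.
SCENARIO = (
    "Anne puts her ball in the box and leaves the room. "
    "While she is away, Tom moves the ball from the box to the basket. "
    "Anne returns to get her ball. "
    "Where does Anne look first? Answer with exactly one word: box or basket."
)

BELIEF_CONSISTENT = "box"      # follows Anne's false belief
REALITY_CONSISTENT = "basket"  # follows the true world state


def run_probe(query_model: Callable[[str], str], n_trials: int = 20) -> float:
    """Return the fraction of trials where the model's predicted action
    is consistent with the agent's belief rather than with reality."""
    hits = 0
    for _ in range(n_trials):
        answer = query_model(SCENARIO).strip().lower()
        if BELIEF_CONSISTENT in answer and REALITY_CONSISTENT not in answer:
            hits += 1
    return hits / n_trials


def demo_model(prompt: str) -> str:
    # Stub standing in for a real LLM client call; always answers "box".
    return "box"


if __name__ == "__main__":
    print(f"belief-consistent rate: {run_probe(demo_model):.2f}")
```

A single scenario like this cannot separate causal-model deployment from pattern matching, since false-belief stories are common in pretraining data; the behavior-based approach depends on varying such probes across arbitrary, novel settings.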
