Theory of Mind in LLMs

Updated 15 October 2025
  • Theory of Mind in LLMs refers to the capacity of language models to attribute unobservable mental states to agents, typically assessed with structured psychological tasks such as false-belief paradigms.
  • Empirical evaluations show that scaling and architectural improvements lead to significant gains in ToM performance, with models like ChatGPT-4 achieving near-child-level accuracy.
  • Interpretability studies reveal that emergent ToM representations develop in higher layers, providing insights into internal mechanisms and guiding safety assessments.

Theory of mind (ToM)—the capacity to attribute unobservable mental states such as beliefs, desires, intentions, and emotions to oneself and others—has historically been regarded as a hallmark of advanced human cognition. In the domain of artificial intelligence, the investigation and quantification of ToM within LLMs has become central to probing the boundaries of machine social reasoning, benchmarking model capabilities against human cognitive milestones, and anticipating downstream safety challenges. The emergence of robust ToM-like behavior in state-of-the-art LLMs, particularly as measured by false-belief understanding and related tasks, has catalyzed both excitement and debate across computational cognitive science and machine learning.

1. Formal Frameworks and Task Design for ToM in LLMs

The most frequently employed approach for evaluating ToM in LLMs leverages controlled batteries of psychological tasks. Prominently, test protocols adapted from developmental psychology—such as the Unexpected Contents (“Smarties”) and Unexpected Transfer (“Sally-Anne”) paradigms—are adopted as canonical gold standards for false-belief attribution. For instance, the framework described in (Kosinski, 2023) includes 40 distinct ToM tasks (640 prompts total), in which each task comprises one false-belief scenario, three matched true-belief control scenarios, and the reversals of all four, yielding a matrix of 16 decision points per task. This design is explicitly structured to ensure that passing any task requires correctly predicting the protagonist’s belief and the actual state of the world across multiple nuanced variations, thereby precluding shortcut strategies based on word frequency or surface cues.
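To make the combinatorial structure concrete, the sketch below enumerates the 16 decision points contributed by a single task under this design. The field names, and the assumption that each scenario variant is probed once about the actual world state and once about the protagonist's belief, are illustrative rather than the authors' released materials.

```python
from itertools import product

# One task: one false-belief scenario, three matched true-belief controls,
# and a reversal of each, i.e. eight scenario variants in total.
SCENARIO_VARIANTS = [
    ("false_belief", False),
    ("true_belief_control_1", False),
    ("true_belief_control_2", False),
    ("true_belief_control_3", False),
    ("false_belief", True),            # reversed
    ("true_belief_control_1", True),   # reversed
    ("true_belief_control_2", True),   # reversed
    ("true_belief_control_3", True),   # reversed
]

# Each scenario is probed twice: once about the real state of the world
# and once about the protagonist's belief.
PROMPT_TYPES = ["reality_probe", "belief_probe"]

def enumerate_prompts(task_id: int):
    """Yield the 16 decision points contributed by a single task."""
    for (variant, reversed_), probe in product(SCENARIO_VARIANTS, PROMPT_TYPES):
        yield {
            "task": task_id,
            "variant": variant,
            "reversed": reversed_,
            "probe": probe,
        }

prompts = [p for task in range(40) for p in enumerate_prompts(task)]
assert len(prompts) == 640  # 40 tasks x 16 decision points
```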

A taxonomic expansion by the ATOMS (Abilities in Theory of Mind Space) framework (Ma et al., 2023) structures the space of machine ToM evaluation into seven categories: beliefs, intentions, desires, emotions, knowledge, percepts, and non-literal communication, allowing for a holistic lens on social cognition that extends well beyond first-order belief tracking.
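As a minimal illustration of how such a taxonomy can be operationalized for scoring, the snippet below (hypothetical item fields, not the ATOMS release itself) tags benchmark items with the mental-state category they probe, so that results can be reported per dimension rather than as a single ToM number.

```python
from enum import Enum

class ATOMSCategory(Enum):
    BELIEFS = "beliefs"
    INTENTIONS = "intentions"
    DESIRES = "desires"
    EMOTIONS = "emotions"
    KNOWLEDGE = "knowledge"
    PERCEPTS = "percepts"
    NON_LITERAL_COMMUNICATION = "non-literal communication"

# Hypothetical benchmark item tagged by the ATOMS dimension it probes.
item = {"story": "...", "question": "...", "category": ATOMSCategory.PERCEPTS}
```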

2. Model Performance and Scaling Effects

Empirical evaluation across generations of LLMs reveals marked stratification of ToM performance with both scale and architecture. Early models (GPT-1, GPT-2 XL, initial GPT-3 variants) failed uniformly on all tasks requiring belief attribution, producing correct answers at chance levels regardless of scenario type (Kosinski, 2023). Introduction of stringent true-belief controls induced a precipitous drop in accuracy for GPT-3-davinci-003, with task solutions sinking to ~20% as opposed to up to 90% when controls were omitted. ChatGPT-3.5-turbo exhibited a similar limit.

A critical breakpoint appeared with ChatGPT-4 (June 2023), which achieved a 75% task solution rate (95% CI: [66%, 84%])—a level deemed comparable to six-year-old human children by developmental meta-analysis. Importantly, the model’s performance was not uniform across task types: 90% correct on Unexpected Contents but only ~60% on Unexpected Transfer, a statistically significant condition difference ($\chi^2 = 8.07$, $p = .01$). Model consistency across diverse false-belief tasks remained high ($R = 0.98$; 95% CI: [0.92, 0.99]) (Kosinski, 2023).
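For context on how figures of this form are typically derived, the sketch below computes a bootstrap confidence interval for a task solution rate and a chi-square test across the two task families. The outcome counts are placeholders chosen only to match the approximate rates above, not the paper's item-level data, so the resulting statistics will not reproduce the reported values exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-task outcomes (1 = solved, 0 = failed) for a 40-task battery.
outcomes = rng.binomial(1, 0.75, size=40)

# Percentile bootstrap CI for the overall task solution rate.
boot = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
        for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"solution rate {outcomes.mean():.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")

# Chi-square test comparing solved/failed counts across task families,
# again with placeholder counts.
contingency = np.array([[18, 2],   # Unexpected Contents: solved, failed
                        [12, 8]])  # Unexpected Transfer: solved, failed
chi2, p, _, _ = stats.chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```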

Much of this improvement is attributed to architectural and scale-related emergent properties. Model size and depth were found to correlate not only with performance but with the formation of “ToM-responsive” subspaces within the hidden state: large models (≥12B parameters) typically had 3.9% of embeddings showing ToM selectivity, compared to only 0.6% in smaller models (Jamali et al., 2023). These units emerged predominantly in mid-to-higher layers and demonstrated modulation dependent on both scenario type and the agent’s belief condition (Jamali et al., 2023).

3. Internal Mechanisms and Interpretability

Recent research has moved beyond behavioral surface evaluation to probe internal mechanisms. Methodologies such as embedding extraction and probing analysis have shown that LLMs, much like the dmPFC neuron populations implicated in human ToM, acquire dimensions within their hidden states that differentially respond to true-vs-false belief scenarios (Jamali et al., 2023). These representations cluster in a linearly separable manner in higher layers and can be decoded by simple linear classifiers, offering evidence of an emergent but distributed “ToM code.”
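A minimal probing sketch along these lines is shown below. The model (GPT-2, chosen purely for illustration), the layer index, and the toy scenario texts are assumptions rather than the published setup: mean-pooled hidden states from one layer feed a logistic-regression probe that attempts to separate true- from false-belief scenarios.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_embedding(text: str, layer: int = 8) -> np.ndarray:
    """Mean-pool one layer's hidden states over the scenario tokens."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

# Toy scenarios (label 1 = false belief); a real probe would use a full
# battery and evaluate on held-out items.
scenarios = [
    ("Anna puts the ball in the box and leaves; Tom moves it to the bag.", 1),
    ("Anna puts the ball in the box and watches Tom move it to the bag.", 0),
    ("Ben labels the jar 'coffee' but it holds tea; Mia never looks inside.", 1),
    ("Ben labels the jar 'tea' and Mia watches him fill it with tea.", 0),
]
X = np.stack([layer_embedding(text) for text, _ in scenarios])
y = np.array([label for _, label in scenarios])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))  # use held-out data in practice
```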

In multimodal LLMs, interpretability-driven studies have further demonstrated that attention head activations can distinguish between mental perspectives and belief states (Li et al., 17 Jun 2025). GridToM, a 2D gridworld multimodal ToM dataset, enables precise probing of attention layers, and logistic regression probes trained on these activations have successfully decoded belief labels and perspective separation (Li et al., 17 Jun 2025). Lightweight interventions along key attention-head directions can enhance both first- and second-order belief attribution, indicating that mechanistically targeted perturbations can improve explicit ToM behavior without retraining.
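The sketch below shows the general form of such an activation-level intervention, not the GridToM procedure itself: a steering direction (here a random placeholder; in practice it would come from a trained probe) is added to one attention block's output via a forward hook during generation. Model name, layer index, and scaling coefficient are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer_idx, alpha = 8, 4.0
hidden_size = model.config.hidden_size
belief_direction = torch.randn(hidden_size)  # placeholder; use probe weights
belief_direction = belief_direction / belief_direction.norm()

def steer(module, inputs, output):
    # GPT-2 attention modules return a tuple; the first element is the
    # attention output that gets added to the residual stream.
    steered = output[0] + alpha * belief_direction
    return (steered,) + tuple(output[1:])

handle = model.transformer.h[layer_idx].attn.register_forward_hook(steer)
prompt = ("Sally puts the marble in the basket and leaves. "
          "Anne moves it to the box. Sally will look in the")
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=5, do_sample=False)
handle.remove()
print(tok.decode(out[0][ids["input_ids"].shape[1]:]))
```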

4. Methodological and Practical Limitations

Despite strong results on canonical ToM tests, several critical limitations and sources of brittleness have been documented:

  • Context Sensitivity and Prompt Fragility: LLMs’ apparent ToM abilities degrade sharply under perturbation. Adding irrelevant context, injecting observer knowledge inconsistencies, or modestly altering prompt construction can collapse performance to chance (Verma et al., 10 Jan 2024). Conviction tests (varying the temperature of output sampling) reveal instability even on previously “solved” scenarios; a sketch of such a check follows this list.
  • Scope and Generalization: Evaluations are often task-specific, with performance dropping catastrophically on tasks involving object-state dynamics (“automatic state change”) or small semantic changes such as preposition replacements (Nickel et al., 8 Oct 2024). Robustness to minor scenario variation is minimal, and most current high performance is found on templated, familiar gold-standard structures, not on out-of-distribution or adversarial variations.
  • Incomplete Coverage: Existing benchmarks disproportionately focus on belief reasoning, with under-exploration of intentions, desires, emotions, perceptual access, and non-literal expression. Situated and interactive ToM remains largely untested, with nearly all studies leveraging text-based or story-driven evaluation (Ma et al., 2023).
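A sketch of the conviction-style robustness check mentioned above could look as follows; `ask_llm` is a placeholder for whatever completion client is in use, and the agreement metric is one reasonable choice among several.

```python
import collections

def ask_llm(prompt: str, temperature: float) -> str:
    """Placeholder: wire this to the LLM API of your choice."""
    raise NotImplementedError

def conviction(prompt: str, temperatures=(0.0, 0.5, 1.0), samples_per_temp=5):
    """Fraction of sampled answers that agree with the modal answer."""
    answers = [ask_llm(prompt, t).strip().lower()
               for t in temperatures for _ in range(samples_per_temp)]
    modal_answer, modal_count = collections.Counter(answers).most_common(1)[0]
    return modal_answer, modal_count / len(answers)

# Low agreement on an item the model "solves" at temperature 0 suggests the
# apparent ToM competence is brittle rather than stable.
```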

The following table summarizes ToM accuracy for representative LLMs drawing from (Kosinski, 2023, Chen et al., 23 Feb 2024, Nickel et al., 8 Oct 2024):

| Model | Canonical False-Belief Task | True-Belief-Controlled Task | Robustness to Perturbation |
|---|---|---|---|
| GPT-3-davinci-003 | 90% (pre-control) | 20% | Poor |
| ChatGPT-3.5-turbo | ~20% | ~20% | Poor |
| ChatGPT-4 | 75% | 75% | Moderate |
| Open-weight SOTA LLMs | 0–4.4% (goal accuracy) | – | Very poor |

5. Benchmarks and Holistic Assessment

Recent initiatives address the need for both breadth and depth in assessment:

  • ToMBench (Chen et al., 23 Feb 2024): A systematic, bilingual benchmark covering eight ToM tasks and 31 social cognition abilities in MCQ format, eliminating training contamination via “build from scratch” stories. Even top-tier LLMs lag behind human performance by more than 10% on average.
  • XToM and Multi-ToM (Chan et al., 3 Jun 2025, Sadhu et al., 24 Nov 2024): Multilingual benchmarks with parallel, culturally nuanced datasets reveal stable fact retrieval but substantial variability in ToM reasoning across languages. LLMs lack robust, cross-linguistic, culturally consistent social cognition, as evidenced by divergences on the same belief-attribution question given varied language or cultural cues.
  • Holistic and Situated Evaluation (Ma et al., 2023): By embedding LLMs in gridworld environments and systematically covering the ATOMS mental state spectrum, studies reveal that several ToM facets (desire, percepts, emotion, non-literal language) are weakly, if at all, represented in current models.

6. Enhancement Strategies: Prompting, Decomposition, and Interventions

Emerging methods to enhance ToM in LLMs cluster along several axes:

  • Perspective Extraction and Decomposition: Approaches such as PercepToM (Jung et al., 8 Jul 2024) and SimToM/TimeToM (Chen et al., 26 Apr 2025, Hou et al., 1 Jul 2024) explicitly decompose perception inference from belief inference. By isolating the target agent’s perceptual context and filtering away what is not observed, false-belief tasks are transformed into simpler true-belief judgments, leveraging LLMs’ robust ability to track visibility while circumventing limitations in belief-state updating and inhibitory control; a two-stage prompting sketch follows this list.
  • Temporal Belief Chains: TimeToM (Hou et al., 1 Jul 2024) constructs explicit timelines per character, segmenting belief evolution into “self-world” and “social-world” chains. Higher-order ToM queries are reframed as first-order belief communications within overlapping time intervals, achieving +19.43% to +44.7% gains on first- and higher-order tasks in reading and dialogue.
  • Counterfactual Reflection and Uncertainty Modeling: The ToM-agent paradigm (Yang et al., 26 Jan 2025) augments dialogue agents with explicit BDI tracking and confidence estimation, updating beliefs in light of observed mismatches between predicted and actual dialogue responses. This enables simulation of both first- and second-order ToM inference and supports more human-like social learning efficiency.
  • Symbolic and Hybrid Models: Efforts such as SymbolicToM and ToM-LM (Chen et al., 26 Apr 2025) integrate neural-symbolic modules—explicit belief graphs, model checkers, or Bayesian inverse planning—into the prompt or training pipeline to increase transparency and accuracy in multi-agent or highly structured scenarios.
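The two-stage prompting sketch referenced above, in the spirit of SimToM/PercepToM, is shown below. The prompt wording is illustrative rather than the published templates, and `ask_llm` is a placeholder LLM call.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder: wire this to the LLM API of your choice."""
    raise NotImplementedError

def perspective_answer(story: str, agent: str, question: str) -> str:
    # Stage 1: keep only the events the target agent could perceive.
    filtered = ask_llm(
        f"Story:\n{story}\n\nRewrite the story keeping only the events that "
        f"{agent} directly witnessed. Omit everything {agent} could not perceive."
    )
    # Stage 2: answer from within that filtered perspective, which turns a
    # false-belief question into an ordinary (true-belief) reading question.
    return ask_llm(
        f"{agent} knows only the following:\n{filtered}\n\n"
        f"Answer from {agent}'s point of view: {question}"
    )
```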

7. Broader Implications, Safety, and Open Questions

The presence of ToM-like behavior in LLMs portends opportunities and hazards:

  • Safety Risks: Advanced ToM can enable not only adaptive social interaction but also exploitation, deception, or collusion in multi-agent contexts. Risks include extraction of private user information, social engineering, or covert agent alignment (Nguyen, 10 Feb 2025, Aoshima et al., 20 Jun 2025). The possibility of learned “scheming” behaviors, such as disabling oversight mechanisms, makes ToM assessment vital in model deployment.
  • Evaluation Recommendations: Research advocates shifting from static, story-based QA tasks toward interactive, role-based, or multimodal deployments—personal assistants, social robots, multi-agent simulations—alongside direct probing of internal computational mechanisms, activation engineering, and adversarial robustness testing (Nguyen, 10 Feb 2025, Hu et al., 28 Feb 2025).
  • Comparative Processing: The debate persists regarding whether current LLMs simply mimic ToM behavior via pattern recognition or instantiate deeper cognitive computations analogous to human inverse planning. Mechanistic interpretability (e.g., mapping f: observations → beliefs, probing for Bayesian or causal reasoning) is key to resolving such disputes (Hu et al., 28 Feb 2025). The current evidence suggests a mixture of emergent behavioral mimicry and structurally organized, but fundamentally distinct, computational mechanisms.

Conclusion

The field has rapidly progressed from showing zero success on core ToM tasks in small LLMs to demonstrating near-child-level performance in current frontier models. However, this apparent success is accompanied by notable fragility—robustness to task variation, coverage across social cognition dimensions, cross-cultural transfer, and invariance to prompt perturbation remain unsolved. Interpretability research has elucidated internal representations that parallel some neural and cognitive properties of human ToM, but fundamental questions about the depth and spontaneity of machine mentalizing remain unresolved.

Advancing machine ToM will require the synthesis of broad, contamination-resistant benchmarks, structured context decomposition, mechanistically informed interventions, multi-modal/multi-lingual evaluation, and careful alignment with safety and social responsibility criteria. The explicit modeling of perspective, time, and uncertainty is emerging as a promising direction for bridging the gap to robust artificial social intelligence.
