
Theory of Mind Tests

Updated 27 January 2026
  • Theory of Mind tests are evaluations designed to assess the capacity of biological and artificial systems to attribute and update mental states such as beliefs, desires, and emotions.
  • They range from classical false-belief paradigms like the Sally-Anne test to advanced, multimodal benchmarks that require recursive and spatial reasoning.
  • Key metrics, including accuracy, macro-F1, and goal consistency, reveal persistent performance gaps between current models and human reasoning on higher-order tasks.

Theory of Mind (ToM) tests encompass a technically and methodologically diverse set of evaluations designed to assess the capacity of systems—biological or artificial—to attribute, track, and update beliefs, desires, intentions, and emotions of others, particularly in the presence of partial observability or informational asymmetry. The concept of ToM is central in cognitive science, developmental psychology, artificial intelligence, and neurocomputational modeling. Test paradigms have evolved from classic dyadic, text-only false-belief tasks to advanced, multimodal, and interactive settings, with sophisticated metrics and fine-grained error analyses.

1. Classical and Foundational ToM Test Paradigms

The majority of foundational ToM benchmarks derive from developmental psychology’s false-belief paradigms, notably the Sally-Anne and Smarties tests. These require participants to infer what another agent believes about the world when that belief is distinct from reality or the observer’s own knowledge. In formal terms, first-order ToM probes whether an agent $A$ holds a belief state $B_A(p)$ about a proposition $p$ (e.g., "Sally thinks the marble is in box A"), while higher-order ToM queries embed these recursively: $B_{A_2}[B_{A_1}(p)]$ ("Anne thinks Sally thinks...") (He et al., 2023).
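The nested-belief formalism above can be made concrete with a recursive data structure, so that first- and higher-order false-belief questions reduce to lookups. This is a minimal illustrative sketch (the data model is an assumption, not taken from any cited benchmark):

```python
# Hypothetical sketch: represent nested belief states B_{A2}[B_{A1}(p)]
# as recursively nested world models, one per agent.
from dataclasses import dataclass, field

@dataclass
class World:
    facts: dict = field(default_factory=dict)    # ground-truth propositions
    beliefs: dict = field(default_factory=dict)  # agent name -> that agent's World model

    def resolve(self, chain, prop):
        """Resolve B_{chain[0]}[B_{chain[1]}[...(prop)]] by walking nested models."""
        if not chain:
            return self.facts.get(prop)
        inner = self.beliefs.get(chain[0])
        return inner.resolve(chain[1:], prop) if inner else None

# Classic Sally-Anne setup: Anne moves the marble while Sally is away,
# so Sally's model of the world is stale.
sally_model = World(facts={"marble": "box A"})
world = World(facts={"marble": "box B"}, beliefs={"Sally": sally_model})

assert world.resolve([], "marble") == "box B"         # reality (order 0)
assert world.resolve(["Sally"], "marble") == "box A"  # first-order false belief
```

Deeper orders correspond to longer chains, e.g. `resolve(["Anne", "Sally"], "marble")` for a second-order query, provided Anne's model of Sally has been populated.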

The Sally-Anne paradigm has underpinned many early AI benchmarks, serving as a minimal diagnostic for false-belief reasoning (Alon et al., 31 Mar 2025). However, recent evaluations have emphasized that passing such tests does not constitute robust or general ToM, as models may rely on superficial cues or dataset artifacts.

2. Expansion to Multidimensional and Multimodal ToM Benchmarks

Modern ToM test development has addressed key limitations of classic paradigms by:

  • Expanding narrative complexity and breadth: OpenToM replaces short, templated stories with longer, intention-driven narratives featuring explicit character personalities, intention-motivated actions, and questions probing both physical and psychological mental states. Its 23-question taxonomy spans location tracking, multi-step ("multi-hop") reasoning, and attitudinal/social inference, revealing that current LLMs, while competent at basic physical beliefs, remain substantially below human performance on psychological or norm-based tasks (Xu et al., 2024).
  • Introducing multimodal and spatial reasoning: Benchmarks such as ToM-SSI and SoMi-ToM present agents operating and communicating within grid-worlds or simulated embodied environments. These tests demand multimodal (text, image, and spatial) integration and track performance on tasks requiring perspective-taking under movement constraints, group dynamics, and mixed cooperative-obstructive attitudes (Bortoletto et al., 5 Sep 2025, Fan et al., 29 Jun 2025). Metrics such as percept, belief, and intention accuracy—and their joint combinations—quantify cascading difficulties and pinpoint bottlenecks in model failure modes.
  • Probing robustness and theory-of-mind generalization: Explicit perturbation frameworks and complexity-graded datasets (e.g., Nickel et al.'s complexity classes) are used to reveal model fragility. Scenarios introducing transparent vs. opaque containers, automatic state changes, or uninformative labels test whether models can maintain invariant belief attribution under non-trivial manipulations (Nickel et al., 2024, Verma et al., 2024).
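A perturbation framework of the kind described in the last bullet can be sketched as a small harness that applies semantically neutral rewrites and checks whether the model's belief attribution survives. Everything here is an illustrative assumption; `ask_model` is a hypothetical stand-in for an actual LLM query:

```python
# Sketch of a perturbation-robustness check, assuming a hypothetical
# `ask_model(story, question)` function wrapping an LLM call.
def ask_model(story: str, question: str) -> str:
    # Placeholder: a real harness would query a model here.
    return "box A"

def invariance_score(base_story, perturbations, question, expected):
    """Fraction of semantically neutral rewrites on which the answer survives."""
    stories = [base_story] + [p(base_story) for p in perturbations]
    answers = [ask_model(s, question) for s in stories]
    return sum(a == expected for a in answers) / len(answers)

# Perturbations that should not change the answer for a human reader:
perturbs = [
    lambda s: s.replace("basket", "opaque box"),        # relabel the container
    lambda s: s + " The weather outside was sunny.",    # irrelevant context
]
story = ("Sally puts the marble in the basket and leaves. "
         "Anne moves the marble to box B.")
score = invariance_score(
    story, perturbs, "Where does Sally think the marble is?", "box A")
```

A score below 1.0 on perturbations like these is the signature of the fragility the cited frameworks are designed to expose; non-neutral perturbations (e.g., making the container transparent) should instead flip the expected answer.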

3. Higher-Order, Recursive, and Interaction-Based ToM Tests

ToM’s essential recursive character is stressed in higher-order tests such as HI-TOM, which evaluates first- through fourth-order embedding in belief reasoning (e.g., "Where does Alex think Sally thinks Anne thinks ...?"). Results show a steep accuracy decline as order increases, even for SOTA LLMs (e.g., GPT-4 drops from 98% at order 0 to 5% at order 4), with key failure modes including incomplete recursion, temporal confusion, and causal hallucination (He et al., 2023).
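Order-$k$ questions of the HI-TOM style can be generated mechanically from a chain of agents; this is an illustrative template, not the benchmark's exact phrasing:

```python
# Illustrative generator for order-k belief questions
# ("Where does A_k think A_{k-1} thinks ... the marble is?").
def order_k_question(agents, obj="the marble"):
    if not agents:  # order 0: a plain fact question
        return f"Where is {obj}?"
    chain = " ".join(f"does {a} think" if i == 0 else f"{a} thinks"
                     for i, a in enumerate(agents))
    return f"Where {chain} {obj} is?"

print(order_k_question([]))                          # order 0
print(order_k_question(["Sally"]))                   # order 1
print(order_k_question(["Alex", "Sally", "Anne"]))   # order 3
```

The steep accuracy decline reported by He et al. (2023) is measured over exactly this kind of graded question family, with each added agent deepening the required recursion by one level.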

Benchmarks such as FANToM and PersuasiveToM embed ToM assessments in multi-agent conversational contexts—dialogue games with information asymmetry, recursive beliefs about others' beliefs, and shifting strategic intentions. FANToM emphasizes non-merging of knowledge and multi-format consistency: correct answers must generalize across freeform, multiple-choice, and list-based belief and fact questions. PersuasiveToM, leveraging a BDI (belief-desire-intention) annotation schema, quantifies the difficulty of tracking evolving mental states across persuasive, adversarial dialogues (Kim et al., 2023, Yu et al., 28 Feb 2025).

4. Metrics, Evaluation Schemes, and Interpretability

ToM evaluation metrics have evolved from simple accuracy to more nuanced measures:

  • Raw and joint accuracy: For subtasks (percept, belief, intention) and their combinations (e.g., PBI accuracy) to capture dependency chains in inference (Bortoletto et al., 5 Sep 2025).
  • Macro-F1 and human agreement: Used, for example, in OpenToM, to assess multi-class reasoning stability and alignment with human annotators (Xu et al., 2024).
  • Turn and goal accuracy: For complex, multi-question scenes in robustness suites, with "goal" accuracy requiring perfect consistency across all scene sub-questions—a stringent bottleneck for current LLMs (Nickel et al., 2024).
  • Head-wise attention and linear probing: In multimodal ToM tests for interpretability, the internal separation of belief/perspective information is quantified via logistic probes trained on activation vectors of Transformer attention heads (Li et al., 17 Jun 2025).
  • Consistency and robustness: Metrics such as conviction (consistency across runs) and invariance under irrelevant context perturbation have become crucial for establishing "genuine" ToM, distinguishing semantic reasoning from pattern-matching (Verma et al., 2024).
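Several of the metrics above can be sketched in a few lines; the function names and data shapes here are illustrative assumptions rather than any benchmark's official implementation:

```python
# Hedged sketch of the metric family above: joint accuracy over chained
# subtasks, scene-level "goal" accuracy, and cross-run conviction.
from statistics import mean

def joint_accuracy(records):
    """PBI-style joint accuracy: percept, belief, AND intention all correct."""
    return mean(all(r[k] for k in ("percept", "belief", "intention"))
                for r in records)

def goal_accuracy(scenes):
    """A scene scores 1 only if every one of its sub-questions is correct."""
    return mean(all(scene) for scene in scenes)

def conviction(runs):
    """Fraction of items answered identically across repeated runs."""
    return mean(len(set(answers)) == 1 for answers in zip(*runs))

records = [{"percept": True, "belief": True, "intention": False},
           {"percept": True, "belief": True, "intention": True}]
assert joint_accuracy(records) == 0.5
assert goal_accuracy([[True, True], [True, False]]) == 0.5
assert conviction([["A", "B", "A"], ["A", "B", "B"]]) == 2 / 3
```

Macro-F1, as used in OpenToM, is the unweighted mean of per-class F1 scores (e.g., `sklearn.metrics.f1_score(..., average="macro")`), which prevents a dominant answer class from masking failures on rare mental-state labels.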

5. Principal Findings and Model Deficiencies

Empirical analysis of SOTA LLMs and multimodal models reveals:

  • Degenerate scaling: While larger models can excel at 0th/1st-order reasoning and solve classic false-belief tasks with high reliability, performance on higher-order, multi-turn, or adversarial settings often degrades rapidly. Fidelity to ground-truth belief states collapses as logical depth, scenario complexity, or context length increases (He et al., 2023, Duijn et al., 2023, Getachew et al., 23 Jun 2025).
  • Failure on dynamic, group, or spatial ToM: Multimodal benchmarks consistently show LLMs and VLMs underperforming in tasks requiring integration of spatial, visual, and multi-agent cues, with performance gaps exceeding 25–40% vs. humans (Bortoletto et al., 5 Sep 2025, Fan et al., 29 Jun 2025).
  • Prompt, template, and surface-form sensitivity: Model success is often contingent on specific question/prompt templates; variance across formats or narrative paraphrases frequently signals shortcut learning or lexically anchored heuristics (Ma et al., 2023, Thiyagarajan et al., 11 Jun 2025).
  • Robustness gaps: Perturbations that are semantically neutral for humans (irrelevant context, order of events, uninformative details) can significantly reduce model accuracy, revealing the superficiality of underlying representations (Verma et al., 2024, Nickel et al., 2024).
  • Inconsistent reasoning and lack of multi-modal grounding: Models frequently fail to maintain coherent or invariant answers across logically equivalent formats or maintain consistent belief updates across conversational turns or narrative windows (Kim et al., 2023, Bortoletto et al., 5 Sep 2025).

6. Testbeds for Rich Narrative and Contextual ToM

Cutting-edge evaluations have introduced large-scale, context-rich ToM testbeds:

  • OpenToM and CharToM: OpenToM uses long, human-revised, personality-anchored narratives with diverse, finely annotated question types to approximate the depth of real-world ToM, showing stark deficits in LLM tracking of psychological and norm-based beliefs (Xu et al., 2024). CharToM leverages classic novel texts, showing that both humans and LLMs perform better with increased global narrative context, but that humans outperform LLMs even when the latter have had access to entire works during pretraining—highlighting fundamental limitations in current ToM reasoning (Zhou et al., 3 Jan 2025).
  • ToMBench and derivatives: ToMBench presents a systematically constructed, contamination-controlled, multi-task assessment suite measuring 31 fine-grained social-cognitive abilities. High-parameter LLMs achieve strong but still subhuman accuracy, with largest deficits in scalar implicature, second-order beliefs, knowledge-pretend links, and story-level logical coherence (Chen et al., 2024, Thiyagarajan et al., 11 Jun 2025).

7. Future Directions and Methodological Recommendations

Best practices for rigorous ToM test development include:

  • Multi-format, multi-perspective, and multi-modal designs: Use of diverse narrative, conversational, visual, and action-based question formats avoids overfitting to narrow linguistic patterns (Li et al., 17 Jun 2025, Fan et al., 29 Jun 2025, Mukherjee et al., 2024).
  • Explicit benchmarking of higher-order, multi-agent, and dynamic ToM: Recursive, adversarial, and dialogue-based settings ensure that models cannot rely on default assumptions or single-shot inference (He et al., 2023, Yu et al., 28 Feb 2025).
  • Fine-grained error annotation: Systematic attribution of errors—e.g., recursion truncation, temporal order confusion, entity mis-tracking, reliance on heuristics (recency, surface overlap)—informs both training and architectural innovations (Nickel et al., 2024, Getachew et al., 23 Jun 2025).
  • Psychometric sophistication: Exploratory and confirmatory factor analysis, as in experimental ToM batteries (e.g., RPS-game analyses), clarifies latent structure and cross-domain influences among cognitive, spatial, and emotional ToM capacities (Nguyen et al., 7 Nov 2025).
  • Human-comparative, explainability-focused evaluation: Inclusion of human baselines (e.g., adult or child, with/without narrative exposure) and internal mechanism analysis via attention or prompt decomposition benchmarks model advances on robust, explanatory grounds (Li et al., 17 Jun 2025, Duijn et al., 2023).
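The psychometric analysis mentioned above can be illustrated with an exploratory factor analysis over per-participant subtask scores. The data below are synthetic, and the two-factor battery structure is an assumption for illustration, not the cited RPS-game design:

```python
# Illustrative exploratory factor analysis on synthetic ToM battery scores.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))              # two hypothesized latent factors
loadings = np.array([[0.9, 0.0], [0.8, 0.1],    # items loading on factor 1 ("cognitive")
                     [0.1, 0.85], [0.0, 0.9]])  # items loading on factor 2 ("emotional")
noise = 0.2 * rng.normal(size=(200, 4))
scores = latent @ loadings.T + noise            # observed item scores

fa = FactorAnalysis(n_components=2, random_state=0).fit(scores)
print(np.round(fa.components_, 2))              # recovered loading matrix, shape (2, 4)
```

A clean two-block loading pattern in the recovered components supports a multi-factor latent structure; confirmatory factor analysis would then test that structure against held-out data.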

Across all lines of research, the persistent performance gap relative to humans—especially on tasks requiring dynamic, multi-modal, or group-level inference—underscores the need for richer architectures, new training regimes, and principled, scalable benchmarks to advance machine Theory of Mind.
