Theory of Mind Reasoning
- Theory of Mind (ToM) reasoning is the ability of agents to model and update representations of others' beliefs, desires, and intentions for social interaction.
- Benchmarking studies use multi-agent tasks and metrics such as retrieval accuracy to compare models across belief-history conditions, with current experiments covering the zero and finite cases and the infinite case left as an open benchmark target.
- Algorithmic paradigms such as recursive decomposition, symbolic tracking, and Bayesian inference enhance ToM performance and align AI reasoning with human mental processes.
Theory of Mind (ToM) Reasoning refers to an agent’s capacity to model, attribute, and update representations of others’ mental states—including beliefs, knowledge, desires, and intentions—and to use those representations for explanation, prediction, and social interaction. The computational and empirical study of ToM reasoning in artificial agents encompasses formal taxonomies of belief-history dependence, benchmark environments for multi-agent belief tracking, neuro-symbolic and statistical inference paradigms, and analyses of failure modes under scaling and real-world complexity.
1. Formal Taxonomy and Computational Foundations
A central formalization treats ToM as reasoning about agents’ belief states over time. For a multi-agent world with states $\mathcal{S}$, an agent $i$ possesses a belief distribution at round $t$: $b_i^t : \mathcal{S} \to [0,1]$ with $\sum_{s \in \mathcal{S}} b_i^t(s) = 1$. The history of beliefs supports a taxonomy of ToM reasoning modes (Tang et al., 2024):
- Zero Belief History: The model infers current beliefs from the present context $c^t$ alone: $\hat{b}_i^t = f(c^t)$. No memory or prior mental-state tracking is used.
- Finite Belief History: The model reasons over a window of length $k$: $\hat{b}_i^t = f(c^t, b_i^{t-k}, \dots, b_i^{t-1})$ with $1 \le k < \infty$.
- Infinite Belief History: The model leverages all prior beliefs $\{b_i^{\tau}\}_{\tau < t}$: $\hat{b}_i^t = f(c^t, b_i^{0}, \dots, b_i^{t-1})$.
Agent beliefs are dynamically updated: $b_i^{t+1} = U(b_i^t, o^{t+1})$, where $U$ is an update function given new observations $o^{t+1}$.
This taxonomy establishes the structural requirements for ToM algorithms and enables comparative evaluation across history lengths, belief recursion depth, and agent modularity.
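The three history modes and the update step can be illustrated with a minimal Python sketch. The names `select_history` and `update`, the dictionary representation of a belief, and the choice of a Bayesian-style multiplicative update are all assumptions made for illustration; they are not taken from Tang et al. (2024).

```python
from typing import Dict, List, Optional

# A belief is a probability distribution over world states (hypothetical representation).
Belief = Dict[str, float]


def select_history(history: List[Belief], k: Optional[int]) -> List[Belief]:
    """Return the past beliefs visible to the reasoner under each mode.

    k == 0    -> Zero Belief History (no past beliefs are used)
    k > 0     -> Finite Belief History (the last k beliefs)
    k is None -> Infinite Belief History (every past belief)
    """
    if k is None:
        return list(history)
    if k < 0:
        raise ValueError("window length must be non-negative")
    # history[-0:] would return the full list, so k == 0 is handled explicitly.
    return history[-k:] if k > 0 else []


def update(belief: Belief, likelihood: Dict[str, float]) -> Belief:
    """One possible update function: reweight each state by the likelihood
    of the new observation and renormalize (an assumed, Bayesian-style choice)."""
    unnormalized = {s: p * likelihood.get(s, 0.0) for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        # The observation is impossible under the current belief; keep it unchanged.
        return dict(belief)
    return {s: v / total for s, v in unnormalized.items()}


if __name__ == "__main__":
    b0 = {"locker_A": 0.5, "locker_B": 0.5}
    b1 = update(b0, {"locker_A": 1.0, "locker_B": 0.0})
    print(select_history([b0, b1], k=1))
```

Passing `k=None` for the infinite case keeps the three modes behind a single interface, which makes it straightforward to run the same evaluation loop across history lengths.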
2. Benchmarking and Evaluation Methodologies
Benchmarking ToM reasoning in artificial agents involves synthetic multi-agent games, reading comprehension tasks, persuasive dialogues, and social interaction environments.
A paradigmatic test is the “Pick the Right Stuff” multi-round benchmark (Tang et al., 2024), where LLMs must predict others’ retrieval actions from world states and observation histories. The experimental protocol compares six LLMs in both Zero and Finite Belief History conditions, under identical prompts and environments. Each prediction is scored as correct if the LLM’s predicted locker position matches the agent’s current belief.
Results are tabulated as mean retrieval accuracy—e.g., gemma:7b-instruct achieves 43.00/60 (Zero) vs. 34.33/60 (Finite), outperforming several much larger models.
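The scoring rule can be sketched as follows. This is a hedged illustration: it assumes each reported score counts correct predictions out of 60 trials (averaged over runs), and the function names `retrieval_score` and `as_percent` are hypothetical.

```python
from typing import Sequence


def retrieval_score(predicted: Sequence[int], believed: Sequence[int]) -> int:
    """Count trials where the predicted locker matches the locker the agent
    currently believes holds its item."""
    if len(predicted) != len(believed):
        raise ValueError("one prediction is required per trial")
    return sum(p == b for p, b in zip(predicted, believed))


def as_percent(score: float, max_score: float = 60) -> float:
    """Convert a mean score out of max_score into a percentage, rounded to one decimal."""
    return round(100 * score / max_score, 1)


if __name__ == "__main__":
    print(retrieval_score([1, 2, 3, 1], [1, 2, 2, 1]))
    # Reported gemma:7b-instruct means, expressed as percentages.
    print(as_percent(43.00), as_percent(34.33))
```

Under that assumption, the reported gemma:7b-instruct scores correspond to about 71.7% (Zero) and 57.2% (Finite).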
Similarly, the PersuasiveToM benchmark (Yu et al., 28 Feb 2025) evaluates LLMs' ability to track desires, beliefs, and intentions in multi-turn persuasive dialogues, using a BDI (Belief-Desire-Intention) formalism. Metrics include accuracy at reasoning (state inference), application (strategy selection), and outcome judgment. Models perform well on static attributes but degrade on dynamically shifting or dialogue-level mental state tracking.
In narrative contexts, CharToM (Zhou et al., 3 Jan 2025) demonstrates that global context knowledge (deep familiarity with a character’s background and the broader text) yields substantially higher human and model accuracy on belief, intention, emotion, and desire inference. Model accuracy remains flat as local window size increases, in contrast to humans, who benefit from larger context windows.
3. Algorithmic Paradigms for ToM Reasoning
Simulation and Recursive Decomposition
The Decompose-ToM framework (Sarangi et al., 15 Jan 2025) structurally simulates each agent’s perspective via recursive decomposition of multi-agent queries. Given a story $D$ and a chain-structured query (e.g., “Where does A think B thinks …?”), the procedure:
- Identifies the outermost subject $a$,
- Rewrites the question to a simpler, perspective-specific form,
- Extracts the sub-story agent $a$ is aware of (using a knowledge-access function $K(D, a)$),
- Recurses inward until reaching a narrator-level, factual query.
This yields significant gains on higher-order false-belief tasks (e.g., GPT-4o gains 40 percentage points over baselines at 4th order), while modular knowledge-access checks improve both accuracy and interpretability.
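The recursion can be illustrated with a toy, rule-based sketch. It is not the Decompose-ToM implementation: in the framework an LLM rewrites the question and judges knowledge access, whereas here every event carries an explicit witness set, and the names `Event`, `accessible`, and `answer` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Event:
    obj: str                  # object that was placed or moved
    location: str             # where it ended up
    witnesses: FrozenSet[str]  # agents who observed the event (assumed to be given)


def accessible(story: List[Event], agent: str) -> List[Event]:
    """Knowledge-access step: keep only the events the agent witnessed."""
    return [e for e in story if agent in e.witnesses]


def answer(story: List[Event], chain: Sequence[str], obj: str) -> Optional[str]:
    """Answer 'Where does chain[0] think chain[1] thinks ... obj is?'."""
    if not chain:
        # Narrator-level factual query: the most recent known location.
        for event in reversed(story):
            if event.obj == obj:
                return event.location
        return None
    outer, rest = chain[0], chain[1:]
    # Restrict the story to the outermost subject's view, then recurse inward.
    return answer(accessible(story, outer), rest, obj)


if __name__ == "__main__":
    story = [
        Event("ball", "basket", frozenset({"Sally", "Anne"})),
        Event("ball", "box", frozenset({"Anne"})),  # Sally has left the room
    ]
    print(answer(story, [], "ball"))
    print(answer(story, ["Sally"], "ball"))
    print(answer(story, ["Anne", "Sally"], "ball"))
```

Each level of recursion shrinks the story to what the current subject could know, so a higher-order question ends as a plain factual lookup over a filtered story.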
Temporal and Symbolic Tracking
Other approaches emphasize the explicit modeling of belief state evolution:
- TimeToM (Hou et al., 2024) constructs temporal spaces with per-agent time-stamped belief chains (Temporal Belief State Chains; TBSC). These are split into self-world and social-world components, supporting formal reduction of higher-order ToM to first-order queries over “belief-communication” intervals.
- RecToM (Lei et al., 10 Jun 2026) operationalizes nested-belief reasoning by recursively constructing each agent’s perspective from the prior agent in the chain, reducing $n$-th order ToM queries to zero-order factual queries within the appropriate constructed subspace.
- SymbolicToM (Sclar et al., 2023) uses plug-and-play graph-based decoders to systematically track the world state and agent beliefs up to a fixed order, extracting only the relevant subgraph per query.
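Explicit belief tracking of the kind described in the list above can be sketched as a table of nested belief states that is updated event by event. This is a simplified illustration in the spirit of graph-based trackers, not the SymbolicToM, TimeToM, or RecToM code; the class `BeliefTracker`, its methods, and the flat object-to-location state are assumptions.

```python
from itertools import product
from typing import Dict, Iterable, Optional, Sequence, Set, Tuple

Chain = Tuple[str, ...]  # () is the true world state; ("A", "B") is what A thinks B believes


class BeliefTracker:
    def __init__(self, agents: Iterable[str], max_order: int) -> None:
        agents = list(agents)
        # One object -> location map per belief chain, up to max_order.
        self.states: Dict[Chain, Dict[str, str]] = {(): {}}
        for order in range(1, max_order + 1):
            for chain in product(agents, repeat=order):
                # Skip chains such as ("A", "A"), which add no new perspective.
                if all(a != b for a, b in zip(chain, chain[1:])):
                    self.states[chain] = {}

    def observe(self, obj: str, location: str, witnesses: Set[str]) -> None:
        """Update every belief chain whose members all witnessed the event."""
        for chain, state in self.states.items():
            if all(agent in witnesses for agent in chain):
                state[obj] = location

    def query(self, chain: Sequence[str], obj: str) -> Optional[str]:
        """Look up only the belief state relevant to the question."""
        return self.states[tuple(chain)].get(obj)


if __name__ == "__main__":
    tracker = BeliefTracker(["Sally", "Anne"], max_order=2)
    tracker.observe("ball", "basket", {"Sally", "Anne"})
    tracker.observe("ball", "box", {"Anne"})
    print(tracker.query(("Anne", "Sally"), "ball"))
```

The empty chain always passes the witness check, so the true world state is updated on every event, while a belief chain goes stale as soon as one of its agents misses an event.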
Probabilistic and Neuro-Symbolic Approaches
ToM has also been grounded in Bayesian (Pöppel et al., 2019, Zhang et al., 2 Jun 2025) and dynamic epistemic logic (DEL) frameworks (Tang et al., 2024, Wu et al., 22 May 2025):
- Bayesian ToM: Full Bayesian inverse planning over latent goals and belief states, sometimes using satisficing meta-strategies (e.g., switching between specialist models when surprise exceeds a threshold) to match human performance efficiently (Pöppel et al., 2019).
- DEL-ToM: LLMs produce symbolic belief-update traces, and a verifier trained on labels from DEL simulation scores each step. This inference-time scaling lets small LLMs approach the ToM accuracy of much larger models through search and verification, without retraining (Wu et al., 22 May 2025).
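Bayesian inverse planning can be illustrated with a small goal-inference example. The sketch below assumes a one-dimensional world, a uniform prior over goals, and a Boltzmann-rational agent with rationality parameter `beta`; these modeling choices and the function names are illustrative assumptions and do not reproduce the model of Pöppel et al. (2019), including its satisficing model switching.

```python
import math
from typing import Dict, Sequence


def action_likelihood(pos: int, action: int, goal: int, beta: float = 2.0) -> float:
    """P(action | position, goal) for a softmax-rational agent that prefers
    steps (-1 or +1) ending closer to its goal."""
    utilities = {a: -abs(pos + a - goal) for a in (-1, 1)}
    normalizer = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[action]) / normalizer


def goal_posterior(start: int, actions: Sequence[int], goals: Sequence[int],
                   beta: float = 2.0) -> Dict[int, float]:
    """Posterior over candidate goals given an observed action sequence."""
    log_post = {g: 0.0 for g in goals}  # uniform prior (constant term dropped)
    pos = start
    for action in actions:
        for g in goals:
            log_post[g] += math.log(action_likelihood(pos, action, g, beta))
        pos += action
    # Normalize in log space for numerical stability.
    peak = max(log_post.values())
    unnormalized = {g: math.exp(v - peak) for g, v in log_post.items()}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


if __name__ == "__main__":
    # Two steps to the right make the goal at +3 far more plausible than the one at -3.
    print(goal_posterior(start=0, actions=[1, 1], goals=[-3, 3]))
```

The observer inverts the agent's assumed planning model: actions that are likely under a goal raise that goal's posterior, which is the core of inverse planning regardless of how rich the planner is.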
Memory-augmented models (e.g., ToMMY (Nguyen et al., 2023)) use explicit hierarchical memory for storing and selectively retrieving relevant past episodes, yielding improved false-belief and intention inference in dynamic environments.
4. Empirical Results and Model Limitations
A consistent empirical pattern is that:
- Performance drops with increased history and recursion: Across models and tasks, Zero Belief History tasks are easier than Finite, and both are easier than scenarios requiring recursive, infinite, or higher-order belief tracking. For example, Zero Belief History results in a mean score of 32.06/60 across LLMs, while Finite triggers a drop to 26.33/60 (Tang et al., 2024). On higher-order benchmarks, even GPT-4 joint accuracy drops sharply after second order, especially under deceptive communication (He et al., 2023).
- Smaller models can match or outperform larger counterparts: In belief-tracking settings, 7B-parameter models (Gemma, Mistral) outperformed 70B-parameter models, indicating that architectural inductive biases or training protocols may be more determinative of ToM performance than size alone (Tang et al., 2024).
- Real-world contextual and psychological complexity remains challenging: While LLMs can track static goals and some belief-polarity questions, they underperform at tracking shifting desires, updating nested intentions, and maintaining holistic consistency across dialogue or long-form context (Yu et al., 28 Feb 2025, Zhou et al., 3 Jan 2025).
Failure modes across benchmarks include insufficient reasoning depth (skipped recursion), temporal ignorance (event ordering errors), surface-level cue matching, shortcut heuristics, and hallucination. For high-order questions or those requiring indirect causal inference, accuracy degrades more sharply and error rates become dominated by non-systematic mistakes (He et al., 2023, Zhou et al., 3 Jan 2025).
5. Practical and Theoretical Implications
The field has established that:
- Complex ToM tasks require explicit mental-state tracking and recursive simulation. Approaches that encode or construct beliefs, either via symbolic execution (DEL, graph trackers), temporal belief chains, or perspective recursion, outperform pure monolithic neural models.
- Structured decomposition, memory augmentation, and external verifiers close performance gaps. Neuro-symbolic delegation to logic engines, selective memory retrieval, and multi-step belief trace scoring allow smaller or less-capable models to scale ToM abilities at inference time.
- Rich ToM extends beyond beliefs. Desire, intention, emotion, social norm, and action planning must be integrated for real-world and embodied multi-agent interaction (Yu et al., 28 Feb 2025, Zhang et al., 28 Nov 2025).
- Resource-rationality is essential. Causal models now specify when mentalizing is warranted (balancing information asymmetry, analytical tractability, and social complexity), codifying the cost-benefit engagement of ToM modules in real-time AI systems (Gurney, 15 Jun 2026).
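The trace-scoring idea in the second point above can be sketched as best-of-N selection under a step-level verifier. This is a hedged illustration: the verifier here is a stub passed in by the caller, multiplying step scores is only one possible aggregation rule, and `trace_score` and `best_of_n` are hypothetical names rather than the DEL-ToM interface.

```python
import math
from typing import Callable, List, Sequence


def trace_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step verifier scores into one trace score.
    The product is an assumed choice; an empty trace scores 0."""
    return math.prod(step_scores) if step_scores else 0.0


def best_of_n(traces: List[List[str]], verifier: Callable[[str], float]) -> List[str]:
    """Return the candidate belief-update trace the verifier rates highest."""
    if not traces:
        raise ValueError("at least one candidate trace is required")
    return max(traces, key=lambda trace: trace_score([verifier(step) for step in trace]))


if __name__ == "__main__":
    # Stub verifier: a real one would be a trained model scoring each belief update.
    stub_scores = {"Sally saw the move": 0.2, "Sally did not see the move": 0.9,
                   "Sally thinks: basket": 0.9, "Sally thinks: box": 0.3}
    candidates = [
        ["Sally saw the move", "Sally thinks: box"],
        ["Sally did not see the move", "Sally thinks: basket"],
    ]
    print(best_of_n(candidates, lambda step: stub_scores[step]))
```

Because selection happens at inference time, the generator model is left unchanged; only the number of sampled traces and the verifier determine the extra cost.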
6. Open Challenges and Future Directions
Research priorities highlighted across the literature include:
- Infinite Belief History and Unbounded Recursion: Developing interactive and open-ended benchmarks where agents must manage arbitrarily long or structurally unbounded belief chains (Tang et al., 2024).
- Broader Social-Cognitive Scope: Moving beyond belief to nested desires, intentions, and affective states; extending ToM to cultural, social, and multi-agent negotiation settings.
- Adaptive and Aligned Reasoning Depth: Designing agents that estimate and align with partners’ ToM order for effective coordination, especially in multi-agent or human-AI teams (Mu et al., 17 Mar 2026).
- Integration with Embodied, Multimodal Agents: Embedding ToM reasoning modules within perception–reasoning–planning pipelines, leveraging multimodal input and BDI (belief-desire-intention) hierarchies for embodied social intelligence (Zhang et al., 28 Nov 2025).
- Robustness to Perturbation and Deceptive Scenarios: Testing and enhancing ToM reasoning under perturbed, adversarial, or out-of-distribution conditions to assess genuine generalization capability (Nickel et al., 25 Feb 2026).
Methodological recommendations include adopting prompt compression, perspective caching, neuro-symbolic hybrids, and reinforcement and curriculum learning over multi-agent social tasks, as well as scaling evaluation tasks to richer and more realistic domains.
Key References:
- “Zero, Finite, and Infinite Belief History of Theory of Mind Reasoning in LLMs” (Tang et al., 2024)
- “PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues” (Yu et al., 28 Feb 2025)
- “HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in LLMs” (He et al., 2023)
- “Decompose-ToM: Enhancing Theory of Mind Reasoning in LLMs through Simulation and Task Decomposition” (Sarangi et al., 15 Jan 2025)
- “TimeToM: Temporal Space is the Key to Unlocking the Door of LLMs' Theory-of-Mind” (Hou et al., 2024)
- “A Causal Model of Theory of Mind in Conflict for Artificial Intelligence” (Gurney, 15 Jun 2026)
- “DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic” (Wu et al., 22 May 2025)
- “Adaptive Theory of Mind for LLM-based Multi-Agent Coordination” (Mu et al., 17 Mar 2026)
- “Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in LLMs” (Nickel et al., 25 Feb 2026)