Higher-Order ToM in LLMs

Updated 3 July 2025
  • Higher-order Theory of Mind is the cognitive capacity for recursive mental state reasoning among multiple agents, crucial for understanding nested beliefs.
  • Recent benchmarks reveal that while LLMs perform well on lower-order tasks, accuracy sharply declines at higher orders, indicating persistent challenges.
  • Advanced prompting strategies and neuro-symbolic frameworks enhance LLMs’ higher-order ToM capabilities, offering paths to improved social cognition and safe AI alignment.

Higher-order Theory of Mind (ToM) in LLMs refers to the ability of these systems to recursively represent and reason about the beliefs, intentions, perspectives, and knowledge of multiple agents, including those nested within each other (e.g., “Anne thinks that Bob believes the apple is in the box”). This capability is crucial for advanced social cognition and communication, underpinning collaboration, negotiation, deception, and many aspects of human interaction. Recent research has begun to systematically probe and benchmark higher-order ToM in LLMs, revealing both significant progress and persistent limitations.

1. Foundations of Higher-Order Theory of Mind in LLMs

Theory of Mind is defined as the cognitive ability to attribute mental states to oneself and others, distinguishing perspectives and beliefs that may differ from one’s own or from objective reality. While earlier research and benchmarks focused primarily on first-order ToM (what one agent believes), higher-order ToM targets chains of mental state attributions (second-order: what agent A believes that agent B believes; n-th order: recursive chains up to “I think that you think that she knows…”).

Evaluation in LLMs adopts this structure. For instance, a fourth-order ToM query would ask: “Where does Anne think that Bob thinks that Sally thinks that Tom believes the marble is hidden?” (2310.16755).
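To make the recursive structure explicit, the following minimal Python sketch composes an n-th order belief question of this form; the agent names, verb choices, and phrasing are illustrative rather than taken from any benchmark's actual templates.

```python
def nth_order_tom_question(agents, proposition):
    """Build a nested belief question whose order equals len(agents).

    agents: attributors ordered outermost first, e.g.
            ["Anne", "Bob", "Sally", "Tom"] yields a fourth-order question.
    proposition: the ground-level fact, e.g. "the marble is hidden".
    """
    if len(agents) < 2:
        raise ValueError("use two or more agents; a first-order question "
                         "is phrased without nesting")
    # Innermost clause: the last agent's belief about the world state.
    clause = f"{agents[-1]} believes {proposition}"
    # Wrap each intermediate agent around it, from the inside out.
    for name in reversed(agents[1:-1]):
        clause = f"{name} thinks that {clause}"
    # The outermost attribution is phrased as the question itself.
    return f"Where does {agents[0]} think that {clause}?"


print(nth_order_tom_question(["Anne", "Bob", "Sally", "Tom"],
                             "the marble is hidden"))
# Where does Anne think that Bob thinks that Sally thinks that
# Tom believes the marble is hidden?
```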

2. Benchmarking and Methodologies for Higher-Order ToM

The introduction of systematic higher-order ToM benchmarks has enabled rigorous evaluation of LLMs’ capabilities in this domain.

  • HI-TOM (2310.16755): Explicitly evaluates LLMs on third- and fourth-order ToM questions using stories with agent actions, witnessed/unwitnessed events, and varying communication (including deception). Each story is paired with multiple questions targeting orders 0–4.
    • Evaluation Metrics:
      • Standard Accuracy: % of questions answered correctly at each order.
      • Joint Accuracy: % of stories answered correctly at a given ToM order and at all lower orders, capturing the genuinely recursive reasoning that higher-order ToM requires (a short computation sketch follows this list).
  • MoToMQA (2405.18870): Provides a handwritten suite testing orders 2–6, with both ToM and factual controls, and compares human and LLM performance.
  • Recent Benchmarks such as ToMBench (2402.15052) and EnigmaToM (2503.03340) span a range from first- to higher-order ToM and cover complex, naturalistic scenarios, sometimes with additional annotations for emotion, intention, or knowledge.
  • Adversarial and Programmatic Datasets: Datasets like ExploreToM (2412.12175) programmatically generate diverse scenarios using a domain-specific language and adversarial search (A*), systematically exposing weaknesses in LLM ToM reasoning, especially in higher-order or less templated scenarios.
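The distinction between standard and joint accuracy is easy to state in code. The sketch below assumes a simple per-story record of which orders were answered correctly; this layout is illustrative, not HI-TOM's actual data format.

```python
def accuracy_by_order(results, max_order=4):
    """Compute standard and joint accuracy for each ToM order.

    results: one dict per story, mapping order (0..max_order) to a bool
             indicating whether the question at that order was answered
             correctly.
    Standard accuracy at order k: fraction of stories whose order-k question
    is correct. Joint accuracy at order k: fraction of stories whose
    questions at order k AND every lower order are all correct.
    """
    n = len(results)
    standard, joint = {}, {}
    for k in range(max_order + 1):
        standard[k] = sum(story.get(k, False) for story in results) / n
        joint[k] = sum(
            all(story.get(j, False) for j in range(k + 1)) for story in results
        ) / n
    return standard, joint


# Toy example with two stories: joint accuracy can only fall as order grows.
stories = [
    {0: True, 1: True, 2: True, 3: False, 4: False},
    {0: True, 1: True, 2: False, 3: True, 4: False},
]
std, jnt = accuracy_by_order(stories)
print(std[3], jnt[3])  # 0.5 0.0
```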

3. Empirical Findings and Model Performance Patterns

3.1 Order-Dependent Decline and Failure Modes

A consistent empirical finding is the dramatic decline of LLM accuracy with increasing order of ToM:

  • In HI-TOM, standard accuracy drops from ~100% at first order to near 0% at fourth order even for the best LLMs, and joint accuracy is lower still because errors compound across orders (2310.16755).
  • Error analyses identify characteristic failure modes:
    • Insufficient Reasoning Depth: LLMs answer at a lower order than required, skipping necessary levels of belief embedding.
    • Temporal Ignorance and Hallucination: Models often confound event order or introduce fabricated details.
    • Commonsense and Causal Errors: Faulty reasoning about which agents had access to events or about the consequences of those events.

These errors point to LLMs’ difficulty in maintaining recursive mental state chains and inhibiting irrelevant world knowledge (“inhibitory control”), as confirmed by recent work on precursory inferences for ToM (2407.06004).

3.2 Model Size, Finetuning, and Scaling

Model size and finetuning have a strong but nonlinear effect on higher-order ToM:

  • GPT-4 and Flan-PaLM achieve or surpass adult human performance on second- through sixth-order ToM tasks in MoToMQA, with GPT-4 reaching 93% at order 6 (humans: 82%) (2405.18870).
  • Smaller or less finetuned models exhibit much weaker performance; instruction-following finetuning appears critical for generalized higher-order ToM capacity.
  • However, in adversarial or out-of-distribution settings (e.g., ExploreToM), even these models can score as low as 0–9% (2412.12175), indicating brittleness and the influence of dataset design.

4. Mechanisms and Emergent Properties

4.1 Neural and Computational Parallels

  • LLMs’ hidden states (“artificial neurons”) show emergent selectivity to belief types (true/false belief) in higher-order ToM tasks, reminiscent of dorsal medial prefrontal cortex (dmPFC) single-neuron selectivity in humans, although at lower rates (3.9% of embeddings in large LLMs vs. 23% in dmPFC) (2309.01660).
  • Logistic regression classifiers can decode belief states from the LLM’s internal representations with 75–81% accuracy for large models, indicating distributed, population-level coding of ToM-relevant features (a minimal probing sketch follows).
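As a rough, self-contained illustration of this kind of population-level decoding, the sketch below fits a cross-validated logistic-regression probe on hidden-state vectors labeled by belief type. The random arrays are stand-ins for real model activations and labels, which are model- and dataset-specific and not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: one hidden-state vector per trial (e.g. the embedding at
# the final token of each story) plus a binary true-/false-belief label.
# Replace these with activations extracted from the model under study.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 768))   # stand-in for LLM activations
belief_labels = rng.integers(0, 2, size=200)  # 1 = false-belief trial

# A linear probe: if belief type is linearly decodable from the activations,
# cross-validated accuracy rises well above the 0.5 chance level.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, hidden_states, belief_labels, cv=5)
print(f"decoding accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```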

4.2 Prompting and Simulation Frameworks

Basic prompting and chain-of-thought (CoT) are often insufficient for higher-order ToM (2311.10227, 2310.16755). Substantial recent improvements come from the following approaches (a simplified sketch of their shared filter-then-answer pattern appears after the list):

  • Perspective-Taking Frameworks (SimToM) (2311.10227): Two-stage prompting first filters the story context down to what a given agent knows, then answers the ToM query against that filtered view. This approach outperforms CoT, and with oracle (human-provided) perspective-taking it unlocks near-perfect ToM reasoning in models that have not already saturated the benchmark.
  • Temporal Space and Belief Chains (TimeToM) (2407.01455): Constructs a temporalized narrative space and character-specific belief chains (TBSC), dividing self-world and social-world beliefs to systematically answer first- and higher-order ToM queries, often transforming higher-order ToM questions into tractable first-order queries via reasoning over shared perceptible time periods.
  • Iterative Masking with Neuro-symbolic Knowledge Bases (EnigmaToM) (2503.03340): Integrates a knowledge base of entity states and an event masking mechanism to recursively filter world state for each character in a belief chain, enabling efficient high-order ToM reasoning and yielding substantial gains on Hi-ToM and FANToM benchmarks.
  • Decomposition and Simulation Theory (Decompose-ToM) (2501.09056): Decomposes complex ToM tasks into modular steps—agent identification, question reframing, world state tracking, and knowledge access—recursively simulating each agent’s beliefs. This approach drastically boosts accuracy at high ToM orders compared to direct or chain-of-thought prompting.
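Despite their differences, these frameworks share a filter-then-answer skeleton: restrict the narrative to what the relevant agent (or chain of agents) could have perceived, then answer against that restricted view. The sketch below is a heavily simplified, hypothetical rendering of that pattern in the spirit of SimToM; call_llm stands in for whatever chat-completion client is used, and the prompts are illustrative rather than any paper's exact templates.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an actual chat-completion call (API client, local model, ...)."""
    raise NotImplementedError


def perspective_filter(story: str, agent: str) -> str:
    """Stage 1: keep only the events the given agent could have witnessed."""
    prompt = (
        f"Here is a story:\n{story}\n\n"
        f"List, in order, only the events that {agent} directly observed or "
        f"was told about. Omit everything {agent} could not have witnessed."
    )
    return call_llm(prompt)


def answer_tom_question(story: str, question: str, agent_chain: list[str]) -> str:
    """Stage 2: answer the belief question from the filtered perspective.

    agent_chain lists attributors outermost first, e.g. ["Anne", "Bob"] for
    "Where does Anne think that Bob believes ...". Filtering repeatedly along
    the chain is a crude approximation of the recursive restriction used by
    EnigmaToM and Decompose-ToM, not either paper's exact algorithm.
    """
    view = story
    for agent in agent_chain:
        view = perspective_filter(view, agent)
    prompt = (
        f"Based only on the following events:\n{view}\n\n"
        f"Answer the question: {question}"
    )
    return call_llm(prompt)
```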

5. Broader Implications, Limitations, and Future Directions

5.1 Impact for Social Cognition and User-Facing AI

  • Effective higher-order ToM in LLMs is necessary for naturalistic multi-agent collaboration, negotiation, moral judgment, and safe alignment with human values (2405.08154, 2310.10701).
  • Human-level or superior higher-order ToM opens opportunities for enhanced social reasoning in dialog agents, but also risks (e.g., manipulation, over-persuasion) (2405.18870).

5.2 Limitations and Pathologies

  • Despite progress, current LLMs remain brittle to adversarial task perturbation, “automatic state” surprises, and tasks requiring knowledge inhibition or integrating long, global context (2410.06271, 2501.01705).
  • Most benchmarks, even those measuring multi-order ToM, rely on short, templated narratives, potentially inflating LLM performance due to memorization, superficial heuristics, or data contamination (2402.15052, 2407.06004).
  • Program-guided adversarial datasets expose poor generalization and shallow reasoning even in state-of-the-art LLMs, especially as order or scenario complexity grows (2412.12175).

5.3 Directions for Advancement

  • Benchmark Design: There is consensus that current benchmarks insufficiently drive genuinely human-like higher-order ToM. Recommendations include expanding to richer, user-centered, contextually embedded, and interactional scenarios (2504.10839, 2501.01705).
  • Mechanism Development: Robust, plug-and-play neuro-symbolic modules, advanced perspective/temporal filtering, and recursive decomposition (as in EnigmaToM, TimeToM, Decompose-ToM) provide promising patterns for production-grade ToM reasoning in LLMs.
  • Interpretability: Mechanistic and embedding-probe studies (e.g., belief-selective neurons, population-level decoding) are critical for linking LLM ToM to biological social cognition and ensuring safe, interpretable AI (2309.01660).
  • Alignment and Ethics: Higher-order ToM amplifies both opportunity and risk in social alignment, requiring robust monitoring for manipulation, excessive anthropomorphism, and unintended consequences in human-AI interaction (2405.08154).

6. Summary Table: Advances and Challenges in Higher-Order ToM for LLMs

Mechanism / Finding | Observed Impact
Model scaling & finetuning | Essential for strong, generalized higher-order ToM (2405.18870)
Temporal and masking modules | Enable robust, order-scalable ToM reasoning (2407.01455, 2503.03340)
Perspective-taking prompts | Drastically improve false belief & higher-order task performance (2311.10227)
Adversarial benchmarks | Reveal persistent brittleness and shortcut use, even in leading LLMs (2412.12175)
Multimodal, user-centered eval | Needed for ecological validity and improved human-AI alignment (2504.10839)

7. Concluding Remarks

Current research demonstrates that LLMs can, under certain conditions and with sufficient scale and proper algorithmic support, achieve human-level or even super-human performance on higher-order ToM benchmarks. However, these abilities often rely on specific prompt structures, are challenged by adversarial or diverse scenarios, and may not reflect the true depth or generalization of social reasoning required for robust, real-world deployment. The field is now moving toward richer, contextually grounded, and mechanistically interpretable ToM in LLMs, with an emphasis on both capability and safe alignment.
