LLM ToM Reasoning Test Set

Updated 30 October 2025
  • The LLM-ToM-Reasoning test set is a family of benchmarks designed to evaluate explicit Theory of Mind (ToM) and higher-order reasoning capabilities in LLMs.
  • It employs controlled problem generation, diverse scenarios, and multifaceted questioning to assess belief attribution, recursive reasoning, and adaptability.
  • Empirical findings reveal that while LLMs excel at explicit ToM tasks, they struggle with dynamic application, recursive inference, and overcoming cognitive rigidity.

The LLM-ToM-Reasoning test set paradigm encompasses a collection of methodologies, datasets, and evaluation protocols developed to rigorously assess the Theory of Mind (ToM) and higher-order reasoning capacities of LLMs. These test sets isolate specific cognitive constructs such as belief attribution, recursive mental state reasoning, adaptability to context, resistance to cognitive rigidity (mental set), and the flexible application of inferred knowledge in both synthetic and ecologically valid scenarios. The surge in such benchmarks is driven by the need to move beyond mere accuracy measures and to scrutinize human-like social cognition, inference, and the ability to overcome entrenched reasoning patterns in LLMs.

1. Historical Development and Motivation

Classic ToM tests (e.g., Sally-Anne, Smarties) were designed for human developmental psychology and adapted early for NLP models, often focusing on simple false-belief inference. However, recent research has identified several limitations in this approach: most work measures only direct (explicit) ToM—whether the model can attribute knowledge/ignorance—but neglects applied ToM (the downstream use of such knowledge), higher-order recursion (reasoning about others’ beliefs about beliefs), and dynamic adaptability (altering reasoning strategies in response to contextual cues or cognitive traps) (Gu et al., 17 Oct 2024, He et al., 2023, Haq et al., 21 Jan 2025).

Test sets are consequently being redesigned to demand a spectrum of ToM competencies:

  • Explicit inference (who knows/sees what at a given moment)
  • Dynamic, context-sensitive application (predicting actions, explaining rationality given beliefs)
  • High-order recursive reasoning (nested beliefs in multi-agent settings)
  • Flexibility and resistance to cognitive bias or mental set

2. Core Methodological Principles

LLM-ToM-Reasoning test sets leverage rigorous, often formal, scenario generation paradigms to ensure that success requires structured inference rather than shallow pattern-matching. Central design elements include:

  • Controlled Problem Generation via Logic Formalisms: Using frameworks such as dynamic epistemic logic (DEL), tasks are constructed to allow granular control over the facts, observability, and public/private announcements influencing each agent’s knowledge state (Sileo et al., 2023). This prevents model success via data artifacts and enables fine-tuning of problem complexity and the order of reasoning required (the first sketch after this list illustrates the principle).
  • Scenario Diversity and Information Asymmetry: Test sets sample from a diverse range of real-world contexts—deception, manipulation, occluded knowledge, and social games—to assess robustness (Gu et al., 17 Oct 2024, Shinoda et al., 15 Jan 2025, Agashe et al., 2023).
  • Multifaceted Questioning: Each story or scenario often yields multiple question types, targeting different reasoning depths: explicit mental state, action prediction, rationality judgement, intention, and desire inference (sometimes in first- and higher-order variants).
  • Prompt and Reasoning Complexity Modulation: Benchmarks manipulate priming and ordering effects (e.g., presenting complex problems before shortcut problems to induce a mental set), chain-of-thought scaffolding, and the requirement for stepwise explanations to probe both performance and procedural adaptability (Haq et al., 21 Jan 2025, Xiong et al., 20 May 2025); the second sketch after this list shows such an ordering manipulation.
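To make the controlled-generation principle concrete, here is a minimal Python sketch (hypothetical, not the MindGames implementation) that derives ground-truth beliefs mechanically from event observability, in the spirit of DEL-based generation:

```python
import random
from dataclasses import dataclass

AGENTS = ["Sally", "Anne"]
LOCATIONS = ["basket", "box", "drawer"]

@dataclass
class Event:
    description: str
    location: str         # where the marble is after this event
    observers: frozenset  # agents who witness this event

def generate_item(rng: random.Random) -> dict:
    """Generate one Sally-Anne-style false-belief item.

    The true location and each agent's belief are derived mechanically
    from event observability, so the gold answer cannot leak from
    surface templates. Varying the observer sets varies the order of
    ToM reasoning the item requires.
    """
    start = rng.choice(LOCATIONS)
    hidden = rng.choice([loc for loc in LOCATIONS if loc != start])
    events = [
        Event(f"The marble is placed in the {start}.", start, frozenset(AGENTS)),
        Event("Sally leaves the room.", start, frozenset(AGENTS)),
        # Private event: only Anne observes the move.
        Event(f"Anne moves the marble to the {hidden}.", hidden, frozenset({"Anne"})),
    ]

    def belief(agent: str) -> str:
        # An agent's belief is fixed by the last event they observed.
        return [e for e in events if agent in e.observers][-1].location

    return {
        "story": " ".join(e.description for e in events),
        "question": "Where will Sally look for the marble?",
        "answer": belief("Sally"),   # gold: the original location
        "reality": hidden,           # distractor: the true location
    }

item = generate_item(random.Random(0))
print(item["story"])
print(item["question"], "->", item["answer"])
```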
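The ordering manipulation used to induce a mental set can be sketched just as briefly. The water-jar items below follow the classic Luchins task format; the specific numbers and the `battery` helper are illustrative assumptions, not the cited benchmark's items:

```python
# Each problem: measure `target` units with jars of capacities (A, B, C).
# All COMPLEX items require the long formula B - A - 2C; the SHORTCUT
# item is also solvable by the simpler A - C.
COMPLEX = [
    {"jars": (21, 127, 3), "target": 100},   # 127 - 21 - 2*3
    {"jars": (14, 163, 25), "target": 99},   # 163 - 14 - 2*25
    {"jars": (18, 43, 10), "target": 5},     # 43 - 18 - 2*10
]
SHORTCUT = {"jars": (23, 49, 3), "target": 20}  # 23 - 3 also works

def battery(induce_set: bool) -> list:
    """Complex-before-shortcut induces a mental set; the reverse
    ordering serves as the control condition. A rigid model keeps
    applying B - A - 2C to the shortcut item even when A - C suffices."""
    return COMPLEX + [SHORTCUT] if induce_set else [SHORTCUT] + COMPLEX
```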

3. Benchmark Structures and Taxonomies

LLM-ToM-Reasoning test sets cover a range of reasoning regimes and complexity classes:

| Benchmark/Test Set | Reasoning Type(s) Probed | Distinct Features |
|---|---|---|
| MindGames | DEL-based, formal, 1st/2nd order | Symbolic logic, ground truth via model checker, knowledge/belief updates (Sileo et al., 2023) |
| HI-TOM | Orders 1–4, deception | Multi-agent, multi-chapter stories; recursive belief reasoning; joint accuracy analysis (He et al., 2023) |
| SimpleToM | Explicit/applied ToM | Explicit (mental state), behavior, and rationality questions; scenario diversity (Gu et al., 17 Oct 2024) |
| ToMATO | Multi-state, false-belief | LLM-LLM roleplay, inner-speech annotation, personality variation, first- and second-order false beliefs (Shinoda et al., 15 Jan 2025) |
| PersuasiveToM | BDI in dialogue | Persuasion tasks, evolution of intentions and beliefs, application to dialogue action (Yu et al., 28 Feb 2025) |
| LLM-Coordination | Social multi-agent | Interactive games, joint planning, coordination QA (environmental and ToM) (Agashe et al., 2023) |
| “Pick the Right Stuff” | Belief history handling | Zero/finite/infinite historical reasoning taxonomy (Tang et al., 7 Jun 2024) |
| ToM-LM, Decompose-ToM | Symbolic, decomposed, or simulated ToM | External DEL reasoning, recursive simulation, algorithmic decomposition (Tang et al., 23 Apr 2024; Sarangi et al., 15 Jan 2025) |

This diversity of structure allows systematic probing of both shallow and deep forms of ToM, as well as dynamic, role-sensitive social cognition.

4. Key Empirical Findings Across Benchmarks

Several consistent trends emerge across recent LLM-ToM-Reasoning test set studies:

  • Explicit ToM is often “solved” by large models; applied ToM is not: On datasets like SimpleToM, LLMs (GPT-4o, Claude-3.5, Llama-3.1) achieve near-ceiling performance in identifying what a character knows. However, their ability to predict downstream actions or rationalize behaviors given those beliefs is much lower, typically near chance unless aided by explicit reminders or chain-of-thought patching (Gu et al., 17 Oct 2024). The tiered question format behind this contrast is illustrated in the sketch after this list.
  • Recursive and Higher-Order ToM Remains a Bottleneck: On benchmarks such as HI-TOM, accuracy on first-order ToM can approach 60%, but joint accuracy on 4th-order nested beliefs is near zero for all SOTA models. The majority of errors derive from insufficient recursion depth, temporal confusion, and hallucinated or inconsistent reasoning (He et al., 2023).
  • Cognitive Rigidity (“Mental Set”) Limits Adaptability: LLMs, including GPT-4o and Llama-3.1, persistently apply overelaborate, multi-step strategies when exposed first to complex versions, showing reluctance to switch to more efficient solutions on subsequent shortcut problems. Chain-of-thought prompting increases answer accuracy but also entrenches longer, less flexible reasoning, amplifying the mental set effect (Haq et al., 21 Jan 2025).
  • Interventions Can Boost but Not Generalize ToM Application: Prompt-engineered reminders, tailored chain-of-thought scaffolds, or explicit simulation-based decomposition can dramatically raise accuracy (sometimes matching or exceeding 90% on applied ToM tasks in-domain), but these benefits are brittle—often failing to transfer to out-of-domain tasks and requiring heavy, scenario-specific intervention (Gu et al., 17 Oct 2024, Sarangi et al., 15 Jan 2025).
  • Human-to-LLM ToM Gaps Are Persistent, Especially for False Beliefs and Realistic Social Scenarios: On ToMATO, even GPT-4o mini scores more than 10 points below human levels, especially for reasoning about others’ false beliefs, with systematic deficits in handling certain personality traits and robustly tracking unobservable mental states (Shinoda et al., 15 Jan 2025).
  • Model Size Does Not Guarantee ToM Reasoning Strength: Benchmarks such as “Pick the Right Stuff” demonstrate that smaller models (e.g., Gemma-7B, Mistral-7B) can outperform much larger models (e.g., GPT-3.5-turbo) on belief history tasks, suggesting that parameter count is not a direct proxy for ToM proficiency (Tang et al., 7 Jun 2024).
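The explicit-versus-applied gap is easiest to see in the tiered question format that SimpleToM-style benchmarks use. The story, questions, and scoring helper below are invented for illustration and are not items from the released dataset:

```python
# Illustrative three-tier probe in the style of SimpleToM; the story and
# questions are invented here, not taken from the released dataset.
story = (
    "Mary buys a sealed box of chips at the store. Unknown to her, "
    "the chips inside are moldy."
)

questions = {
    "explicit":    "Does Mary know the chips are moldy? (yes/no)",
    "behavior":    "Will Mary eat the chips when she gets home? (yes/no)",
    "rationality": "If Mary eats the chips, is that reasonable given "
                   "what she knows? (yes/no)",
}
gold = {"explicit": "no", "behavior": "yes", "rationality": "yes"}

def score(model_answers: dict) -> dict:
    """Per-tier correctness. Reported pattern: near-ceiling on the
    explicit tier, far lower on the applied (behavior/rationality)
    tiers unless the model is reminded of the mental state."""
    return {tier: int(model_answers.get(tier, "").strip().lower() == ans)
            for tier, ans in gold.items()}

# A model that attributes ignorance correctly but then answers with
# omniscient knowledge scores 1 / 0 / 0: exactly the failure mode
# this question design is meant to expose.
print(score({"explicit": "no", "behavior": "no", "rationality": "no"}))
```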

5. Evaluation Metrics and Analysis Techniques

LLM-ToM-Reasoning test sets rely on more nuanced metrics than task accuracy alone. Common measures include:

  • Exact Match (EM): Proportion of exactly correct final answers, $\mathrm{EM}_M = \frac{\text{Correct}}{\text{Total}}$ for model $M$.
  • Joint Accuracy: Proportion of test cases where the model correctly answers all required nested or sequential ToM questions (both metrics are computed in the sketch following this list).
  • Chain/Step Count: Average number of reasoning steps used in correct answers; sensitive to procedural efficiency and entrenchment.
  • Behavioral/Applied Metrics: Accuracy in predicting actions or judgements about rationality (as in SimpleToM), behavioral strategy selection (as in PersuasiveToM), or joint planning (as in LLM-Coordination).
  • Robustness to Personality/Scenario Variation: Performance stratified by personality dimension, scenario type, or domain.
  • Error Taxonomies: Manual and semi-automated codes for error types—insufficient recursion, hallucination, causal confusion, position/attention-related mistakes (see HI-TOM, ToMATO).
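As a concrete reference, EM and joint accuracy can be computed from per-question records as in the following sketch (the record schema is a hypothetical choice):

```python
from collections import defaultdict

def exact_match(results: list) -> float:
    """EM: proportion of individual questions answered exactly correctly."""
    return sum(r["correct"] for r in results) / len(results)

def joint_accuracy(results: list) -> float:
    """Proportion of test cases whose every nested/sequential ToM
    question is correct; the metric under which higher-order
    performance collapses on benchmarks like HI-TOM."""
    by_case = defaultdict(list)
    for r in results:
        by_case[r["case_id"]].append(r["correct"])
    return sum(all(v) for v in by_case.values()) / len(by_case)

# Toy run: two cases, each with a 1st- and a 2nd-order question.
results = [
    {"case_id": "a", "order": 1, "correct": True},
    {"case_id": "a", "order": 2, "correct": False},
    {"case_id": "b", "order": 1, "correct": True},
    {"case_id": "b", "order": 2, "correct": True},
]
print(exact_match(results))     # 0.75
print(joint_accuracy(results))  # 0.5: case "a" fails the joint criterion
```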

Advanced diagnostic techniques include reasoning graph topology analysis; for example, mapping chain-of-thought outputs to semantic graphs allows quantification of branching, convergence, and overall structure, which are shown to strongly correlate with answer correctness and flexibility (Xiong et al., 20 May 2025).
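A minimal version of such a topology analysis might look like the following, assuming reasoning steps and premise-to-conclusion edges have already been extracted from the chain-of-thought (that extraction step, typically done with an LLM judge or a parser, is assumed rather than shown; this is a simplification of the cited method):

```python
import networkx as nx

def reasoning_graph(steps, edges):
    """Directed graph over chain-of-thought steps: an edge (u, v) means
    step u serves as a premise for step v."""
    g = nx.DiGraph()
    g.add_nodes_from(steps)
    g.add_edges_from(edges)
    return g

def topology_features(g):
    """Branching/convergence statistics of a reasoning trace; in the
    cited work, richer graph structure correlates with correctness."""
    is_dag = nx.is_directed_acyclic_graph(g)
    return {
        "n_steps": g.number_of_nodes(),
        "branching": sum(d > 1 for _, d in g.out_degree()),   # steps spawning >1 inference
        "convergence": sum(d > 1 for _, d in g.in_degree()),  # steps merging >1 premise
        "is_dag": is_dag,
        "longest_chain": nx.dag_longest_path_length(g) if is_dag else None,
    }

# Toy trace: s0 branches into s1 and s2, which converge on the answer s3.
g = reasoning_graph(
    ["s0", "s1", "s2", "s3"],
    [("s0", "s1"), ("s0", "s2"), ("s1", "s3"), ("s2", "s3")],
)
print(topology_features(g))
# {'n_steps': 4, 'branching': 1, 'convergence': 1, 'is_dag': True, 'longest_chain': 2}
```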

6. Implications for LLM Research and ToM Benchmark Evolution

Empirical results from these test sets reveal that superficial or template-based ToM benchmarks dramatically overestimate LLM ToM capacities. To advance both measurement and model capabilities:

  • Next-generation benchmarks must integrate adaptability, procedural efficiency, scenario diversity, hierarchical mentalizing (order of beliefs), and resistance to cognitive rigidity into their core metrics (Haq et al., 21 Jan 2025, Wagner et al., 18 Dec 2024).
  • Evaluations should explicitly distinguish between explicit state inference, applied behavioral reasoning, and the meta-decision of when/how to invoke ToM (Wagner et al., 18 Dec 2024).
  • Simulation and decomposition-based inference procedures, symbolic delegation (e.g., to external model checkers), and scenario order effects must be incorporated both for benchmarking and model development (Sarangi et al., 15 Jan 2025, Tang et al., 23 Apr 2024).
  • Practical impact: These advances are crucial for building AI systems that safely and efficiently interact in social, collaborative, and uncertain environments—requiring not only correct knowledge attribution but also the flexible, context-aware use of that knowledge.

A plausible implication is that as LLMs are deployed in decision-critical domains, the ability to detect, quantify, and correct for cognitive rigidity and shallow imitation will become a key differentiator in AI safety and trustworthiness. LLM-ToM-Reasoning test sets are thus positioned as foundational resources, not just for ToM benchmarking but for principled advances in socially intelligent machine reasoning.
