OpenToM: Neural Theory-of-Mind Benchmark

  • OpenToM is a benchmark framework for evaluating neural Theory-of-Mind in LLMs using naturalistic narratives that model both physical and psychological states.
  • It employs human-refined stories featuring explicit personality traits and causal consistency to challenge traditional belief-tracking and intention modeling.
  • The framework uses rigorous metrics, including macro-averaged F1 scores and accuracy, to assess in-distribution performance and generalization gaps.

OpenToM is a comprehensive benchmark framework for evaluating the neural Theory-of-Mind (N-ToM) capabilities of LLMs, with an explicit focus on modeling both the physical and psychological mental states of agents within naturalistic, intention-driven narratives. It is designed to address several deficiencies in previous N-ToM benchmarks, providing a publicly available, systematically annotated, and diverse suite of narrative scenarios that challenge LLMs well beyond traditional belief-tracking (Xu et al., 2024, Sarangi et al., 21 Jul 2025).

1. Motivations and Theoretical Rationale

Existing N-ToM benchmarks—such as ToMi, T4D, Hi-ToM, Adv-CSFB, ExploreToM, and FANToM—have operated largely with template-generated, artificially structured stories and have focused predominantly on physical state belief-tracking (e.g., who knows what about an object's location). These datasets lack rich characterization, intrinsic motivation for actions, and comprehensive treatment of psychological state reasoning. OpenToM was created to challenge these limitations by integrating longer, “natural” LLM-generated stories, explicit personality traits, intentional action selection, and an expanded range of questions, especially those targeting psychological-world attributions (e.g., affective stance, preference-influence) (Xu et al., 2024, Sarangi et al., 21 Jul 2025).

2. Benchmark Structure and Content

OpenToM consists of 696 standard and 100 extra-long narratives, each constructed and human-refined to maintain narrative clarity and causal consistency. Each story introduces two protagonists—a “mover” (the actor) and an “observer”—with explicit personality traits (considerate, inconsiderate, negativistic), individual preferences, and beliefs about one another's preferences. Actions are selected via a two-step human-in-the-loop process to ensure that every movement is anchored in a psychological intention.

Each story is probed with a suite of 23 questions, spanning both first-order and second-order ToM, and encompassing both physical- and psychological-state reasoning:

  • Physical-World Tasks:
    • Location-coarse (Loc₍c₎): “Does character C believe the entity is still in its initial location?”
    • Location-fine (Loc₍f₎): “Where does character C believe the entity is?”
    • Multi-hop reasoning (MHop): e.g., an agent's inference about container fullness or accessibility, requiring at least two inference steps across the narrative.
  • Psychological-World Tasks:
    • Attitude (Att): Observer’s evaluative attitude (positive/neutral/negative) towards the mover's action, given only observable preference and event evidence—the intention itself is hidden.

Questions are balanced in label distribution to avoid bias toward “majority-class” answers. Each narrative, by construction, ends immediately after the key event to guarantee a closed-world context for unambiguous ToM inference (Xu et al., 2024, Sarangi et al., 21 Jul 2025).
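
For concreteness, the sketch below shows one plausible way to represent a single OpenToM instance and its question types in Python; the field names and example wordings are illustrative assumptions and do not reproduce the released dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one OpenToM instance; field names are illustrative
# and do not reproduce the released dataset's actual JSON layout.
@dataclass
class Character:
    name: str
    personality: str          # "considerate", "inconsiderate", or "negativistic"
    preference: str           # the character's own preference about the entity
    belief_about_other: str   # first-order belief about the other's preference

@dataclass
class OpenToMInstance:
    narrative: str            # human-refined, intention-driven story text
    mover: Character          # the protagonist who moves the entity
    observer: Character       # the protagonist who may or may not witness it
    entity: str               # the object that gets moved
    questions: list = field(default_factory=list)

# One illustrative question per task type; answers are binary or ternary.
example_questions = [
    {"type": "Loc_coarse", "order": "first",
     "text": "Does the observer believe the entity is still in its initial location?",
     "labels": ["yes", "no"]},
    {"type": "Loc_fine", "order": "second",
     "text": "Where does the mover think the observer believes the entity is?",
     "labels": ["initial location", "new location"]},
    {"type": "MHop", "order": "first",
     "text": "Is the original container now more full, less full, or equally full?",
     "labels": ["more full", "less full", "equally full"]},
    {"type": "Att", "order": "first",
     "text": "What is the observer's attitude toward the mover's action?",
     "labels": ["positive", "neutral", "negative"]},
]
```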

3. Evaluation Protocols and Metrics

OpenToM formalizes every question as a binary or ternary classification problem. The primary evaluation metric is the macro-averaged F1 score:

\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C} \sum_{i=1}^{C} \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}

where C is the number of classes (2 or 3). Accuracy is also computed as

\text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}}

and is used for human agreement assessment and baseline reporting. Model inference is formulated using character-centric and full-narrative contexts, with explicit or latent action and preference cueing.
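
The following minimal Python sketch makes the two metrics concrete; it is a direct implementation of the formulas above (scikit-learn's `f1_score` with `average='macro'` computes the same quantity), not the benchmark's official evaluation code.

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1 over the answer classes (C = 2 or 3 in OpenToM)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
    return sum(f1s) / len(classes)

def accuracy(y_true, y_pred):
    """Fraction of questions answered correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Example with a ternary attitude question (hypothetical labels).
gold = ["positive", "neutral", "negative", "negative", "positive"]
pred = ["positive", "negative", "negative", "neutral", "positive"]
print(macro_f1(gold, pred, classes=["positive", "neutral", "negative"]))
print(accuracy(gold, pred))
```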

Results are reported both overall and per-question-type, with further breakdown by narrative length, ToM order (first/second), and role (mover/observer). Error analyses include contradiction metrics (e.g., the “unfaithful rate” between Loc₍c₎ and Loc₍f₎ predictions) (Xu et al., 2024).
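
As an illustration of the contradiction check, the sketch below treats a (Loc₍c₎, Loc₍f₎) prediction pair as unfaithful when the two answers are mutually inconsistent; this is one plausible reading of the metric, not the paper's exact implementation.

```python
def unfaithful_rate(pairs, initial_location):
    """Fraction of (Loc_coarse, Loc_fine) prediction pairs that contradict each
    other, e.g. answering 'yes, still in the initial location' on the coarse
    question while naming a different container on the fine-grained one.
    Illustrative definition only, not the paper's released code."""
    def contradicts(coarse, fine):
        if coarse == "yes":                  # model says the entity has not moved
            return fine != initial_location
        return fine == initial_location      # model says it moved, yet names the start

    return sum(contradicts(c, f) for c, f in pairs) / len(pairs)

# Hypothetical predictions for four questions about an entity starting in a drawer.
preds = [("yes", "drawer"), ("yes", "basket"), ("no", "basket"), ("no", "drawer")]
print(unfaithful_rate(preds, initial_location="drawer"))  # 0.5
```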

4. Empirical Findings and Generalization Characteristics

Human inter-annotator agreement on OpenToM is consistently high (F1 ≥ 0.85). State-of-the-art zero-shot LLMs (e.g., GPT-4-Turbo) achieve moderate-to-high performance on physical-world belief-tracking, such as first-order Loc₍c₎ (F1 = 0.643) and first-order MHop (F1 = 0.658), but perform poorly on psychological questions (Att: 0.544) and on fine-grained, second-order location inference (Loc₍f₎ SO: 0.269). Fine-tuning (e.g., LoRA on Llama2-Chat-13B) can push physical-world F1 scores above 0.9 but does not close the gap on psychological-state tasks (Att ≈ 0.547).

Role-based breakdown reveals LLMs are more prone to error when computing second-order beliefs about the mover compared to the observer, due to increased inference-chain length and latent belief dependencies. Longer narratives degrade performance on Loc₍f₎ and Att, implying that context-window limitations or narrative structure complexity impede deep reasoning.

Chain-of-Thought prompting dramatically improves certain tasks (up to +30 F1 on Loc₍c₎ and MHop) but offers little benefit for fine-grained location and attitude prediction, and can even degrade them. Explicit perspective-taking (SimToM prompting) is similarly effective only for specific question subtypes (Xu et al., 2024).
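
To illustrate the difference between the two prompting strategies, the sketch below gives generic templates for Chain-of-Thought and SimToM-style perspective-taking; the wording and the example question are hypothetical and do not reproduce the prompts used in the papers.

```python
# Illustrative prompt templates only; the wording used in the original papers differs.
narrative = "..."   # an OpenToM story
question = "Where does the observer believe the rubber duck is?"  # hypothetical question
character = "the observer"

# Chain-of-Thought: elicit intermediate reasoning before the final answer.
cot_prompt = (
    f"{narrative}\n\nQuestion: {question}\n"
    "Think step by step about what each character saw and believes, then answer."
)

# SimToM-style explicit perspective-taking: first filter the narrative down to
# events the target character perceived, then ask the question on that version.
simtom_filter_prompt = (
    f"{narrative}\n\nRetell only the events that {character} directly "
    "witnessed or was told about."
)
simtom_answer_prompt_template = "{filtered_narrative}\n\nQuestion: " + question
```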

A distinctive feature of OpenToM is its use as an out-of-distribution evaluation set. RLVR-trained small LLMs (e.g., Qwen2.5-7B-Instruct) display strong in-distribution benchmark mastery but generalize poorly, clustering at 57–62% accuracy on OpenToM, with generalization gaps \Delta_{\mathrm{gen}} exceeding 20% in all settings. This is interpreted as narrow overfitting: accuracy rises steadily in-distribution but stagnates or declines out-of-distribution, indicating that models are “hacking” superficial statistical patterns rather than acquiring robust, transferable ToM reasoning (Sarangi et al., 21 Jul 2025).

| Training Regime | Accuracy (In-Distribution) | Accuracy (OpenToM) | Generalization Gap |
|---|---|---|---|
| HiToM only (Hi) | 82.9% | 59.9% | 23.0% |
| FANToM only (Fan) | 91.5% | 59.9% | 31.6% |
| ExploreToM only (Exp) | 85.1% | 60.0% | 25.1% |
| Best combination (Hi×Fan) | ~87.2% | 61.8% | ~26% |
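
The generalization gap in the table is simply in-distribution accuracy minus OpenToM accuracy; the short sketch below reproduces the reported (rounded) values.

```python
def generalization_gap(acc_in, acc_ood):
    """Delta_gen: in-distribution accuracy minus out-of-distribution (OpenToM) accuracy."""
    return acc_in - acc_ood

# (in-distribution accuracy, OpenToM accuracy) per training regime, from the table above.
regimes = {
    "HiToM only":      (0.829, 0.599),
    "FANToM only":     (0.915, 0.599),
    "ExploreToM only": (0.851, 0.600),
}
for name, (acc_in, acc_ood) in regimes.items():
    print(f"{name}: gap = {generalization_gap(acc_in, acc_ood):.1%}")
# HiToM only: gap = 23.0%, FANToM only: gap = 31.6%, ExploreToM only: gap = 25.1%
```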

Generalization failures are further dissected via: (1) divergence of in- vs. OOD learning curves, (2) inversion of difficulty order on HiToM, (3) brittleness to output format, and (4) negative transfer between dataset domains (Sarangi et al., 21 Jul 2025).

5. Distinctions from Prior Theory-of-Mind Benchmarks

OpenToM substantially diverges from prior N-ToM datasets in several aspects:

  • Narrative Authenticity: Unlike template-generated stories in HiToM and binary Q&A in FANToM, OpenToM leverages LLM-driven, multi-paragraph "natural" stories, resulting in greater linguistic and causal diversity.
  • Personality and Intentionality: Each protagonist is strongly personified, and every pivotal action is causally linked to explicit intention, providing an anchoring basis for psychological-state reasoning.
  • Task Formulation: Task coverage extends beyond simple false-belief tracking to include multi-hop accessibility, fullness, and sophisticated attitude assessment—tasks ill-posed in prior benchmarks.
  • Evaluation Design: Balanced label distributions, narrative closure, and extensive annotation ensure low ambiguity and minimize dataset artifacts.
  • Out-of-Distribution Challenge: OpenToM is explicitly held out for OOD testing, exposing shortcut learning and "statistical hacking" that are otherwise hidden on in-domain evaluations (Xu et al., 2024, Sarangi et al., 21 Jul 2025).

6. Implications, Challenges, and Future Directions

Analysis using OpenToM indicates that LLMs closely approach human-level Theory-of-Mind on physical tasks with sufficient fine-tuning or prompting, but psychological-state tasks (attitude, preference-tracking) remain an open frontier—current SOTA models are consistently 30–40 F1 points below human agreement.

The documented failure of RLVR-trained small LLMs to generalize on OpenToM highlights structural limitations: narrow overfitting to dataset idiosyncrasies, reward mechanisms that privilege output correctness over process faithfulness, and susceptibility to non-transferable policies. The authors advocate several directions:

  • Development of curriculum-spanning, diverse ToM training datasets to reduce shortcut exploitation.
  • Adoption of reward systems that explicitly target reasoning trajectories (e.g., chain-of-thought fidelity) rather than just end-state accuracy.
  • Incorporation of adversarial or held-out splits within training regimes to regularize against overfitting.
  • Integration of symbolic belief-tracking or human-in-the-loop evaluation to robustly judge reasoning fidelity.
  • Methodological enhancements, such as neuro-symbolic consistency enforcement (e.g., Program-aided Language), role-aware reasoning frameworks, and character-centric world state modeling (Xu et al., 2024, Sarangi et al., 21 Jul 2025).

OpenToM thus provides a testbed for future socially-intelligent AI systems by foregrounding where present-day models succeed (physical belief-tracking under rich narrative shifts) and where they remain challenged (robust, generalized psychological reasoning).
