OpenToM Benchmark: Neural ToM in LLMs
- OpenToM Benchmark is a large-scale testbed that assesses LLMs' ability to infer and reason about mental states through detailed, multi-paragraph narratives.
- It uses GPT-3.5-Turbo and rule-based templates to generate actions driven by explicit personality traits and preferences, providing both physical and psychological reasoning challenges.
- Evaluation results show that models such as GPT-4-Turbo handle coarse location reasoning and multi-hop deduction relatively well, while still struggling with fine-grained location tracking and accurate attitude inference.
OpenToM is a large-scale, human-in-the-loop benchmark developed to evaluate the Neural Theory-of-Mind (N-ToM) capabilities of LLMs. N-ToM refers to a machine's ability to infer, track, and reason over the mental states of others—an essential prerequisite for socially intelligent artificial agents. OpenToM addresses key deficiencies present in prior benchmarks by providing naturalistic, multi-paragraph stories featuring explicitly personified agents, intentions linked to personality and preferences, and a comprehensive taxonomy of physical and psychological theory-of-mind (ToM) questions, resulting in a testbed for both first-order and higher-order perspective reasoning challenges (Xu et al., 2024).
1. Motivation and Design Principles
Prior machine ToM benchmarks, such as ToMi, Hi-ToM, and T4D, exhibit several limitations: highly templated or ambiguous narratives, the absence of explicit personality traits and preferences, actions detached from plausible motivations, and a restrictive focus on physical-world reasoning exemplified by false-belief paradigms. OpenToM is constructed to overcome these issues via four principal interventions:
- Generation of multi-paragraph, human-refined narratives averaging over 190 tokens, with richer context and clarity.
- Explicit assignment of one of three personality traits (considerate, inconsiderate, negativistic) to protagonists, and sampling of preferences for both 'mover' and 'observer' agents.
- Synthesis of actions from the latent intentions of agents, driven by their traits and beliefs about each other's preferences, operationalized through rule-based templates.
- Construction of a diverse set of unified, multi-hop questions probing both physical (e.g., location, accessibility, fullness) and psychological (e.g., attitudes, social norms) ToM, at both first- and second-order levels.
Mitigation of spurious cues is integral to the process: ambiguous wording is systematically revised, and causal-graph interventions are applied to block uninformative shortcuts, ensuring that neither an agent's intention nor its personality alone predicts the target attitude [(Xu et al., 2024), §2.3].
2. Dataset Structure and Narrative Construction
OpenToM comprises 696 standard stories supplemented by 100 extra-long (–L) narratives, collectively exceeding 13,700 QA pairs. Each narrative contains two explicitly assigned roles—a 'mover' and an 'observer'—and revolves around a single 'entity of interest' distributed among containers or locations.
Personality traits are uniformly sampled for movers, and true/false preferences are assigned to both agents. The mover's belief about the observer's preference may be perturbed, enforcing decoupling between observable cues and actual latent variables. Agent intentions are instantiated through GPT-3.5-Turbo conditioned on these traits and beliefs, and distilled to singular motivating actions (e.g., moveTo(container), takeFrom(container), discard(...), showOff(...)), with each action traceable to a clearly articulated intention ("to make it more accessible," "to get rid of it") [(Xu et al., 2024), §A.2, Algorithm 1].
Table: Overview of OpenToM Narrative Construction
| Component | Approach | Example |
|---|---|---|
| Narrative Length | >190 tokens; –L: ~492 tokens | Multi-paragraph story |
| Personality Assignment | {considerate, inconsiderate, negativistic} | Mover: considerate; Observer: indifferent |
| Action Generation | GPT-3.5-Turbo + rule-based templates | moveTo(container), showOff(...) |
| Spurious Cue Revision | Human revision + causal-graph interventions | Rewriting ambiguous reasoning chains |
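The sampling logic behind this construction pipeline can be sketched in a few lines. This is an illustrative sketch only, assuming a trait-and-belief lookup in place of the paper's actual GPT-3.5-Turbo intention synthesis; the names (`sample_setup`, `TRAITS`, `ACTIONS`) and the specific trait-to-action mapping are invented for this example.

```python
import random

# Hypothetical sketch of OpenToM's sampling step; the paper generates
# intentions/actions with GPT-3.5-Turbo plus rule-based templates, while a
# lookup table stands in here.

TRAITS = ["considerate", "inconsiderate", "negativistic"]
ACTIONS = {
    # (mover trait, mover's believed observer preference) -> templated action
    ("considerate", True): "moveTo(preferred_container)",  # "to make it more accessible"
    ("considerate", False): "discard(entity)",             # "to get rid of it"
    ("inconsiderate", True): "takeFrom(container)",
    ("inconsiderate", False): "takeFrom(container)",
    ("negativistic", True): "discard(entity)",             # acts against the preference
    ("negativistic", False): "showOff(entity)",
}

def sample_setup(rng: random.Random) -> dict:
    """Sample a trait, the observer's preference, and the mover's belief."""
    trait = rng.choice(TRAITS)                 # uniformly sampled mover trait
    observer_pref = rng.choice([True, False])  # observer's true preference
    # Perturb the mover's belief so observable cues decouple from the latent state.
    mover_belief = observer_pref if rng.random() < 0.5 else not observer_pref
    return {
        "mover_trait": trait,
        "observer_pref": observer_pref,
        "mover_belief": mover_belief,
        "action": ACTIONS[(trait, mover_belief)],
    }
```

The key design point mirrored here is the belief perturbation: because the mover may be wrong about the observer's preference, a model cannot predict the action (or the resulting attitude) from surface cues alone.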
3. Question Taxonomy and Cognitive Challenge
Each standard narrative is associated with up to 23 questions, rising to 31 when attitude probes are enumerated separately, yielding tasks that require both physical and psychological mental-state inference. The main categories are:
- Loc_coarse (binary): Probes for entity presence from various perspectives—first- or second-order beliefs regarding physical location.
- Loc_fine (ternary): Demands precise identification of an entity's current location.
- Multi-Hop (ternary): Fullness and accessibility queries requiring at least 3-hop deductive reasoning and application of social commonsense.
- Attitude (ternary): Requires attribution of the observer's attitude (positive, neutral, negative) toward the mover's action, integrating passion, perspective, and observed action.
Representative reasoning traces (chain-of-thought) accompany gold answers, illuminating the inferential tracks required (e.g., from observed action to preference recognition to attitudinal inference) [(Xu et al., 2024), §3].
4. Evaluation Protocol and Metrics
Six prominent LLMs are evaluated under zero-shot conditions: Llama2-Chat (7B, 13B, 70B), Mixtral-8x7B-Instruct, GPT-3.5-Turbo, and GPT-4-Turbo. Four prompting paradigms are considered: vanilla, Chain-of-Thought (CoT), SimulatedToM (SimToM), and Self-Ask. A fine-tuning baseline is established via LoRA optimization of Llama2-Chat-13B on a balanced subset.
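The four prompting paradigms can be sketched as simple prompt wrappers. The exact instruction wording used in the paper differs; the templates and function name below are assumptions for illustration.

```python
# Hypothetical zero-shot prompt wrappers for the four paradigms evaluated
# in OpenToM; the wording is illustrative, not the paper's actual prompts.

def build_prompt(narrative: str, question: str, paradigm: str = "vanilla") -> str:
    base = f"Story:\n{narrative}\n\nQuestion: {question}\n"
    if paradigm == "vanilla":
        return base + "Answer:"
    if paradigm == "cot":
        # Chain-of-Thought: elicit intermediate reasoning before the answer.
        return base + "Let's think step by step, then give the final answer."
    if paradigm == "simtom":
        # SimulatedToM: first retell the story from the target agent's
        # perspective, then answer from within that perspective.
        return ("Retell the story from the perspective of the agent named in "
                "the question, keeping only events that agent witnessed.\n\n"
                + base + "Answer from that agent's perspective:")
    if paradigm == "self_ask":
        # Self-Ask: decompose into follow-up questions before answering.
        return base + ("If follow-up questions are needed, ask and answer "
                       "them before giving the final answer.")
    raise ValueError(f"unknown paradigm: {paradigm}")
```

The structural difference is what each wrapper asks the model to produce first: nothing (vanilla), a reasoning trace (CoT), a perspective-filtered retelling (SimToM), or a question decomposition (Self-Ask).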
The principal metric is macro-averaged F1, the unweighted mean of per-class F1 scores:

Macro-F1 = (1/|C|) · Σ_{c ∈ C} F1_c,

where C is the set of answer classes, so each class is weighted equally regardless of its frequency. Accuracy and a "corruption rate" (i.e., the proportion of unparseable outputs) are also reported [(Xu et al., 2024), §4.2].
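Both metrics are straightforward to compute; a minimal self-contained sketch (not the authors' evaluation script, and with a naively assumed substring-based parseability check for the corruption rate):

```python
def macro_f1(golds: list, preds: list, classes: list) -> float:
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(golds, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(golds, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(golds, preds) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

def corruption_rate(raw_outputs: list, labels: list) -> float:
    """Fraction of raw model outputs containing no recognizable label."""
    bad = sum(1 for out in raw_outputs
              if not any(lbl in out.lower() for lbl in labels))
    return bad / len(raw_outputs)
```

Because every class contributes 1/|C| to the average, macro-F1 penalizes models that ignore rare classes (e.g., neutral attitudes), which plain accuracy would mask.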
5. Experimental Results and Error Analysis
GPT-4-Turbo achieves the highest macro-F1 on Loc_coarse (0.643), multi-hop (0.658), and Attitude (0.544). Fine-tuned Llama2-Chat-13B nearly matches human performance on Loc_coarse (0.978) and multi-hop (0.936) but falls short on Loc_fine (0.600) and Attitude (0.547). CoT prompting yields substantial performance gains on Loc_coarse and multi-hop for all models (up to +0.30 F1 for GPT-4) but is neutral or deleterious on Loc_fine and Attitude questions. SimToM provides moderate enhancement for Mixtral-8x7B-Instruct on multi-hop tasks (+0.09 F1) without improving Attitude scores. Increased narrative length (–L) reduces model efficacy on location and Attitude probes [(Xu et al., 2024), Table 3–4, A.12].
In-depth error analysis reveals:
- Unfaithfulness: Inconsistent answers to Loc_coarse and Loc_fine on the same story, with partial mitigation via joint prompting.
- ToM-Role Asymmetry: Difficulty increases for LLMs when reasoning about the mover’s view of the observer’s beliefs compared to other roles.
- Psychological ToM Biases: Low recall on neutral/positive attitudes; most errors correlate with mover personality rather than independent attitudinal inference—over 90% of errors are attributable to this spurious association [(Xu et al., 2024), Table 8].
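The unfaithfulness analysis hinges on checking whether a model's coarse and fine answers about the same story agree. A hypothetical sketch of such a consistency check (the function name, signature, and string-matching logic are assumptions, not the paper's analysis code):

```python
def consistent(coarse_answer: str, fine_answer: str, queried_location: str) -> bool:
    """Check agreement between a Loc_coarse answer ('yes'/'no') about a
    queried location and a Loc_fine answer (a location name) for the same
    story and perspective.

    Illustrative rule: if the model denies the entity is at the queried
    location (coarse 'no') yet names that same location in its fine-grained
    answer, the pair is unfaithful.
    """
    at_location = fine_answer.strip().lower() == queried_location.strip().lower()
    return (coarse_answer.strip().lower() == "yes") == at_location
```

Running a check like this over paired answers yields the inconsistency counts behind the unfaithfulness finding; the joint-prompting mitigation asks both questions in one prompt so the model commits to a single location belief.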
6. Implications and Directions for Advancement
Findings indicate that contemporary LLMs exhibit competent tracking of physical-world beliefs with advanced prompting or targeted fine-tuning but encounter persistent challenges in:
- Faithful multi-step reasoning across coarse-to-fine-grained queries.
- Accurate perspective shifts among different ToM orders and agent roles.
- Robust inference of psychological attitudes decoupled from explicit personality-to-emotion mappings.
Proposed future directions include the development of neuro-symbolic pipelines (e.g., PaL, faithful CoT), role-aware ToM frameworks that identify agent roles prior to specialized reasoning, and enhanced social-commonsense state-tracking using dynamic knowledge graphs for nuanced attitudinal modeling [(Xu et al., 2024), §6].
OpenToM thus constitutes a rigorously engineered, public benchmark enabling holistic assessment of the ToM capacities of machine models in both the physical and psychological domains, and provides a challenging platform for subsequent advancements in machine social cognition.