HI-TOM Benchmark: Evaluating LLM Theory of Mind
- HI-TOM Benchmark is a suite of evaluations that rigorously tests large language models’ ability to perform recursive Theory of Mind reasoning.
- It comprises two benchmarks: the FANToM/HI-TOM information-asymmetric dialogue benchmark and the Higher-Order HI-TOM object-tracking benchmark, highlighting challenges of information asymmetry and nested belief attribution.
- Empirical results reveal a steep performance decline at higher ToM orders, underscoring the need for hybrid neural-symbolic approaches in LLM design.
HI-TOM Benchmark is a suite of evaluations designed to rigorously assess the abilities of LLMs to engage in Theory of Mind (ToM) reasoning at higher orders of belief recursion. It operationalizes ToM as a form of recursive mental state attribution, with experimental focus beyond classic false-belief tasks and into domains demanding multi-agent, nested belief tracking. Two primary benchmarks are associated with the HI-TOM name: the FANToM/HI-TOM conversational benchmark targeting information asymmetry and belief attribution in dialogue (Kim et al., 2023), and the Higher-Order HI-TOM benchmark emphasizing explicit multi-level belief nesting in object-tracking narratives (He et al., 2023). Together, these resources illuminate both the empirical limitations and the theoretical structure of ToM in state-of-the-art neural LLMs.
1. Theoretical Underpinnings and Scope
Theory of Mind (ToM), within cognitive science and developmental psychology, is the capacity to reason about the beliefs, desires, intentions, and knowledge of oneself and others (Premack & Woodruff, 1978). The core insight is that natural intelligent agents—humans—perform recursive attribution of mental states, forming beliefs about others’ beliefs and so on. "Orders" of ToM are defined recursively:
- Zero-order (ToM₀): a direct fact about the environment, $p$.
- First-order (ToM₁): agent $a$'s belief about a fact: $B_a(p)$.
- Second-order (ToM₂): agent $a$'s belief about agent $b$'s belief: $B_a(B_b(p))$.
- n-th order: an $n$-fold nesting of belief operators, $B_{a_1}(B_{a_2}(\cdots B_{a_n}(p)\cdots))$.
- Formal recursive definition: $\mathrm{ToM}_0(p) = p$ and $\mathrm{ToM}_n(p) = B_{a_n}(\mathrm{ToM}_{n-1}(p))$ for $n \ge 1$, where $B_x(\cdot)$ reads "agent $x$ believes"; the sketch below illustrates this recursion.
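The recursion can be made concrete in code. The following minimal Python sketch (the class and function names are illustrative and not part of either benchmark) represents nested beliefs as a recursive structure whose nesting depth is the ToM order:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Fact:
    """Zero-order content: a proposition about the world."""
    proposition: str

@dataclass(frozen=True)
class Belief:
    """One level of mental-state attribution: `holder` believes `content`."""
    holder: str
    content: Union["Belief", Fact]

def tom_order(state) -> int:
    """ToM order = number of Belief wrappers nested around the base Fact."""
    return 0 if isinstance(state, Fact) else 1 + tom_order(state.content)

# "Where does Emma think Jack thinks the ball is?" is a second-order query:
query = Belief("Emma", Belief("Jack", Fact("the ball is in the basket")))
assert tom_order(query) == 2
```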
FANToM/HI-TOM (Kim et al., 2023) draws upon two psychological desiderata for ToM evaluation:
- Non-merging: Distinction between self and others’ mental states must not collapse.
- Mentalizing: Correct performance must depend on inference over latent mental representations, not shallow pattern matching.
Traditional ToM evaluations, such as narrative-based false-belief tasks, often contain surface cues or reporting bias. HI-TOM benchmarks attempt to tightly control or eliminate these confounds, embedding ToM challenges within multi-agent interactive settings, including information-asymmetric conversations and recursively structured object-location scenarios (Kim et al., 2023; He et al., 2023).
2. Benchmark Construction and Task Design
FANToM/HI-TOM: Information-Asymmetric Dialogue
FANToM/HI-TOM (Kim et al., 2023) comprises 256 automatically generated multi-party dialogues, each covering a general topic (e.g., pets, retirement planning) and associated subtopics. Participants enter and leave the conversation transiently (e.g., “I’ll grab a coffee,” “I’m back—what did I miss?”). Interleaved departures and re-entries induce structured information asymmetries: participants rejoining after absences lack access to utterances made during their absence.
For each dialogue:
- Participants: 2–5 (median 3)
- Turns per dialogue (short context):
- Words per turn:
- Contexts per dialogue: short context (the departure/return window) and full context (the entire conversation)
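The induced information asymmetry can be stated operationally: a participant's accessible context is exactly the set of utterances made while they were present. The sketch below illustrates this with a hypothetical turn/presence layout that is not the benchmark's actual data schema:

```python
# Minimal sketch of FANToM-style information asymmetry: an agent's accessible
# context is the subset of turns uttered while that agent was present.
# The turn/presence layout here is hypothetical, not the benchmark's schema.

turns = [
    {"speaker": "Kim", "text": "My cat loves the new scratching post.", "present": {"Kim", "Lee", "Ana"}},
    {"speaker": "Lee", "text": "I'll grab a coffee.",                   "present": {"Kim", "Lee", "Ana"}},
    {"speaker": "Ana", "text": "By the way, the vet moved to Oak St.",  "present": {"Kim", "Ana"}},  # Lee absent
    {"speaker": "Lee", "text": "I'm back, what did I miss?",            "present": {"Kim", "Lee", "Ana"}},
]

def accessible_turns(agent: str, turns: list[dict]) -> list[str]:
    """Return the utterances the agent actually heard (was present for)."""
    return [t["text"] for t in turns if agent in t["present"]]

# Lee never heard the third turn, so a correct first-order BeliefQ answer about
# the vet's location must reflect Lee's limited knowledge, not the transcript.
print(accessible_turns("Lee", turns))
```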
HI-TOM: Higher-Order Recursive Reasoning
HI-TOM (He et al., 2023) is constructed from “Sally–Anne”-style stories, each spanning 1–3 chapters of an object-finding game with five agents. Choreographed entry, object movements, agent exits, and private/public communication (including deceptive claims) create complex belief structures.
For each story:
- Agents: 5 (e.g., Emma, Jack)
- Containers per story: 12–15
- Questions: One per ToM order ($0$ to $4$), i.e., from "Where is the object really?" (order 0) up to fourth-order queries of the form "Where does A think B thinks C thinks D thinks it is?"
- Ground-truth belief-tracking: Automated and manually verified for logical consistency
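Belief tracking of this kind can be approximated by simulating which object movements each agent witnesses. The following simplified sketch (first-order only, with hypothetical event fields rather than HI-TOM's actual schema) updates an agent's believed object location only on moves made while that agent is present:

```python
# Simplified sketch of first-order belief tracking in a Sally-Anne-style story.
# Event fields ("mover", "obj", "to", "present") are illustrative only.

events = [
    {"mover": "Emma", "obj": "apple", "to": "green_box",  "present": {"Emma", "Jack"}},
    {"mover": "Jack", "obj": "apple", "to": "blue_crate", "present": {"Jack"}},  # Emma has left
]

def believed_location(agent: str, obj: str, events: list[dict]) -> str | None:
    """Agent's first-order belief: the last move of `obj` that the agent witnessed."""
    belief = None
    for e in events:
        if e["obj"] == obj and agent in e["present"]:
            belief = e["to"]
    return belief

assert believed_location("Jack", "apple", events) == "blue_crate"  # Jack saw both moves
assert believed_location("Emma", "apple", events) == "green_box"   # Emma missed the second move
# Higher orders nest this logic: Jack's belief about Emma's belief is computed
# over the events Jack knows Emma witnessed, and so on recursively.
```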
3. Question Types, Formal Metrics, and Scoring
FANToM/HI-TOM Question Typology
For each inaccessible information piece (a FactQ), six ToM-style questions are generated:
- FactQ: Factual, free response about the ground truth.
- BeliefQ: Free response on what an agent who missed the relevant utterances would believe (see the scoring sketch after this list).
- Scored via Sentence-BERT embedding cosine distance and token-level F₁ overlap against the limited-knowledge rephrasing of the fact.
- BeliefMCQ: Multiple choice (two options: Full FactA, Limited FactA).
- AnswerabilityListQ: List all agents who know the correct answer.
- InfoAccessListQ: List all agents who have access to a given fact (Full FactA stated explicitly).
- AnswerabilityYN/InfoAccessYN: Binary (per character) for knowledge presence.
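For the BeliefQ free-response scoring referenced above, token-level F₁ can be computed as in the generic sketch below; the Sentence-BERT cosine-distance component is omitted here, so this is not the benchmark's full scoring pipeline:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap, as commonly used for free-form QA scoring."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A BeliefQ response is scored against the limited-knowledge rephrasing of the
# fact (what the absent agent could plausibly believe), not the full fact.
print(token_f1("Lee thinks the vet is still downtown",
               "He believes the vet is downtown"))
```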
Aggregate metric: “AllQTypes” is 1 if all six questions per info-piece are answered correctly, 0 otherwise. “All*” accuracy is the mean of AllQTypes. "All" is used when excluding BeliefMCQ for human comparison.
Basic metrics: accuracy per individual question type (the fraction of questions of that type answered correctly), reported alongside the aggregate scores above.
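The All* aggregation can be sketched as follows, using a hypothetical per-piece record layout: an information piece contributes 1 only if every associated question type is answered correctly, and All* is the mean over pieces:

```python
# Sketch of the All* aggregate: AllQTypes is 1 per information piece only if
# all six associated question types are correct; All* is the mean over pieces.
# The record layout is hypothetical.

results = [
    {"FactQ": 1, "BeliefQ": 1, "BeliefMCQ": 1, "AnswerabilityListQ": 1,
     "InfoAccessListQ": 1, "AnswerabilityYN": 1},
    {"FactQ": 1, "BeliefQ": 0, "BeliefMCQ": 1, "AnswerabilityListQ": 1,
     "InfoAccessListQ": 1, "AnswerabilityYN": 1},
]

def all_star(results: list[dict]) -> float:
    """Mean of the per-piece AllQTypes indicator (1 iff every question correct)."""
    return sum(all(r.values()) for r in results) / len(results)

print(all_star(results))  # 0.5: the second piece fails on BeliefQ
```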
HI-TOM Higher-Order Metrics
- Standard accuracy at each ToM order $k$: $\mathrm{Acc}_k = \dfrac{\#\,\text{order-}k\text{ questions answered correctly}}{\#\,\text{order-}k\text{ questions}}$
- Joint accuracy (all orders $0$ through $4$ correct on a single story): $\mathrm{Acc}_{\mathrm{joint}} = \dfrac{\#\,\text{stories with every order answered correctly}}{\#\,\text{stories}}$; both are computed in the sketch after this list.
- Error taxonomy—see Section 5.
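Both metrics follow directly from per-question correctness records, as in this sketch (the result layout is hypothetical):

```python
# Per-order accuracy and joint (all-orders-correct) accuracy over HI-TOM-style
# stories; each story records correctness for ToM orders 0 through 4.
# The result layout is hypothetical.

stories = [
    {0: True, 1: True, 2: True, 3: False, 4: False},
    {0: True, 1: True, 2: True, 3: True,  4: True},
]

def accuracy_at_order(stories: list[dict], k: int) -> float:
    """Fraction of stories whose order-k question was answered correctly."""
    return sum(s[k] for s in stories) / len(stories)

def joint_accuracy(stories: list[dict]) -> float:
    """Fraction of stories with every order (0 through 4) answered correctly."""
    return sum(all(s.values()) for s in stories) / len(stories)

print([accuracy_at_order(stories, k) for k in range(5)])  # [1.0, 1.0, 1.0, 0.5, 0.5]
print(joint_accuracy(stories))                            # 0.5
```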
4. Empirical Results and Model Performance
FANToM/HI-TOM Performance (Kim et al., 2023)
| Model & Setting | All* | AllAnswerability | AllInfoAccess |
|---|---|---|---|
| Human | 87.5% | 90.6% | 90.6% |
| GPT-4 (0613) + CoT | 26.6% | 40.2% | 57.7% |
| ChatGPT + CoT | 3.7% | 20.7% | 17.1% |
| Flan-T5-XL + FT | 53.7% | 55.9% | 54.4% |
| Mistral/Falcon-Instruct (open-source, zero-shot) | 0.1% | — | — |
Observations:
- There is a pronounced gap between human and model performance, even for high-capacity LLMs with chain-of-thought (CoT) prompting and fine-tuning.
- Open-source zero-shot models fail almost entirely on composite metrics like All*.
- Chain-of-thought generally reduces false positives but introduces false negatives.
HI-TOM Higher-Order Results (He et al., 2023)
| Model | Style | w/o deception | w/ deception | Overall | Δ (CoT-Vanilla) |
|---|---|---|---|---|---|
| Guanaco-65B | VP | 49.3% | 42.0% | 45.7% | — |
| Guanaco-65B | CoT | 35.0% | 27.0% | 31.0% | –2.3 |
| Claude-instant | VP | 28.7% | 26.3% | 27.5% | — |
| Claude-instant | CoT | 31.5% | 24.8% | 28.2% | +0.7 |
| GPT-3.5-turbo | VP | 60.4% | 55.8% | 58.1% | — |
| GPT-3.5-turbo | CoT | 67.4% | 64.8% | 66.1% | +8.0 |
| GPT-4-32k | VP | 64.0% | 55.7% | 59.9% | — |
| GPT-4-32k | CoT | 65.6% | 55.6% | 60.5% | +1.8 |
Performance trends:
- Zero-order (factual) accuracy: near-perfect.
- Fast degradation at each higher ToM order: at the highest orders, accuracy for leading models drops to the 5–25% range.
- Deception reduces accuracy by up to 30% at higher orders.
5. Error Taxonomy and Model Failure Modes
The benchmarks reveal characteristic, recurring LLM failure types:
| Error Type | Description | Example Snippet |
|---|---|---|
| Insufficient Reasoning-Depth | Fails to perform all nested inference steps | Answers 1st for a 3rd-order query |
| Commonsense Errors | Violates real-world knowledge | “Aiden exited before … but still saw the move.” |
| Hallucinations | Unwarranted fabrication of details | “Benjamin lied about seeing a cat … real plan.” |
| Temporal Ignorance | Chronological confusion of events | “Lily exited before step 11 but …” |
| Spurious Causal Inference | Infers non-evident cause–effect relationships | “Private claim → no reason to doubt…” |
Frequency of such errors increases markedly with ToM order. At the highest orders, failing to maintain recursion depth and temporal consistency is the primary limiting factor.
Additional observations from FANToM/HI-TOM:
- Illusory ToM: LLMs may correctly answer free-form belief queries but fail structured answerability or information-access queries, indicating lack of internally consistent belief modeling.
- Fact vs. Belief Suppression: Success on factual extraction often does not translate to suppression of omniscient information required for true ToM reasoning.
- Format Sensitivity: Multiple-choice questions are easier for models than free-form questions; however, even multiple-choice accuracy often remains at chance.
- Reasoning Complexity: Multi-step list questions (e.g., “Who knows the fact?”) are substantially harder than single-step questions.
6. Implications and Theoretical Insights
The HI-TOM suite delineates a strict empirical boundary between the shallow, associative “pattern recognition” characteristic of current LLMs (“System 1”) and the explicit, recursive, symbolic inference required by higher-order ToM (“System 2”) (He et al., 2023). The steep drop in performance as ToM order increases emphasizes that large neural LLMs frequently default to shortcut heuristics, skip inference steps, or propagate unsubstantiated information.
- The benchmarks highlight the need for hybrid neural-symbolic architectures that combine pattern-recognition (LLM) modules with explicit symbolic belief trackers, particularly for high-order ToM tasks (a minimal sketch of this division of labor follows this list).
- They serve as controlled testbeds for deception detection, empathetic conversational agents, and multi-party dialogue systems.
- Conversely, HI-TOM-style tasks offer a unique lens on human cognition, although direct analogy must be drawn carefully due to fundamental architectural differences.
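One concrete reading of the hybrid proposal is sketched below under simplifying assumptions: a deterministic tracker resolves the nested belief chain, and the neural model is used only to phrase the grounded answer. The intersection of witnessed events is a deliberate simplification of the full recursive belief semantics, and `llm_generate` stands in for any LLM call; none of this is an implementation from the cited papers.

```python
from typing import Callable

def resolve_nested_belief(chain: list[str],
                          witnessed: dict[str, set[int]],
                          moves: list[tuple[int, str]]) -> str:
    """Resolve 'chain[0] thinks chain[1] thinks ... the object is where?'.
    witnessed[agent] holds the indices of moves that agent observed; each
    level intersects these sets, a simplification used only to illustrate
    the symbolic step of the hybrid pipeline."""
    visible = set(range(len(moves)))
    for agent in chain:
        visible &= witnessed.get(agent, set())
    answer = "unknown"
    for idx, destination in moves:
        if idx in visible:
            answer = destination  # last witnessed move wins
    return answer

def answer_tom_query(question: str, chain: list[str],
                     witnessed: dict[str, set[int]],
                     moves: list[tuple[int, str]],
                     llm_generate: Callable[[str], str]) -> str:
    grounded = resolve_nested_belief(chain, witnessed, moves)        # symbolic step
    return llm_generate(f"{question}\nGrounded answer: {grounded}")  # neural step

# Example: Emma saw only the first move, Jack saw both.
moves = [(0, "green_box"), (1, "blue_crate")]
witnessed = {"Emma": {0}, "Jack": {0, 1}}
print(resolve_nested_belief(["Emma", "Jack"], witnessed, moves))  # green_box
```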
7. Limitations and Prospective Directions
Current HI-TOM results are bounded by several factors:
- Only a limited set of LLMs (GPT-4, GPT-3.5, Claude, Guanaco) have been rigorously evaluated.
- The impact of reinforcement learning from human feedback (RLHF), retrieval augmentation, or alternative fine-tuning regimes has not been established.
- HI-TOM’s scripted stories, while logically consistent, may not capture the diversity or implicitness of real-world conversational belief cues.
Future research avenues include:
- Extending benchmarks to genuine conversational transcripts and more implicit attributions.
- Integrating explicit belief-tracking or symbolic inference mechanisms into LLM pipelines.
- Examining cross-lingual ToM reasoning and multilingual transfer.
- Assessing the effect of RLHF, retrieval augmentation, and fine-tuning on ToM depth and consistency.
HI-TOM marks a critical advance in ToM evaluation for LLMs, demarcating the limits of current architectures and charting a path forward for systems that more closely approximate human-level belief reasoning (Kim et al., 2023; He et al., 2023).