HI-TOM Benchmark: Evaluating LLM Theory of Mind
- HI-TOM Benchmark is a suite of evaluations that rigorously tests large language models’ ability to perform recursive Theory of Mind reasoning.
- It comprises two benchmarks: the FANToM/HI-TOM information-asymmetric dialogue benchmark and the Higher-Order HI-TOM object-tracking benchmark, highlighting challenges of information asymmetry and nested belief attribution.
- Empirical results reveal a steep performance decline at higher ToM orders, underscoring the need for hybrid neural-symbolic approaches in LLM design.
HI-TOM Benchmark is a suite of evaluations designed to rigorously assess the abilities of LLMs to engage in Theory of Mind (ToM) reasoning at higher orders of belief recursion. It operationalizes ToM as a form of recursive mental state attribution, with experimental focus beyond classic false-belief tasks and into domains demanding multi-agent, nested belief tracking. Two primary benchmarks are associated with the HI-TOM name: the FANToM/HI-TOM conversational benchmark targeting information asymmetry and belief attribution in dialogue (Kim et al., 2023), and the Higher-Order HI-TOM benchmark emphasizing explicit multi-level belief nesting in object-tracking narratives (He et al., 2023). Together, these resources illuminate both the empirical limitations and the theoretical structure of ToM in state-of-the-art neural LLMs.
1. Theoretical Underpinnings and Scope
Theory of Mind (ToM), within cognitive science and developmental psychology, is the capacity to reason about the beliefs, desires, intentions, and knowledge of oneself and others (Premack & Woodruff, 1978). The core insight is that natural intelligent agents—humans—perform recursive attribution of mental states, forming beliefs about others’ beliefs and so on. "Orders" of ToM are defined recursively:
- Zero-order (ToM₀): a direct fact about the environment, $p$.
- First-order (ToM₁): agent $a$'s belief about a fact: $B_a(p)$.
- Second-order (ToM₂): agent $a$'s belief about agent $b$'s belief: $B_a(B_b(p))$.
- n-th order: an $n$-fold nesting of belief operators, $B_{a_1}(B_{a_2}(\cdots B_{a_n}(p)\cdots))$.
- Formal recursive definition: $\mathrm{ToM}_0(p) = p$ and $\mathrm{ToM}_n(p) = B_{a_n}(\mathrm{ToM}_{n-1}(p))$ for $n \ge 1$, where $B_x(\cdot)$ reads "agent $x$ believes"; the sketch below illustrates this recursion.
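The recursion can be made concrete in code. The following minimal Python sketch (the class and function names are illustrative and not part of either benchmark) represents nested beliefs as a recursive structure whose nesting depth is the ToM order:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Fact:
    """Zero-order content: a proposition about the world."""
    proposition: str

@dataclass(frozen=True)
class Belief:
    """One level of mental-state attribution: `holder` believes `content`."""
    holder: str
    content: Union["Belief", Fact]

def tom_order(state) -> int:
    """ToM order = number of Belief wrappers nested around the base Fact."""
    return 0 if isinstance(state, Fact) else 1 + tom_order(state.content)

# "Where does Emma think Jack thinks the ball is?" is a second-order query:
query = Belief("Emma", Belief("Jack", Fact("the ball is in the basket")))
assert tom_order(query) == 2
```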
FANToM/HI-TOM (Kim et al., 2023) draws upon two psychological desiderata for ToM evaluation:
- Non-merging: Distinction between self and others’ mental states must not collapse.
- Mentalizing: Correct performance must depend on inference over latent mental representations, not shallow pattern matching.
Traditional ToM evaluations, such as narrative-based false-belief tasks, often contain surface cues or reporting bias. HI-TOM benchmarks attempt to tightly control or eliminate these confounds, embedding ToM challenges within multi-agent interactive settings, including information-asymmetric conversations and recursively structured object-location scenarios (Kim et al., 2023; He et al., 2023).
2. Benchmark Construction and Task Design
FANToM/HI-TOM: Information-Asymmetric Dialogue
FANToM/HI-TOM (Kim et al., 2023) comprises 256 automatically generated multi-party dialogues, each covering a general topic (e.g., pets, retirement planning) and associated subtopics. Participants enter and leave the conversation transiently (e.g., “I’ll grab a coffee,” “I’m back—what did I miss?”). Interleaved departures and re-entries induce structured information asymmetries: participants rejoining after absences lack access to utterances made during their absence.
For each dialogue:
- Participants: 2–5 (median 3)
- Turns per dialogue (short context):
- Words per turn:
- Contexts per dialogue: short context (the departure/return window) and full context (the entire conversation)
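The induced information asymmetry can be stated operationally: a participant's accessible context is exactly the set of utterances made while they were present. The sketch below illustrates this with a hypothetical turn/presence layout that is not the benchmark's actual data schema:

```python
# Minimal sketch of FANToM-style information asymmetry: an agent's accessible
# context is the subset of turns uttered while that agent was present.
# The turn/presence layout here is hypothetical, not the benchmark's schema.

turns = [
    {"speaker": "Kim", "text": "My cat loves the new scratching post.", "present": {"Kim", "Lee", "Ana"}},
    {"speaker": "Lee", "text": "I'll grab a coffee.",                   "present": {"Kim", "Lee", "Ana"}},
    {"speaker": "Ana", "text": "By the way, the vet moved to Oak St.",  "present": {"Kim", "Ana"}},  # Lee absent
    {"speaker": "Lee", "text": "I'm back, what did I miss?",            "present": {"Kim", "Lee", "Ana"}},
]

def accessible_turns(agent: str, turns: list[dict]) -> list[str]:
    """Return the utterances the agent actually heard (was present for)."""
    return [t["text"] for t in turns if agent in t["present"]]

# Lee never heard the third turn, so a correct first-order BeliefQ answer about
# the vet's location must reflect Lee's limited knowledge, not the transcript.
print(accessible_turns("Lee", turns))
```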
HI-TOM: Higher-Order Recursive Reasoning
HI-TOM (He et al., 2023) is constructed from “Sally–Anne”-style stories, each spanning 1–3 chapters of an object-finding game with five agents. Choreographed entry, object movements, agent exits, and private/public communication (including deceptive claims) create complex belief structures.
For each story:
- Agents: 5 (e.g., Emma, Jack)
- Containers per story: 12–15
- Questions: One per ToM order ($0$ to $4$), i.e., from "Where is the object really?" (order 0) up to fourth-order queries of the form "Where does A think B thinks C thinks D thinks it is?"
- Ground-truth belief-tracking: Automated and manually verified for logical consistency
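Belief tracking of this kind can be approximated by simulating which object movements each agent witnesses. The following simplified sketch (first-order only, with hypothetical event fields rather than HI-TOM's actual schema) updates an agent's believed object location only on moves made while that agent is present:

```python
# Simplified sketch of first-order belief tracking in a Sally-Anne-style story.
# Event fields ("mover", "obj", "to", "present") are illustrative only.

events = [
    {"mover": "Emma", "obj": "apple", "to": "green_box",  "present": {"Emma", "Jack"}},
    {"mover": "Jack", "obj": "apple", "to": "blue_crate", "present": {"Jack"}},  # Emma has left
]

def believed_location(agent: str, obj: str, events: list[dict]) -> str | None:
    """Agent's first-order belief: the last move of `obj` that the agent witnessed."""
    belief = None
    for e in events:
        if e["obj"] == obj and agent in e["present"]:
            belief = e["to"]
    return belief

assert believed_location("Jack", "apple", events) == "blue_crate"  # Jack saw both moves
assert believed_location("Emma", "apple", events) == "green_box"   # Emma missed the second move
# Higher orders nest this logic: Jack's belief about Emma's belief is computed
# over the events Jack knows Emma witnessed, and so on recursively.
```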
3. Question Types, Formal Metrics, and Scoring
FANToM/HI-TOM Question Typology
For each inaccessible information piece (a FactQ), six ToM-style questions are generated:
- FactQ: Factual, free response about the ground truth.
- BeliefQ: Free response on what an agent who missed the relevant utterances would believe (see the scoring sketch after this list).
- Scored via Sentence-BERT embedding cosine distance and token-level F₁ overlap against the limited-knowledge rephrasing of the fact.
- BeliefMCQ: Multiple choice (two options: Full FactA, Limited FactA).
- AnswerabilityListQ: List all agents who know the correct answer.
- InfoAccessListQ: List all agents who have access to a given fact (Full FactA stated explicitly).
- AnswerabilityYN/InfoAccessYN: Binary (per character) for knowledge presence.
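For the BeliefQ free-response scoring referenced above, token-level F₁ can be computed as in the generic sketch below; the Sentence-BERT cosine-distance component is omitted here, so this is not the benchmark's full scoring pipeline:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap, as commonly used for free-form QA scoring."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A BeliefQ response is scored against the limited-knowledge rephrasing of the
# fact (what the absent agent could plausibly believe), not the full fact.
print(token_f1("Lee thinks the vet is still downtown",
               "He believes the vet is downtown"))
```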
Aggregate metric: “AllQTypes” is 1 if all six questions per info-piece are answered correctly, 0 otherwise. “All*” accuracy is the mean of AllQTypes. "All" is used when excluding BeliefMCQ for human comparison.
Basic metrics: accuracy per individual question type (the fraction of questions of that type answered correctly), reported alongside the aggregate scores above.
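The All* aggregation can be sketched as follows, using a hypothetical per-piece record layout: an information piece contributes 1 only if every associated question type is answered correctly, and All* is the mean over pieces:

```python
# Sketch of the All* aggregate: AllQTypes is 1 per information piece only if
# all six associated question types are correct; All* is the mean over pieces.
# The record layout is hypothetical.

results = [
    {"FactQ": 1, "BeliefQ": 1, "BeliefMCQ": 1, "AnswerabilityListQ": 1,
     "InfoAccessListQ": 1, "AnswerabilityYN": 1},
    {"FactQ": 1, "BeliefQ": 0, "BeliefMCQ": 1, "AnswerabilityListQ": 1,
     "InfoAccessListQ": 1, "AnswerabilityYN": 1},
]

def all_star(results: list[dict]) -> float:
    """Mean of the per-piece AllQTypes indicator (1 iff every question correct)."""
    return sum(all(r.values()) for r in results) / len(results)

print(all_star(results))  # 0.5: the second piece fails on BeliefQ
```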
HI-TOM Higher-Order Metrics
- Standard accuracy at each ToM order $k$: $\mathrm{Acc}_k = \dfrac{\#\,\text{order-}k\text{ questions answered correctly}}{\#\,\text{order-}k\text{ questions}}$
- Joint accuracy (all orders $0$ through $4$ correct on a single story): $\mathrm{Acc}_{\mathrm{joint}} = \dfrac{\#\,\text{stories with every order answered correctly}}{\#\,\text{stories}}$; both are computed in the sketch after this list.
- Error taxonomy—see Section 5.
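Both metrics follow directly from per-question correctness records, as in this sketch (the result layout is hypothetical):

```python
# Per-order accuracy and joint (all-orders-correct) accuracy over HI-TOM-style
# stories; each story records correctness for ToM orders 0 through 4.
# The result layout is hypothetical.

stories = [
    {0: True, 1: True, 2: True, 3: False, 4: False},
    {0: True, 1: True, 2: True, 3: True,  4: True},
]

def accuracy_at_order(stories: list[dict], k: int) -> float:
    """Fraction of stories whose order-k question was answered correctly."""
    return sum(s[k] for s in stories) / len(stories)

def joint_accuracy(stories: list[dict]) -> float:
    """Fraction of stories with every order (0 through 4) answered correctly."""
    return sum(all(s.values()) for s in stories) / len(stories)

print([accuracy_at_order(stories, k) for k in range(5)])  # [1.0, 1.0, 1.0, 0.5, 0.5]
print(joint_accuracy(stories))                            # 0.5
```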
4. Empirical Results and Model Performance
FANToM/HI-TOM Performance (Kim et al., 2023)
| Model & Setting | All* | AllAnswerability | AllInfoAccess |
|---|---|---|---|
| Human | 87.5% | 90.6% | 90.6% |
| GPT-4 (0613) + CoT | 26.6% | 40.2% | 57.7% |
| ChatGPT + CoT | 3.7% | 20.7% | 17.1% |
| Flan-T5-XL + FT | 53.7% | 55.9% | 54.4% |
| Mistral/Falcon-Instruct (open-source, zero-shot) | 0.1% | — | — |
Observations:
- There is a pronounced gap between human and model performance, even for high-capacity LLMs with chain-of-thought (CoT) prompting and fine-tuning.
- Open-source zero-shot models fail almost entirely on composite metrics like All*.
- Chain-of-thought generally reduces false positives but introduces false negatives.
HI-TOM Higher-Order Results (He et al., 2023)
| Model | Style | w/o deception | w/ deception | Overall | Δ (CoT-Vanilla) |
|---|---|---|---|---|---|
| Guanaco-65B | VP | 49.3% | 42.0% | 45.7% | — |
| Guanaco-65B | CoT | 35.0% | 27.0% | 31.0% | –2.3 |
| Claude-instant | VP | 28.7% | 26.3% | 27.5% | — |
| Claude-instant | CoT | 31.5% | 24.8% | 28.2% | +0.7 |
| GPT-3.5-turbo | VP | 60.4% | 55.8% | 58.1% | — |
| GPT-3.5-turbo | CoT | 67.4% | 64.8% | 66.1% | +8.0 |
| GPT-4-32k | VP | 64.0% | 55.7% | 59.9% | — |
| GPT-4-32k | CoT | 65.6% | 55.6% | 60.5% | +1.8 |
Performance trends:
- Zero-order (factual) accuracy: near-perfect.
- Fast degradation at each higher ToM order: at the highest orders, accuracy for leading models drops to the 5–25% range.
- Deception reduces accuracy by up to 30% at higher orders.
5. Error Taxonomy and Model Failure Modes
The benchmarks reveal characteristic, recurring LLM failure types:
| Error Type | Description | Example Snippet |
|---|---|---|
| Insufficient Reasoning-Depth | Fails to perform all nested inference steps | Answers 1st for a 3rd-order query |
| Commonsense Errors | Violates real-world knowledge | “Aiden exited before … but still saw the move.” |
| Hallucinations | Unwarranted fabrication of details | “Benjamin lied about seeing a cat … real plan.” |
| Temporal Ignorance | Chronological confusion of events | “Lily exited before step 11 but …” |
| Spurious Causal Inference | Infers non-evident cause–effect relationships | “Private claim → no reason to doubt…” |
Frequency of such errors increases markedly with ToM order. At the highest orders, failing to maintain recursion depth and temporal consistency is the primary limiting factor.
Additional observations from FANToM/HI-TOM:
- Illusory ToM: LLMs may correctly answer free-form belief queries but fail structured answerability or information-access queries, indicating lack of internally consistent belief modeling.
- Fact vs. Belief Suppression: Success on factual extraction often does not translate to suppression of omniscient information required for true ToM reasoning.
- Format Sensitivity: Multiple-choice questions are easier for models than free-form questions; however, even multiple-choice accuracy often remains at chance.
- Reasoning Complexity: Multi-step list questions (e.g., “Who knows the fact?”) are substantially harder than single-step questions.
6. Implications and Theoretical Insights
The HI-TOM suite delineates a strict empirical boundary between the shallow, associative “pattern recognition” characteristic of current LLMs (“System 1”) and the explicit, recursive, symbolic inference required by higher-order ToM (“System 2”) (He et al., 2023). The steep drop in performance as ToM order increases emphasizes that large neural LLMs frequently default to shortcut heuristics, skip inference steps, or propagate unsubstantiated information.
- The benchmarks highlight the need for hybrid neural-symbolic architectures that combine pattern-recognition (LLM) modules with explicit symbolic belief trackers, particularly for high-order ToM tasks (a minimal sketch of this division of labor follows this list).
- They serve as controlled testbeds for deception detection, empathetic conversational agents, and multi-party dialogue systems.
- Conversely, HI-TOM-style tasks offer a unique lens on human cognition, although direct analogy must be drawn carefully due to fundamental architectural differences.
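One concrete reading of the hybrid proposal is sketched below under simplifying assumptions: a deterministic tracker resolves the nested belief chain, and the neural model is used only to phrase the grounded answer. The intersection of witnessed events is a deliberate simplification of the full recursive belief semantics, and `llm_generate` stands in for any LLM call; none of this is an implementation from the cited papers.

```python
from typing import Callable

def resolve_nested_belief(chain: list[str],
                          witnessed: dict[str, set[int]],
                          moves: list[tuple[int, str]]) -> str:
    """Resolve 'chain[0] thinks chain[1] thinks ... the object is where?'.
    witnessed[agent] holds the indices of moves that agent observed; each
    level intersects these sets, a simplification used only to illustrate
    the symbolic step of the hybrid pipeline."""
    visible = set(range(len(moves)))
    for agent in chain:
        visible &= witnessed.get(agent, set())
    answer = "unknown"
    for idx, destination in moves:
        if idx in visible:
            answer = destination  # last witnessed move wins
    return answer

def answer_tom_query(question: str, chain: list[str],
                     witnessed: dict[str, set[int]],
                     moves: list[tuple[int, str]],
                     llm_generate: Callable[[str], str]) -> str:
    grounded = resolve_nested_belief(chain, witnessed, moves)        # symbolic step
    return llm_generate(f"{question}\nGrounded answer: {grounded}")  # neural step

# Example: Emma saw only the first move, Jack saw both.
moves = [(0, "green_box"), (1, "blue_crate")]
witnessed = {"Emma": {0}, "Jack": {0, 1}}
print(resolve_nested_belief(["Emma", "Jack"], witnessed, moves))  # green_box
```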
7. Limitations and Prospective Directions
Current HI-TOM results are bounded by several factors:
- Only a limited set of LLMs (GPT-4, GPT-3.5, Claude, Guanaco) have been rigorously evaluated.
- The impact of reinforcement learning from human feedback (RLHF), retrieval augmentation, or alternative fine-tuning regimes has not been established.
- HI-TOM’s scripted stories, while logically consistent, may not capture the diversity or implicitness of real-world conversational belief cues.
Future research avenues include:
- Extending benchmarks to genuine conversational transcripts and more implicit attributions.
- Integrating explicit belief-tracking or symbolic inference mechanisms into LLM pipelines.
- Examining cross-lingual ToM reasoning and multilingual transfer.
- Assessing the effect of RLHF, retrieval augmentation, and fine-tuning on ToM depth and consistency.
HI-TOM marks a critical advance in ToM evaluation for LLMs, demarcating the limits of current architectures and charting a path forward for systems that more closely approximate human-level belief reasoning (Kim et al., 2023; He et al., 2023).