ToMBench: Benchmark for Theory of Mind

Updated 8 February 2026
  • ToMBench is a contamination-resistant benchmark designed to systematically assess Theory of Mind in language models through a broad taxonomy of tasks and fine-grained abilities.
  • It employs a unified multiple-choice format with rigorous data contamination control, enabling automated, unbiased evaluation of social cognition.
  • LLM evaluations on ToMBench reveal strengths in tasks like faux-pas recognition but expose persistent gaps in deeper inferential reasoning compared to human performance.

ToMBench is a systematic, contamination-resistant benchmark designed to evaluate the Theory of Mind (ToM) capabilities of large language models (LLMs). ToM refers to the ability to attribute mental states such as beliefs, desires, emotions, and intentions to oneself and to others. ToMBench was created in response to limitations of prior evaluation protocols, which were overly narrow in scope, subjective in scoring, or contaminated by overlap with public ToM inventories. By introducing a broad taxonomy of tasks and abilities, a unified multiple-choice evaluation format, and a rigorously managed data pipeline, ToMBench provides an automated, large-scale framework for assessing LLMs' social cognition in both breadth and depth (Chen et al., 2024).

1. Motivation and Design Principles

ToMBench addresses three core challenges in empirical ToM assessment for LLMs:

  • Breadth of Coverage: Most prior benchmarks focused on isolated paradigms (e.g., false-belief reasoning), missing the multi-faceted nature of social cognition. ToMBench defines eight canonical ToM tasks and 31 distinct abilities, each grounded in classic psychological research.
  • Automated, Unbiased Evaluation: Open-ended response formats require costly, subjective human scoring and are vulnerable to prompt leakage. ToMBench uses multiple-choice questions with tightly designed distractors, enabling fully automated and unbiased accuracy assessment.
  • Data Contamination Control: Many standard ToM tasks appear verbatim in public model training corpora, undermining claims of generalization. All ToMBench items are authored from scratch in Chinese and then translated to English, with no overlap with published ToM inventories and strict separation of translation and key creation processes.

These design principles aim to provide a robust, large-scale, and leakage-resistant platform for longitudinally measuring LLM progress in social intelligence (Chen et al., 2024).

2. Task and Ability Taxonomy

ToMBench structures its assessment space along two orthogonal axes:

  • Tasks: Eight main ToM task types ($\mathcal{T}$), each capturing distinct cognitive phenomena:
    1. Unexpected Outcome Test (UOT)
    2. Scalar Implicature Task (SIT)
    3. Persuasion Story Task (PST)
    4. False Belief Task (FBT)
    5. Ambiguous Story Task (AST)
    6. Hinting Test (HT)
    7. Strange Story Task (SST)
    8. Faux-pas Recognition Test (FRT)

Each sample presents a brief story, a targeted ToM question, and multiple answer choices (typically $k=3$ or $k=4$).

  • Abilities: Thirty-one atomic social-cognition abilities ($\mathcal{A}$) spanning six core ToM dimensions:
    • Emotion (7 abilities),
    • Desire (4),
    • Intention (4),
    • Knowledge (4),
    • Belief (6),
    • Non-literal Communication (6).

The ATOMS (“Abilities in the Theory-of-Mind Space”) framework underpins this ability decomposition.

All 2,860 questions in the benchmark are annotated with both a task label and a fine-grained ability label, supporting macro- and micro-level quantitative analysis (Chen et al., 2024).
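For concreteness, this two-axis annotation can be pictured as a per-item record like the sketch below. All field names and example values are illustrative assumptions, not ToMBench's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToMBenchItem:
    """Hypothetical per-item record; field names are assumptions for illustration."""
    story: str           # brief narrative, possibly shared by several questions
    question: str        # targeted ToM question about the story
    options: List[str]   # k = 3 or 4 answer choices, one of them correct
    answer: str          # gold option label, kept separate from the translation step
    task: str            # one of the eight task types, e.g. "False Belief Task"
    ability: str         # one of the 31 ATOMS abilities (illustrative label below)
    story_id: str        # groups questions that share a story (used for coherence scoring)

# An illustrative item in the classic false-belief paradigm
item = ToMBenchItem(
    story="Anne puts her chocolate in the drawer and leaves; Sally moves it to the cupboard.",
    question="Where will Anne look for her chocolate when she returns?",
    options=["In the drawer", "In the cupboard", "On the table"],
    answer="A",
    task="False Belief Task",
    ability="Belief",
    story_id="FBT-0001",
)
```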

3. Evaluation Protocol and Metrics

ToMBench employs a rigorous, noise-resistant scoring pipeline.

  • Multiple-choice Format: For each story-question pair, $k$ plausible answers are provided (one correct, $k-1$ distractors). Distractors are semantically close, demanding genuine inferential reasoning rather than reliance on obvious cues.
  • Shuffling and Stability: To mitigate positional bias (LLMs tend to favor certain option labels, e.g., “A”), each item is presented to every non-GPT-4 model five times with the option order shuffled, and the modal prediction across these trials is taken as the final answer (a minimal sketch of this scheme follows the list).
  • Human Baseline: Twenty Mandarin-speaking graduate students answer all Chinese items in the same MCQ format, establishing a reference for human-level accuracy.
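The following is a minimal sketch of the shuffle-and-vote protocol described above. The helper `ask_model` is a hypothetical stand-in for the actual LLM call; only the five-trial, modal-vote logic comes from the text.

```python
import random
from collections import Counter
from typing import Callable, List

def evaluate_item(options: List[str], correct_idx: int,
                  ask_model: Callable[[List[str]], int],
                  n_trials: int = 5, seed: int = 0) -> bool:
    """Present one item n_trials times with shuffled option order and take the
    modal prediction, mitigating positional bias as described above."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trials):
        order = list(range(len(options)))
        rng.shuffle(order)                       # randomize which label each option receives
        shuffled = [options[i] for i in order]
        choice = ask_model(shuffled)             # hypothetical LLM call: returns chosen index
        votes.append(order[choice])              # map the choice back to the original option index
    modal_choice = Counter(votes).most_common(1)[0][0]
    return modal_choice == correct_idx
```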

Metric summary:

  • Accuracy (per task): $\mathrm{Acc}(t)=\frac{1}{|D_t|}\sum_{i\in D_t}\mathbf{1}\{\hat y_i=y_i\}$, computed separately for each task $t\in\mathcal{T}$.
  • Macro-averaged ability score: $\mathrm{MacroAcc}_{\mathcal{A}}=\frac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}}\frac{1}{|D_a|}\sum_{i\in D_a}\mathbf{1}\{\hat y_i=y_i\}$, averaging per-ability accuracy uniformly over all abilities $a\in\mathcal{A}$.
  • Coherent-story accuracy: a story counts as correct only if every question attached to it is answered correctly, measuring story-level coherence.
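A minimal sketch of how these three metrics can be computed from per-item results is given below. The record layout (task, ability, story ID, correctness flag) is an assumption made for illustration, not the benchmark's actual data format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (task, ability, story_id, is_correct) -- an assumed layout for illustration
Record = Tuple[str, str, str, bool]

def per_task_accuracy(records: List[Record]) -> Dict[str, float]:
    """Acc(t): fraction of correct answers within each task t."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task, _, _, ok in records:
        totals[task] += 1
        hits[task] += int(ok)
    return {t: hits[t] / totals[t] for t in totals}

def macro_ability_accuracy(records: List[Record]) -> float:
    """MacroAcc_A: unweighted mean of the per-ability accuracies."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, ability, _, ok in records:
        totals[ability] += 1
        hits[ability] += int(ok)
    return sum(hits[a] / totals[a] for a in totals) / len(totals)

def coherent_story_accuracy(records: List[Record]) -> float:
    """A story counts as correct only if all of its questions are answered correctly."""
    by_story = defaultdict(list)
    for _, _, story_id, ok in records:
        by_story[story_id].append(ok)
    return sum(all(oks) for oks in by_story.values()) / len(by_story)
```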

No model is fine-tuned on ToMBench data; all results reflect zero-shot performance (Chen et al., 2024).

4. Dataset Construction and Contamination Management

ToMBench’s data pipeline is specifically constructed to prevent contamination and cultural bias:

  • Authoring: All stories, questions, and distractors are newly written in Chinese by trained ToM psychologists, drawing on but not copying classical paradigms.
  • Translation: English versions are generated using GPT-4 under a constrained prompt, which receives only the story/question text (no answers or distractors) and imposes stylistic and present-tense constraints. Human translators manually vet all English items (a sketch of this step follows the list).
  • Label Separation: Gold keys are kept entirely separate from English item generation, precluding leakage.
  • Quality Control: Two rounds of blind review result in final inter-annotator agreement exceeding 99.4%.
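The sketch below illustrates the constrained translation step under the separation rule described above. `call_gpt4` is a hypothetical helper standing in for whatever API client is actually used, and the prompt wording is an assumption; only its constraints (story/question text only, present tense, no answers or distractors) come from the description.

```python
from typing import Callable, Tuple

# Prompt wording is an assumption; the constraints mirror the pipeline description.
TRANSLATION_PROMPT = (
    "Translate the following Chinese text into natural English. "
    "Use the present tense and preserve the original style. "
    "Do not add, remove, or infer any answer options."
)

def translate_field(text_zh: str, call_gpt4: Callable[[str], str]) -> str:
    # Only story/question text is sent; gold keys and distractors are handled in
    # a separate process so they can never leak into the translation prompt.
    return call_gpt4(f"{TRANSLATION_PROMPT}\n\n{text_zh}")

def translate_item(story_zh: str, question_zh: str,
                   call_gpt4: Callable[[str], str]) -> Tuple[str, str]:
    story_en = translate_field(story_zh, call_gpt4)
    question_en = translate_field(question_zh, call_gpt4)
    # English items are subsequently vetted manually by human translators.
    return story_en, question_en
```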

This process yields an inventory of 2,860 questions over 1,584 stories, with at least 20 test items per ability; the per-task distribution is shown below.

Task Stories Items
UOT 100 300
SIT 100 200
PST 100 100
FBT 100 600
AST 100 200
HT 93 103
SST 201 407
FRT 140 560

All data is treated as test data, with no overlap between benchmark items and LLM pretraining corpora or prior benchmarks (Chen et al., 2024).

5. Experimental Results and Model Analysis

ToMBench’s large-scale evaluation (ten LLMs, including GPT-4, GPT-3.5 Turbo, Qwen-14B-Chat, Mixtral-8x7B, Mistral-7B, Baichuan2-13B, Llama2-13B, and ChatGLM3-6B) produces several key findings:

  • Human vs Model: Humans achieve an overall accuracy of ≈85.4%; the best LLM (GPT-4-1106) achieves 75.3% (Chinese) / 74.0% (English), a gap of roughly 10 percentage points.
  • Per-task Trends: LLMs are strong on Faux-pas Recognition (FRT: 76–90% for GPT-4) and Unexpected Outcome (UOT: 71–77%), but near the chance level on Scalar Implicature (SIT: <50%) and weak on Persuasion (PST: 50–60%). This distribution mirrors documented strengths in emotion attribution and weaknesses in advanced quantifier reasoning and persuasion.
  • Per-ability Trends: Non-literal Communication and Emotion are comparatively robust (65–84% for GPT-4), whereas abilities such as knowledge–pretend play and the percepts→knowledge link are highly deficient (often <20%).
  • Story-level Coherence: Imposing a strict all-correct-per-story criterion drops LLM accuracy by ≥20 points, whereas human performance drops by only 13.6 points, exposing the brittleness of models' scenario tracking.
  • Chain-of-Thought (CoT): Contrary to some prior claims, CoT prompting does not improve ToMBench performance and in some cases slightly degrades it, suggesting a misalignment between LLM decomposition strategies and social-cognitive reasoning.
  • Error Patterns: Analysis reveals that models often attend to salient surface tokens rather than reasoning abstractly about belief or knowledge; for example, they frequently fail in knowledge–pretend play situations by latching onto object names rather than the epistemic states of the characters.

These results indicate a persistent gap between LLMs and human-level ToM, particularly in domains requiring deep inference about others’ knowledge, perceptual access, or non-literal meaning (Chen et al., 2024).

6. Limitations and Future Directions

ToMBench’s current formulation has several acknowledged limitations:

  • Text-only Modality: Tests are strictly textual, excluding visual or multimodal ToM phenomena such as spatial perspective-taking.
  • Task Range: The benchmark, while broad, omits some canonical ToM tasks (e.g., memory-based ToM, real-time updating).
  • Statistical Power: Some ability subcategories contain as few as 20 items, which may limit statistical power for certain fine-grained comparisons.
  • Cultural and Linguistic Generality: Despite careful design, differences between languages and cultural context remain a source of potential bias.

Proposed extensions include multimodal ToM tests (image/video), broader linguistic and cultural coverage, enlarged datasets for under-represented abilities, and exploring novel prompting or fine-tuning methods specifically targeted at social reasoning (Chen et al., 2024).

7. Impact and Broader Significance

ToMBench establishes a new standard for Theory of Mind evaluation in LLMs by integrating rigorous psychological theory, robust engineering, and methodological care around contamination. By showing that even the strongest current LLMs (e.g., GPT-4) fall short of human ToM benchmarks—often by leveraging shallow heuristics or pattern-matching cues rather than genuine perspective-taking—ToMBench provides critical diagnostic infrastructure for future research. The framework supports detailed ablation studies, longitudinal tracking of model advances, and informed development of architectural or data-augmentation interventions to improve social cognition in artificial agents (Chen et al., 2024).

References

Chen et al. (2024). ToMBench: Benchmarking Theory of Mind in Large Language Models.
