BigToM Belief Attribution Scenarios
- BigToM belief attribution scenarios are benchmarks and protocols designed to evaluate models’ capability to assign and track agents’ beliefs in text, symbolic, and multimodal settings.
- They formalize belief states using time-indexed predicates and categorize tasks by zero, finite, or infinite belief history to assess epistemic memory requirements.
- The framework employs advanced probing, activation steering, and hybrid symbolic–Bayesian methods to improve false-belief inference and overall belief tracking in complex scenarios.
BigToM (Big Theory of Mind) belief attribution scenarios define a class of benchmarks and experimental protocols for evaluating and dissecting computational models’ capacity to attribute and operate over others’ mental states, specifically their beliefs, in complex, multi-agent and multi-turn environments. These scenarios span text, symbolic, and multimodal domains, unifying the study of LLMs, symbolic logic engines, and hybrid Bayesian or deep learning systems under a single framework of structured, often intentionally adversarial, belief-manipulation and inference tasks.
1. Formal Taxonomy and Core Frameworks
BigToM scenarios systematically formalize the assignment and inference of belief states across multiple agents. A central construct is the agent-indexed, time-dependent belief predicate $B_{a,t}(p)$, read "agent $a$ believes proposition $p$ at time $t$." Tasks are broadly categorized by required access to prior belief states:
- Zero Belief History (ZBH): $B_{a,t}$ is a function of only the immediate context $c_t$, i.e., $B_{a,t} = f(c_t)$.
- Finite Belief History (FBH): $B_{a,t}$ depends on the last $k$ belief updates, i.e., $B_{a,t} = f(c_t, B_{a,t-1}, \dots, B_{a,t-k})$ for some fixed $k$.
- Infinite Belief History (IBH): $B_{a,t}$ requires unbounded history: for all $k$, there exist scenarios in which the last $k$ belief states are insufficient (e.g., recursive or patterned beliefs over arbitrary depth).
This taxonomy, first delineated in Tang & Belle (2024) (Tang et al., 7 Jun 2024), is agnostic to modality, allowing instantiations in purely linguistic, symbolic, or richly embodied multimodal settings (e.g., object manipulation observed nonverbally in BOSS (Duan et al., 2022)). It provides precise definitions for what makes a belief-attribution scenario “hard” based on epistemic memory requirements, not merely syntactic nesting.
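To make the epistemic memory requirements concrete, the sketch below encodes the three regimes as update functions over a history of belief states. It is illustrative only, under the assumption that beliefs are represented as proposition-to-truth-value maps; the data structures and function names are not part of the cited formalism.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BeliefState:
    agent: str
    time: int
    holds: Dict[str, bool]          # proposition -> believed truth value

def zbh_update(context: Dict[str, bool]) -> Dict[str, bool]:
    """Zero Belief History: B_{a,t} = f(c_t) -- current context only."""
    return dict(context)

def fbh_update(context: Dict[str, bool],
               history: List[BeliefState], k: int) -> Dict[str, bool]:
    """Finite Belief History: B_{a,t} = f(c_t, B_{a,t-1}, ..., B_{a,t-k})."""
    beliefs: Dict[str, bool] = {}
    for state in history[-k:]:      # bounded epistemic memory window
        beliefs.update(state.holds) # carry forward unrevised beliefs
    beliefs.update(context)         # current observations take precedence
    return beliefs

def ibh_update(context: Dict[str, bool],
               history: List[BeliefState]) -> Dict[str, bool]:
    """Infinite Belief History: no fixed k suffices; the full trajectory may matter."""
    beliefs: Dict[str, bool] = {}
    for state in history:           # unbounded epistemic memory
        beliefs.update(state.holds)
    beliefs.update(context)
    return beliefs
```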
2. Benchmark Construction and Scenario Design
Scenario construction for BigToM tasks is governed by principles that induce genuine epistemic separation and complex causal structure among agents. Notable methodologies include:
- ToMATO pipeline (Shinoda et al., 15 Jan 2025): Agents are role-played by LLMs under assigned personality profiles (Big Five vectors), scenario goals, and enforced information asymmetry. System prompts follow the template:

```
Your name is {name}, a {age}-year-old {occupation}.
You are talking with {PartnerName}, a {age}-year-old {occupation}.
Goal: {goal}
Personality: You are {traits}.
At each turn, before you speak, think your {mental-state-type} in one sentence.
Then output: (your thought) "your utterance".
```

Each turn, agents verbalize their inner speech (first- or second-order beliefs), but only their utterance is revealed to the partner, thus generating scenarios with hidden and often false beliefs.
- BigToM “Pick the Right Stuff” (Tang et al., 7 Jun 2024): Agents must track the location of objects/items under repeated, partially observed interventions, requiring the model to explicitly reconstruct which events have been observed by whom and at which time.
- PercepToM methodology (Jung et al., 8 Jul 2024): False-belief questions are constructed such that correct belief attribution entails not only knowing “who saw what” (perception inference) but also possessing inhibitory control—filtering out unperceived context for the belief-holder before answering.
- Formal DEL-based environments (Wu et al., 22 May 2025, Tang et al., 23 Apr 2024): An epistemic model (Kripke structure) is initialized describing all agents’ knowledge, and public/private event models are applied to track epistemic change, enabling arbitrarily deep nested belief queries.
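As an illustration of the bookkeeping such DEL-based environments perform, the sketch below hand-rolls a minimal epistemic model for a Sally–Anne-style scenario. It is not the SMCDEL interface; the class and method names are assumptions introduced here, and nested (second-order) queries are out of scope for this toy model.

```python
class EpistemicModel:
    """Toy stand-in for a DEL model: each agent keeps the set of worlds
    (here, candidate marble locations) it considers possible."""

    def __init__(self, agents, initial_world):
        self.actual = initial_world
        self.possible = {a: {initial_world} for a in agents}

    def private_event(self, new_world, observers):
        """Only observers update their possible-world set; non-observers keep
        an outdated epistemic state, which is how false beliefs arise."""
        self.actual = new_world
        for agent in observers:
            self.possible[agent] = {new_world}

    def believes(self, agent, prop):
        """Agent believes prop iff it holds in every world they consider possible."""
        return all(prop(w) for w in self.possible[agent])


model = EpistemicModel(agents=["sally", "anne"], initial_world="basket")
model.private_event("box", observers={"anne"})       # Anne moves the marble unseen

assert model.believes("anne",  lambda w: w == "box")     # true belief
assert model.believes("sally", lambda w: w == "basket")  # false belief
```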
A distinguishing feature of BigToM design is the careful decoupling of true-belief and false-belief questions, belief order (first, second, or higher), and explicit quantification of “who knows what when.” This precision permits robust evaluation over scenarios that classical ToM probes (e.g., Sally–Anne) address only in trivial form.
3. Model Architectures and Manipulation of Internal Belief Representations
State-of-the-art models for BigToM-style belief attribution fall into three classes:
- Direct LLM Probing and Steering: Zhu et al. (2024) (Zhu et al., 28 Feb 2024) demonstrated that the internal activations of autoregressive LLMs linearly encode both protagonist (other’s) and oracle (model’s own) beliefs. Neural linear probes are trained per self-attention head via logistic regression to decode binary or joint belief states from token activations. Manipulating head activations along these learned directions (e.g., adding a scaled probe-direction vector to the head output at each generation step) can reliably and causally control the model’s ToM performance, doubling false-belief accuracy over baseline ($0.33$ to $0.66$) with only a mild decrement on true-belief cases in Mistral-7B. A minimal probing-and-steering sketch appears after this list.
- Symbolic/DEL Hybrid Execution: ToM-LM (Tang et al., 23 Apr 2024) externalizes deliberative belief reasoning: LLMs are fine-tuned to map NL narratives to dynamic-epistemic-logic scripts. These scripts specify initial observations, sequential public/private events, and belief queries, which are executed by SMCDEL—a DEL model checker—to provide transparent and verifiable belief attributions, supporting arbitrarily deep nesting and structured failure diagnosis.
- Bayesian/Probabilistic Planning: Scalable Bayesian ToM planners (Zhang et al., 2 Jun 2025) decompose belief tracking into stepwise Bayesian updates of the form $P(b_t \mid o_{1:t}) \propto P(o_t \mid b_t)\, P(b_t \mid o_{1:t-1})$, where $b_t$ is the attributed belief state and $o_t$ the observation at step $t$.
Weak-to-strong control fuses a small LM (trained on ToM planning) with a larger, generalist LM, modulating likelihood estimation by a policy ratio. This enables large models to integrate ToM likelihoods with world/social knowledge, yielding improved generalization to multimodal scenes and challenging unseen situations.
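The probing-and-steering recipe referenced above can be sketched as follows. The activations are synthetic, the probe is a plain scikit-learn logistic regression, and the steering scale `alpha` and hook placement are illustrative assumptions rather than the settings used by Zhu et al.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for per-head activations collected at the answer token:
# X has shape (n_examples, head_dim); y encodes the protagonist's belief
# (1 = believes the true world state, 0 = holds a false belief).
head_dim, n = 64, 500
belief_signal = rng.normal(size=head_dim)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, head_dim)) + np.outer(y - 0.5, belief_signal)

# 1) Linear probe: logistic regression decodes the belief state from activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# 2) Steering: push activations along the probe's learned belief direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
alpha = 4.0                                   # steering strength (assumed)

def steer(head_activation: np.ndarray) -> np.ndarray:
    """Add a scaled belief-direction vector to a head's output activation."""
    return head_activation + alpha * direction

# In a real setup this function would be registered as a forward hook on the
# selected attention heads so that every generated token is steered.
steered = steer(X[0])
```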
4. Experimental Protocols and Key Empirical Findings
Evaluations in BigToM settings are characterized by controlled accuracy splits—true-belief (TB), false-belief (FB), first- and second-order, and traced epistemic trajectories. Notable benchmarks and results include:
| Model/System | Benchmark | TB/FB Accuracies | Notable Findings |
|---|---|---|---|
| Mistral-7B | BigToM (Zhu et al. 2024) | TB 0.95, FB 0.33→0.66 | Intervention steers FB up with mild TB drop |
| Gemma-3-4B | BigToM, CAA-steered (Chulo et al., 19 Nov 2025) | Baseline 32.5%, Steered 46.7% (FB) | Steering shifts model from analytical to emotional processing |
| ToM-LM + SMCDEL | MindGames, BigToM (Tang et al., 23 Apr 2024) | 88–91% | Symbolic executions enable high performance |
| PercepToM | Percept-ToMi/FANToM (Jung et al., 8 Jul 2024) | up to 1.00 (TB), 0.566 (FB) | Explicit context filtering critical for FB |
| Bayesian ToM Planner | MMToM-QA, VirtualHome (Zhang et al., 2 Jun 2025) | 81.3% | Weak-to-strong control boosts SOTA accuracy |
Key observations:
- LLMs struggle with false-belief and higher-order attribution unless provided with explicit internal-state intervention, explicit symbolic reasoning, or engineered inhibitory control.
- Humans maintain TB/FB accuracy near 85–95% (ToMATO), while LLMs lag, especially on FB (e.g., GPT-4o mini: first-order ≈76%, second-order FB ≈60%) (Shinoda et al., 15 Jan 2025).
- Increasing model scale does not guarantee better FB performance, particularly in tasks demanding FBH or IBH memory (Tang et al., 7 Jun 2024).
5. Methodological Innovations: Probing, Steering, and Trace Verification
Multiple novel analysis and enhancement methods have been introduced in the BigToM research agenda:
- Linear Probes and Neural Activation Steering:
Linear probes for belief states identify heads and layers encoding ToM-relevant information. Activation steering (e.g., along +T joint belief direction) leads to interpretable, causally effective model behavior shifts (Zhu et al., 28 Feb 2024).
Comparison of baseline versus CAA-steered activations during BigToM tasks reveals that improvements stem from up-regulation of emotional perception and down-regulation of analytical (questioning, convergent thinking) activations. High emotional shift (Δμ) is strongly correlated (r = 0.82) with gains in belief attribution (Chulo et al., 19 Nov 2025).
- Trace Selection via Process Belief Model (PBM) (Wu et al., 22 May 2025):
Candidate belief update traces from LLMs are scored stepwise via a verifier trained on DEL-generated ground truth. Weighted best-of-N trace selection, particularly with aggregation rules “min” or “prod”, yields substantial improvements in average-order accuracy for SLMs without retraining the base model.
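A compact sketch of weighted best-of-N selection under the “min” and “prod” aggregation rules follows. The verifier here is a stub, and the function names are assumptions for illustration, not the PBM implementation.

```python
import math
from typing import Callable, List, Sequence

Trace = List[str]          # a candidate sequence of belief-update steps

def aggregate(step_scores: Sequence[float], rule: str) -> float:
    """Collapse per-step verifier scores into a single trace score."""
    if rule == "min":
        return min(step_scores)
    if rule == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation rule: {rule}")

def best_of_n(traces: List[Trace],
              step_verifier: Callable[[str], float],
              rule: str = "prod") -> Trace:
    """Score each candidate trace stepwise with the process verifier and
    return the highest-scoring one (weighted best-of-N selection)."""
    scored = [(aggregate([step_verifier(step) for step in trace], rule), trace)
              for trace in traces]
    return max(scored, key=lambda pair: pair[0])[1]

# Usage with a stub verifier that rewards steps grounded in observation checks.
stub_verifier = lambda step: 0.9 if "observed" in step else 0.4
candidates = [
    ["Anne observed the move", "Sally did not observe the move",
     "Sally believes the marble is in the basket"],
    ["Both agents know the location", "Sally believes the marble is in the box"],
]
print(best_of_n(candidates, stub_verifier, rule="min"))
```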
6. Multimodal and Real-World Extensions
BigToM is not confined to text or symbolic logic. Benchmarks such as BOSS (Duan et al., 2022) and MMToM-QA (Zhang et al., 2 Jun 2025) extend the paradigm into visual and action-based environments:
- BOSS: Multimodal video datasets with dense, framewise belief state annotation, combining gaze, gesture, pose, and object–context relations. Integration into hybrid Bayesian and deep neural architectures enables the extraction and prediction of belief states from nonverbal signals.
- MMToM-QA/VirtualHome: Bayesian planners leverage both symbolic state representations from multimodal input (video + narration) and weak-to-strong LM control for complex, multi-agent belief inference.
These advances establish that belief attribution in the wild demands coordinated reasoning over observation, perspective, action, and history, often requiring memory-efficient implementations of FBH and, prospectively, IBH scenarios.
7. Current Limitations and Future Directions
BigToM-style benchmarks reveal persistent limitations in contemporary LLMs and hybrid systems:
- Inadequate inhibitory control and context gating in standard LLM inference protocols (Jung et al., 8 Jul 2024).
- Difficulty scaling to infinite belief history regimes—no public implementation yet operationalizes IBH in the full sense (Tang et al., 7 Jun 2024).
- Symbolic reasoning engines (e.g., SMCDEL) face combinatorial blow-up with deep belief nesting, although practical efficiency for up to third-order beliefs is demonstrated (Tang et al., 23 Apr 2024).
Future work is oriented towards scalable architectures that unify the flexible perception and action grounding of deep models with the explicit, verifiable epistemic reasoning of symbolic frameworks. Benchmark designs increasingly emphasize dynamic, multi-modal, and high-order belief-tracking scenarios with latent or ambiguous world and mental state structure. The development of robust, process-verifying trace selection (e.g., PBM), activation-based control, and Bayesian update frameworks is anticipated to become standard for next-generation BigToM reasoning (Zhang et al., 2 Jun 2025, Wu et al., 22 May 2025, Zhu et al., 28 Feb 2024).