Mutual Theory of Mind Modelling
- Mutual Theory of Mind Modelling is a framework that enables human and AI agents to recursively infer and update each other’s mental states for dynamic collaboration.
- It employs Bayesian methods, neural architectures, and reinforcement learning to integrate belief modeling and mutual adaptation across interactions.
- Empirical findings show that explicit MToM integration improves prediction accuracy, task efficiency, and user trust in adaptive human-AI systems.
Mutual Theory of Mind (MToM) modelling formalizes the bidirectional process by which two or more agents—particularly humans and artificial agents—dynamically infer, represent, and update models of each other's mental states (beliefs, desires, intentions, competencies, and likely actions). In contrast to unilateral models, MToM emphasizes recursive, co-adaptive construction of mental-state representations, supporting robust, fluid, and context-aware collaboration. MToM frameworks span conceptual HCI-centric scaffolding, Bayesian cognitive modelling, neural architectures, reinforcement learning, and embodied robotics, each contributing technical mechanisms for recursive inference and mutual adaptation between interacting agents (Weisz et al., 2024).
1. Formal Definitions and Conceptual Foundations
MToM emerges from the classical Theory of Mind (ToM), which posits that an agent maintains a model of the mental states of another. MToM generalizes ToM by requiring both parties to actively build, interrogate, and update models about each other, yielding a bidirectional, dynamically-coupled process.
Let $H$ denote a human and $A$ an AI agent. The MToM state at time $t$ may be expressed as:

$\mathrm{MToM}(t) = \big( M_{H}^{A}(t),\; M_{A}^{H}(t),\; U_{H},\; U_{A} \big)$

where $M_{H}^{A}(t)$ is the human's model of the AI (its beliefs, strengths, limitations, and intentions), $M_{A}^{H}(t)$ is the AI's model of the human (beliefs, preferences, skills, and context), and $U_{H}, U_{A}$ are the respective update operations by which each agent revises its model in response to ongoing interactions (Weisz et al., 2024).
Formally, following interactive POMDP traditions, each agent $i$ may maintain $k$-th-order beliefs,

$b_{i}^{k} = P\big(s,\; h_{i},\; b_{j}^{k-1}\big),$

where $s$ is the environment state, $h_{i}$ is the agent's own history, and $b_{j}^{k-1}$ is the $(k-1)$-th-order belief of the other agent $j$ (Yin et al., 3 Oct 2025).
The recursive update ensures that both agents reason not just about the world, but about how the other models the world and, recursively, each other.
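As a concrete illustration, the coupled update loop can be sketched as a pair of discrete Bayesian filters, one per direction of modelling: the human's model of the AI and the AI's model of the human, each revised by its own update operation. This is a minimal sketch under strong simplifying assumptions (first-order beliefs over small discrete hypothesis sets); all names (`MToMState`, `bayes_update`, the hypothesis labels) are illustrative, not taken from the cited frameworks.

```python
from dataclasses import dataclass

def normalize(d):
    """Renormalize a discrete distribution stored as a dict."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def bayes_update(belief, likelihood):
    """Update operation U: revise a discrete belief P(x) given P(obs | x)."""
    return normalize({x: p * likelihood.get(x, 1e-9) for x, p in belief.items()})

@dataclass
class MToMState:
    m_h_a: dict  # human's model of the AI's intent, P(intent)
    m_a_h: dict  # AI's model of the human's goal, P(goal)

    def step(self, obs_of_ai, obs_of_human):
        """Apply both update operations for one interaction round."""
        self.m_h_a = bayes_update(self.m_h_a, obs_of_ai)
        self.m_a_h = bayes_update(self.m_a_h, obs_of_human)

state = MToMState(
    m_h_a={"assist": 0.5, "observe": 0.5},
    m_a_h={"build": 0.5, "explore": 0.5},
)
# One round: the AI acted helpfully; the human moved toward the build site.
state.step(obs_of_ai={"assist": 0.9, "observe": 0.1},
           obs_of_human={"build": 0.8, "explore": 0.2})
```

After one round both posteriors sharpen toward the observed behavior; iterating `step` over an interaction history yields the dynamically coupled trajectory of mutual models described above.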
2. Computational Approaches: Architectures and Mechanisms
MToM modelling is realized across several computational approaches, unifying memory, prediction, simulation, and proactive adaptation:
- Latent Bayesian Modelling: Agents maintain belief hierarchies over the world and each other's beliefs. Update rules typically follow Bayesian filtering or variational inference, integrating observed signals, partner actions, and recursively embedded models (Pitliya et al., 1 Aug 2025, Yin et al., 3 Oct 2025).
- Hybrid Neural Architectures: Architectures such as MToMnet deploy separate per-agent recurrent modules (MindNets) with cross-agent connections that fuse, communicate, or align latent states, enabling triadic inference over dynamic multimodal scenes. Mutual models are operationalized via re-ranking, state exchange, or shared ‘common ground’ vectors (Bortoletto et al., 2024).
- BDI Hierarchies in Embodied Agents: Robot-centric pipelines (e.g., MindPower) sequence multimodal perception, layered belief/desire/intention reasoning for both self and other, and decision/action generation, typically implemented as iterative sequence-to-sequence prompting or structured latent flows within vision-LLMs. Explicit textual intermediate outputs foster auditability (Zhang et al., 28 Nov 2025).
- RL with Intrinsic Mutual ToM Rewards: Multi-agent reinforcement learning may couple explicit belief prediction (first- and second-order) with policy learning, using cross-agent prediction error as an intrinsic reward signal. Policies and belief-encoders are often parameterized by deep neural architectures with mutual information constraints on latent representations (Oguntola et al., 2023).
- Multi-Agent Inverse Planning (LLM-Amortized): Joint inference over nested mental-state hypotheses is performed by amortized Bayesian inverse planning, with LLM components serving as policy or likelihood approximators to reason over multimodal (video, text, dialogue) trajectories (Shi et al., 2024, Jin et al., 2024).
The implementation of full mutuality—i.e., nested $k$-th-order beliefs for arbitrary $k$—remains computationally challenging; most practical systems truncate at second order or employ approximation via fused latent representations.
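The intrinsic-reward coupling used in the RL approach above can be sketched as follows. The error measure (a total-variation-style distance), the bonus form, and the weighting `beta` are illustrative assumptions for this sketch, not the published scheme of Oguntola et al. (2023).

```python
def l1_error(pred, true):
    """Total-variation-style error between two discrete distributions in [0, 1]."""
    keys = set(pred) | set(true)
    return 0.5 * sum(abs(pred.get(k, 0.0) - true.get(k, 0.0)) for k in keys)

def intrinsic_mtom_reward(env_reward, pred_partner_belief, partner_belief,
                          pred_partner_pred_of_self, own_belief, beta=0.5):
    """Shaped reward: task reward plus bonuses for accurate first-order
    (partner's belief) and second-order (partner's model of me) prediction;
    each bonus is 1 minus the prediction error."""
    first_order = 1.0 - l1_error(pred_partner_belief, partner_belief)
    second_order = 1.0 - l1_error(pred_partner_pred_of_self, own_belief)
    return env_reward + beta * (first_order + second_order)

# Example: perfect mutual prediction with task reward 1.0
r = intrinsic_mtom_reward(1.0, {"a": 0.7, "b": 0.3}, {"a": 0.7, "b": 0.3},
                          {"x": 1.0}, {"x": 1.0})  # -> 2.0
```

In an actual training loop the two prediction targets would come from learned belief-encoders rather than ground truth, and the bonus would be added to the environment reward at each step before policy-gradient updates.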
3. Diagnostics, Metrics, and Evaluation Benchmarks
MToM systems are evaluated on diverse axes, reflecting both agent-centric prediction quality and emergent system-level properties:
- Belief prediction error: Accuracy in identifying the actual (possibly private or false) belief held by each agent, including higher-order (e.g., “A’s model of B’s belief”) false beliefs, measured against annotated ground-truth in multimodal datasets (Bortoletto et al., 2024, Jin et al., 2024, Shi et al., 2024).
- Joint behavioral outcomes: System-level metrics such as task completion time, communicative efficiency, error recovery rate, and collaboration fluency in assembly, dialogue, or foraging tasks (Yin et al., 3 Oct 2025, Qiu et al., 2023).
- Mutual adaptation and breakdown frequency: Frequency with which mutual models diverge (leading to misunderstanding or mis-coordination) versus align (yielding high-throughput, low-friction collaboration), often established via narrative analysis of breakdowns and repair strategies (Weisz et al., 2024).
- Interpretable latent structure: Linear decodability of mental-state information from neural activations (e.g., LLM attention heads), separation of perspective-specific clusters, or probe-driven testing of model introspection (Li et al., 17 Jun 2025).
- User-centric trust and calibration: Self-reported and behavioral indices of trust, anthropomorphism, and error response in human participants interacting with MToM-enabled agents (Wang et al., 2022).
Benchmark suites incorporate complex, grounded, and often multimodal scenarios. Examples include BOSS and TBD for nonverbal dyadic inference (Bortoletto et al., 2024); GridToM for perspective-sensitive belief reasoning in MLLMs (Li et al., 17 Jun 2025); and MuMA-ToM for multi-agent, multi-modal mental-state inference (Shi et al., 2024).
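A belief-prediction metric of the kind these benchmarks report might be computed as in the hedged sketch below; the record fields (`pred`, `gold`, `false_belief`) and the split into overall vs false-belief accuracy are assumptions about the annotation format, not any one benchmark's schema.

```python
def belief_prediction_accuracy(examples):
    """Score predicted mental-state labels against annotated ground truth.
    examples: list of dicts with 'pred', 'gold', and a 'false_belief' flag.
    Returns overall accuracy plus accuracy on false-belief episodes only."""
    def acc(subset):
        return (sum(e["pred"] == e["gold"] for e in subset) / len(subset)
                if subset else float("nan"))
    return {
        "overall": acc(examples),
        "false_belief": acc([e for e in examples if e["false_belief"]]),
    }

# Toy evaluation set: 4 episodes, 2 of which involve a false belief.
examples = [
    {"pred": "box", "gold": "box", "false_belief": False},
    {"pred": "basket", "gold": "basket", "false_belief": True},
    {"pred": "box", "gold": "basket", "false_belief": True},
    {"pred": "box", "gold": "box", "false_belief": False},
]
scores = belief_prediction_accuracy(examples)  # overall 0.75, false-belief 0.5
```

Reporting the false-belief slice separately matters because models can score well overall while failing precisely on episodes where an agent's belief diverges from reality.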
4. Empirical Findings and Technical Insights
Research demonstrates that mutuality and explicit higher-order modelling yield concrete gains in both predictive accuracy and interactional quality:
- Explicit MToM integration (latent fusion, re-ranking, or common ground) can boost belief prediction and false-belief detection accuracy by 18% (BOSS) and achieve twofold gains on challenging “common mind” channels versus non-mutual baselines (Bortoletto et al., 2024).
- In multimodal multi-agent scenarios (MuMA-ToM), LLM-based inverse-planning methods outperform state-of-the-art vision-LLMs by >20 percentage points, especially on tasks that require reasoning about others’ beliefs about goals (social-goal and belief-of-goal inference) (Shi et al., 2024).
- In RL domains, intrinsic rewards based on second-order belief prediction yield ∼25% higher team performance and more stable collaborative conventions compared with policies trained without mutual ToM objectives (Oguntola et al., 2023, Freire et al., 2019).
- Proactively disclosing and calibrating the agent’s model of the user (second-order transparency) supports error recovery and restores user trust after mispredictions, particularly when coupled with adaptive, rationale-aligned feedback (Wang et al., 2022, Weisz et al., 2024).
- Attentional analyses in LLMs reveal the emergence of linearly-separable, agent-specific belief directions in neural space—enabling training-free mutual ToM interventions by targeted activation steering (Li et al., 17 Jun 2025).
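The linear-decodability diagnostic can be illustrated with a least-squares probe on synthetic activations standing in for LLM attention-head outputs; the planted "belief direction", data shapes, and noise model are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
belief = rng.integers(0, 2, size=n)        # ground-truth belief bit per example
direction = rng.normal(size=d)             # planted linear "belief direction"
# Activations: isotropic noise plus a signed shift along the belief direction.
acts = rng.normal(size=(n, d)) + np.outer(2 * belief - 1, direction)

# Fit a linear probe (with bias) by least squares and decode the belief bit.
X = np.hstack([acts, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(X, 2.0 * belief - 1.0, rcond=None)
pred = (X @ w > 0).astype(int)
probe_acc = (pred == belief).mean()
```

High probe accuracy on held-in data like this is the weakest form of the evidence cited above; the referenced work additionally checks cluster separation by perspective and shows that steering activations along such directions changes belief attributions.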
5. Utopian Potentials, Dystopian Failures, and Design Principles
MToM architectures, if successfully operationalized, promise synergistic outcomes: increased human focus and flow, context-sensitive delegation, seamless cross-agent referrals, and strategic alignment across organizational routines (Weisz et al., 2024). However, design fictions and empirical analysis highlight severe risks from degraded or asymmetric mutual models:
- Model misalignment or mis-handoff: Fragmented models (as in multi-bot settings lacking protocol for model transfer) precipitate loss of context, policy errors, and catastrophic system-level failures (e.g., wrongful HR actions) (Weisz et al., 2024).
- Overapplication and over-trust: Overgeneralizing an agent’s MToM scope leads to erroneous high-stakes delegation, misrepresentation, and legal/ethical breakdowns (Weisz et al., 2024).
- Lack of transparency in boundary conditions: Without clear signifiers and active user interrogation interfaces, drift in mutual models remains undetected, undermining repairability and user trust (Weisz et al., 2024, Wang et al., 2022).
Fundamental design principles for robust MToM systems include: explicit representation of both parties’ models and second-order beliefs, user query and correction affordances, clear provenance and actor signification in multi-agent systems, and comprehensive explanations of both action and rationale. Instrumenting lightweight, real-time diagnostic tools for ontology drift, alignment, and mutual calibration remains an open technical imperative.
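A lightweight drift diagnostic of the kind called for above could compare, for example, the AI's model of the user against the user's self-report with a bounded divergence and raise an alarm past a threshold. This is a minimal sketch; the choice of Jensen-Shannon divergence and the threshold value are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence: 0 for identical models, ln 2 at most."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_alarm(model_of_user, user_self_report, threshold=0.1):
    """True when the mutual models have diverged beyond the threshold."""
    return js_divergence(model_of_user, user_self_report) > threshold
```

Run periodically over matched belief distributions, such a monitor gives both parties a signifier that their mutual models need repair before mis-coordination compounds.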
6. Challenges, Limitations, and Future Research Directions
Despite empirical advances, full MToM operationalization remains limited by tractability, representation, and evaluation constraints:
- Scalability: Combinatorial explosion in higher-order belief tracking demands architectural truncation (typically at second order) or lossy fusion (Oguntola et al., 2023, Bortoletto et al., 2024).
- Interpretable updates: Formalization and auditability of update maps that enable human correction and oversight are not standardized (Weisz et al., 2024, Wang et al., 2022).
- Generalization and transfer: Most architectures assume stationary user preferences and limited domain transfer. Model misspecification in dynamic, cross-task contexts challenges robust mutual adaptation (Pitliya et al., 1 Aug 2025).
- Concrete empirical validation: Many proposals remain conceptual or demonstrated in simulation; in-the-wild, multi-modal, multi-agent deployments with rigorous outcome metrics are largely absent (Weisz et al., 2024, Zhang et al., 28 Nov 2025).
Directions for future work encompass integration into real-time, multi-agent and embodied systems; development of shared protocols for model transfer and negotiation; incorporation of emotional, preference, and moral reasoning into the latent state space; and principled metrics for degree and quality of mutual understanding. Bridging formal cognitive models with scalable neural policies and enhancing interpretability and controllability of recursive MToM updates in large-scale language-model-based agents are critical areas for ongoing research.