Introduction to Multimodal Theory of Mind Benchmarking
Theory of Mind (ToM) is the ability to attribute mental states, such as beliefs, desires, and goals, to others, which lets us predict and explain their behavior. In the pursuit of socially intelligent AI, considerable effort has gone into evaluating machine ToM with a variety of benchmarks. To date, these assessments have relied predominantly on unimodal datasets, restricted to either video or text. In real-world interactions, however, humans draw on both visual and linguistic information to infer others' mental states. To bridge this gap, a comprehensive multimodal Theory of Mind question-answering benchmark, MMToM-QA, has been developed.
Evaluating ToM in AI
The MMToM-QA benchmark evaluates machine ToM from both video and text modalities, testing human-like reasoning about another person's beliefs, goals, and plans within household scenarios. Alongside the benchmark, the authors propose a model that combines Bayesian inverse planning, traditionally applied to observed action sequences such as those in video, with the flexible reasoning of LLMs to interpret multimodal data. Together, the benchmark and the model offer a way to appraise machine ToM capabilities against human performance, with particular attention to multifaceted mental-state problems such as tracking beliefs over time and inferring goals under differing belief conditions.
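To make the inverse-planning component concrete, the sketch below shows the core idea in a minimal form: each candidate mental state (a goal paired with a belief) is treated as a hypothesis, and observed actions update a posterior over those hypotheses under the assumption that the person acts roughly rationally. Everything here, including the `Hypothesis` fields, the toy `action_likelihood`, and the example numbers, is an illustrative assumption rather than the benchmark's actual code.

```python
# Minimal sketch of Bayesian inverse planning over mental-state hypotheses.
# All names and numbers are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hypothesis:
    goal: str    # e.g. "get the cupcake"
    belief: str  # e.g. "thinks the cupcake is in the fridge"


def action_likelihood(action: str, hyp: Hypothesis) -> float:
    """Stand-in for P(action | goal, belief): a rational agent is more likely
    to take actions consistent with its hypothesized goal and belief."""
    consistent = hyp.belief.split()[-1] in action  # toy consistency check
    return 0.8 if consistent else 0.2


def posterior_over_hypotheses(actions: List[str],
                              hypotheses: List[Hypothesis]) -> Dict[Hypothesis, float]:
    """P(hypothesis | actions) proportional to P(hypothesis) * product_t P(action_t | hypothesis)."""
    scores = {}
    for hyp in hypotheses:
        prob = 1.0 / len(hypotheses)  # uniform prior over hypotheses
        for act in actions:
            prob *= action_likelihood(act, hyp)
        scores[hyp] = prob
    total = sum(scores.values())
    return {h: p / total for h, p in scores.items()}


if __name__ == "__main__":
    observed = ["walk to fridge", "open fridge"]
    hyps = [Hypothesis("get the cupcake", "thinks the cupcake is in the fridge"),
            Hypothesis("get the cupcake", "thinks the cupcake is in the cabinet")]
    for hyp, p in posterior_over_hypotheses(observed, hyps).items():
        print(f"{hyp.belief}: {p:.2f}")
```

In this toy run, the belief "the cupcake is in the fridge" receives most of the posterior mass because both observed actions are consistent with it; the same update rule is what lets an inverse-planning model track a belief hypothesis as evidence accumulates.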
The Multimodal Framework
MMToM-QA pairs videos of a person acting in a household environment with textual descriptions of the scene and the person's actions, accompanied by questions about that person's mental states. Answering correctly requires integrating information from both modalities. Models can be fine-tuned and evaluated using a training set with ground-truth annotations, allowing a detailed comparison between machine-generated and human responses. In addition, procedural generation of synthetic human activity data makes the benchmark scalable and keeps evaluation of new AI models fast.
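As a rough illustration, a single benchmark item could be represented along the following lines; the field names and values are assumptions made for this sketch, not the benchmark's actual schema.

```python
# Hypothetical representation of one MMToM-QA-style item; field names and
# values are illustrative assumptions only.
from dataclasses import dataclass
from typing import List


@dataclass
class ToMQuestion:
    question: str       # a query about the person's belief or goal
    choices: List[str]  # candidate answers to choose between
    answer: str         # ground-truth choice, assumed available for training items


@dataclass
class MMToMExample:
    video_frames: List[str]   # paths to RGB frames of the household activity
    scene_description: str    # text describing rooms, objects, and their locations
    action_description: str   # text describing the person's observed actions
    question: ToMQuestion


example = MMToMExample(
    video_frames=["frame_0001.png", "frame_0002.png"],
    scene_description="The apartment has a kitchen with a fridge and a cabinet.",
    action_description="The person walks to the kitchen and opens the fridge.",
    question=ToMQuestion(
        question="Which belief is more likely?",
        choices=["The person thinks the cupcake is in the fridge.",
                 "The person thinks the cupcake is in the cabinet."],
        answer="The person thinks the cupcake is in the fridge.",
    ),
)
```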
Insights and Implications
While established LLMs and multimodal models show limited ToM reasoning on the benchmark, the proposed multimodal ToM model, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models), performs markedly better by combining robust Bayesian inverse planning with the versatile reasoning abilities of LLMs. The approach not only interprets observed actions in light of hypothesized mental states but also brings machine judgments closer to human ones. Together, MMToM-QA and BIP-ALM underscore the need for multimodal understanding in social intelligence and suggest that machine ToM benefits greatly from a more nuanced, hybridized approach. They mark a significant step forward in ToM research, with the potential to inform future AI development across applications and to pave the way for more socially aware artificial agents.
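The sketch below illustrates, under simplifying assumptions, how such a hybrid could be wired together: video and text are first fused into a single symbolic state-action trajectory, and a language model is then used to score how likely each observed action is under a hypothesized goal and belief, with the highest-scoring hypothesis chosen as the answer. The `lm_log_prob` function is a hypothetical stand-in (a word-overlap heuristic so the example runs without a model); a real system would query an actual language model, and none of this code is taken from the BIP-ALM implementation.

```python
# Hybrid sketch: fuse modalities into a symbolic trajectory, then score
# mental-state hypotheses with an LM-style likelihood. Illustrative only.
import math
import re
from typing import List, Tuple


def fuse_modalities(visual_states: List[str],
                    text_actions: List[str]) -> List[Tuple[str, str]]:
    """Align per-step visual state descriptions with textual action
    descriptions into a single symbolic (state, action) trajectory."""
    return list(zip(visual_states, text_actions))


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def lm_log_prob(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for log P(continuation | prompt) from a language
    model; a crude word-overlap heuristic keeps the sketch runnable without one."""
    overlap = len(_tokens(prompt) & _tokens(continuation))
    return math.log(0.5 + overlap)


def score_hypothesis(trajectory: List[Tuple[str, str]],
                     goal: str, belief: str) -> float:
    """Accumulate log P(action | goal, belief, state) over the trajectory,
    mirroring the likelihood term of Bayesian inverse planning."""
    total = 0.0
    for state, action in trajectory:
        prompt = f"Goal: {goal}. Belief: {belief}. State: {state}. Next action:"
        total += lm_log_prob(prompt, action)
    return total


def pick_answer(trajectory: List[Tuple[str, str]],
                hypotheses: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (goal, belief) hypothesis with the highest action likelihood."""
    return max(hypotheses, key=lambda gb: score_hypothesis(trajectory, *gb))


if __name__ == "__main__":
    trajectory = fuse_modalities(
        ["the person is standing in the kitchen"],  # from video frames
        ["walk to the fridge and open it"],         # from the text description
    )
    hypotheses = [("get the cupcake", "the cupcake is inside the fridge"),
                  ("get the cupcake", "the cupcake is inside the cabinet")]
    goal, belief = pick_answer(trajectory, hypotheses)
    print("More plausible belief:", belief)
```

Keeping the hypothesis scoring explicit in this way is what allows a model-based approach to track how a belief or goal hypothesis becomes more or less plausible as new actions are observed, rather than relying on a single end-to-end prediction.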