Introduction to Multimodal Theory of Mind Benchmarking
Theory of Mind (ToM) is the ability to attribute mental states, such as beliefs, desires, and goals, to others, which lets us predict and explain their behavior. In the pursuit of socially intelligent AI, considerable effort has gone into evaluating machine ToM with a variety of benchmarks. To date, these assessments have relied predominantly on unimodal datasets, restricted to either video or text. In real-world interactions, however, humans draw on both visual and linguistic information to infer others' mental states. To bridge this gap, a comprehensive multimodal Theory of Mind question-answering benchmark, MMToM-QA, has been developed.
Evaluating ToM in AI
The MMToM-QA benchmark evaluates machine ToM from both video and text modalities, testing human-like reasoning about another person's beliefs, goals, and plans within household scenarios. Alongside the benchmark, the authors propose a model that combines Bayesian inverse planning, traditionally applied to observed action sequences such as those in video, with the flexible reasoning of LLMs to interpret multimodal data. Together, the benchmark and the model offer a way to appraise machine ToM capabilities against human performance, with particular attention to multifaceted mental-state problems such as tracking beliefs over time and inferring goals under differing belief conditions.
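To make the inverse-planning component concrete, the sketch below shows the core idea in a minimal form: each candidate mental state (a goal paired with a belief) is treated as a hypothesis, and observed actions update a posterior over those hypotheses under the assumption that the person acts roughly rationally. Everything here, including the `Hypothesis` fields, the toy `action_likelihood`, and the example numbers, is an illustrative assumption rather than the benchmark's actual code.

```python
# Minimal sketch of Bayesian inverse planning over mental-state hypotheses.
# All names and numbers are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hypothesis:
    goal: str    # e.g. "get the cupcake"
    belief: str  # e.g. "thinks the cupcake is in the fridge"


def action_likelihood(action: str, hyp: Hypothesis) -> float:
    """Stand-in for P(action | goal, belief): a rational agent is more likely
    to take actions consistent with its hypothesized goal and belief."""
    consistent = hyp.belief.split()[-1] in action  # toy consistency check
    return 0.8 if consistent else 0.2


def posterior_over_hypotheses(actions: List[str],
                              hypotheses: List[Hypothesis]) -> Dict[Hypothesis, float]:
    """P(hypothesis | actions) proportional to P(hypothesis) * product_t P(action_t | hypothesis)."""
    scores = {}
    for hyp in hypotheses:
        prob = 1.0 / len(hypotheses)  # uniform prior over hypotheses
        for act in actions:
            prob *= action_likelihood(act, hyp)
        scores[hyp] = prob
    total = sum(scores.values())
    return {h: p / total for h, p in scores.items()}


if __name__ == "__main__":
    observed = ["walk to fridge", "open fridge"]
    hyps = [Hypothesis("get the cupcake", "thinks the cupcake is in the fridge"),
            Hypothesis("get the cupcake", "thinks the cupcake is in the cabinet")]
    for hyp, p in posterior_over_hypotheses(observed, hyps).items():
        print(f"{hyp.belief}: {p:.2f}")
```

In this toy run, the belief "the cupcake is in the fridge" receives most of the posterior mass because both observed actions are consistent with it; the same update rule is what lets an inverse-planning model track a belief hypothesis as evidence accumulates.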
The Multimodal Framework
MMToM-QA pairs videos of a person acting in a household environment with textual descriptions of the scene and the person's actions, accompanied by questions about that person's mental states. Answering correctly requires integrating information from both modalities. Models can be fine-tuned and evaluated using a training set with ground-truth annotations, allowing a detailed comparison between machine-generated and human responses. In addition, procedural generation of synthetic human activity data makes the benchmark scalable and keeps evaluation of new AI models fast.
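As a rough illustration, a single benchmark item could be represented along the following lines; the field names and values are assumptions made for this sketch, not the benchmark's actual schema.

```python
# Hypothetical representation of one MMToM-QA-style item; field names and
# values are illustrative assumptions only.
from dataclasses import dataclass
from typing import List


@dataclass
class ToMQuestion:
    question: str       # a query about the person's belief or goal
    choices: List[str]  # candidate answers to choose between
    answer: str         # ground-truth choice, assumed available for training items


@dataclass
class MMToMExample:
    video_frames: List[str]   # paths to RGB frames of the household activity
    scene_description: str    # text describing rooms, objects, and their locations
    action_description: str   # text describing the person's observed actions
    question: ToMQuestion


example = MMToMExample(
    video_frames=["frame_0001.png", "frame_0002.png"],
    scene_description="The apartment has a kitchen with a fridge and a cabinet.",
    action_description="The person walks to the kitchen and opens the fridge.",
    question=ToMQuestion(
        question="Which belief is more likely?",
        choices=["The person thinks the cupcake is in the fridge.",
                 "The person thinks the cupcake is in the cabinet."],
        answer="The person thinks the cupcake is in the fridge.",
    ),
)
```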
Insights and Implications
While established LLMs and multimodal models show limited ToM reasoning on the benchmark, the proposed multimodal ToM model, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models), performs markedly better by combining robust Bayesian inverse planning with the versatile reasoning abilities of LLMs. The approach not only interprets observed actions in light of hypothesized mental states but also brings machine judgments closer to human ones. Together, MMToM-QA and BIP-ALM underscore the need for multimodal understanding in social intelligence and suggest that machine ToM benefits greatly from a more nuanced, hybridized approach. They mark a significant step forward in ToM research, with the potential to inform future AI development across applications and to pave the way for more socially aware artificial agents.
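The sketch below illustrates, under simplifying assumptions, how such a hybrid could be wired together: video and text are first fused into a single symbolic state-action trajectory, and a language model is then used to score how likely each observed action is under a hypothesized goal and belief, with the highest-scoring hypothesis chosen as the answer. The `lm_log_prob` function is a hypothetical stand-in (a word-overlap heuristic so the example runs without a model); a real system would query an actual language model, and none of this code is taken from the BIP-ALM implementation.

```python
# Hybrid sketch: fuse modalities into a symbolic trajectory, then score
# mental-state hypotheses with an LM-style likelihood. Illustrative only.
import math
import re
from typing import List, Tuple


def fuse_modalities(visual_states: List[str],
                    text_actions: List[str]) -> List[Tuple[str, str]]:
    """Align per-step visual state descriptions with textual action
    descriptions into a single symbolic (state, action) trajectory."""
    return list(zip(visual_states, text_actions))


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def lm_log_prob(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for log P(continuation | prompt) from a language
    model; a crude word-overlap heuristic keeps the sketch runnable without one."""
    overlap = len(_tokens(prompt) & _tokens(continuation))
    return math.log(0.5 + overlap)


def score_hypothesis(trajectory: List[Tuple[str, str]],
                     goal: str, belief: str) -> float:
    """Accumulate log P(action | goal, belief, state) over the trajectory,
    mirroring the likelihood term of Bayesian inverse planning."""
    total = 0.0
    for state, action in trajectory:
        prompt = f"Goal: {goal}. Belief: {belief}. State: {state}. Next action:"
        total += lm_log_prob(prompt, action)
    return total


def pick_answer(trajectory: List[Tuple[str, str]],
                hypotheses: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (goal, belief) hypothesis with the highest action likelihood."""
    return max(hypotheses, key=lambda gb: score_hypothesis(trajectory, *gb))


if __name__ == "__main__":
    trajectory = fuse_modalities(
        ["the person is standing in the kitchen"],  # from video frames
        ["walk to the fridge and open it"],         # from the text description
    )
    hypotheses = [("get the cupcake", "the cupcake is inside the fridge"),
                  ("get the cupcake", "the cupcake is inside the cabinet")]
    goal, belief = pick_answer(trajectory, hypotheses)
    print("More plausible belief:", belief)
```

Keeping the hypothesis scoring explicit in this way is what allows a model-based approach to track how a belief or goal hypothesis becomes more or less plausible as new actions are observed, rather than relying on a single end-to-end prediction.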