MuMA-ToM: Multi-modal Multi-Agent Theory of Mind (2408.12574v4)
Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states, as well as people's inferences about each other's mental states, based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on this context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validate MuMA-ToM in a human experiment and provide a human baseline. We also propose a novel multi-modal, multi-agent ToM model, LIMP (LLM-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and BIP-ALM, a recent multi-modal ToM model.
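The core idea behind inverse planning, which LIMP builds on, is to infer an agent's goal by asking which hypothesized goal best explains the observed actions under a rational action model. The sketch below is purely illustrative -- the goals, actions, and utility numbers are invented, and LIMP itself additionally uses LLMs to extract goal and belief hypotheses from multi-modal input -- but it shows the Bayesian goal-inference step in its simplest form:

```python
import math

# Toy Bayesian inverse planning: infer an agent's goal from observed actions.
# Goals, actions, and utilities here are hypothetical, hand-picked numbers.
GOALS = ["get_cup", "get_plate"]

# Utility of each action under each hypothesized goal.
UTILITY = {
    "get_cup":   {"walk_to_cabinet": 2.0, "walk_to_table": 0.5},
    "get_plate": {"walk_to_cabinet": 0.5, "walk_to_table": 2.0},
}

def action_likelihood(action, goal, beta=1.0):
    """Boltzmann-rational action model: P(action | goal) ∝ exp(beta * utility)."""
    utils = UTILITY[goal]
    z = sum(math.exp(beta * u) for u in utils.values())
    return math.exp(beta * utils[action]) / z

def posterior_over_goals(actions, beta=1.0):
    """Bayes rule over goal hypotheses, with a uniform prior."""
    scores = {}
    for g in GOALS:
        lik = 1.0
        for a in actions:
            lik *= action_likelihood(a, g, beta)
        scores[g] = lik / len(GOALS)
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# An agent that repeatedly heads for the cabinet is probably after the cup.
post = posterior_over_goals(["walk_to_cabinet", "walk_to_cabinet"])
assert post["get_cup"] > post["get_plate"]
```

The same machinery extends to multi-agent settings by letting one agent's utility depend on a hypothesis about another agent's goal, which is what makes questions like "what does A believe about B's goal?" answerable by inference rather than pattern matching.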
- Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses. arXiv preprint arXiv:2406.05659.
- Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4): 1–10.
- Does the autistic child have a “theory of mind”? Cognition, 21(1): 37–46.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding. arXiv preprint arXiv:2404.13627.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv preprint arXiv:2312.14238.
- ToMBench: Benchmarking Theory of Mind in Large Language Models. arXiv preprint arXiv:2402.15052.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv preprint arXiv:2406.07476.
- Cohen, M. 2021. Exploring RoBERTa’s theory of mind through textual entailment.
- Preschool emotional competence: Pathway to social competence? Child Development, 74(1): 238–256.
- Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36.
- Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others. Advances in Neural Information Processing Systems, 34: 9963–9976.
- A Framework for Sequential Planning in Multi-Agent Settings. Journal of Artificial Intelligence Research, 24: 49–79.
- Gordon, A. S. 2016. Commonsense Interpretation of Triangle Behavior. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
- Social evaluation by preverbal infants. Nature, 450(7169): 557–559.
- IPOMDP-Net: A Deep Neural Network for Partially Observable Multi-Agent Planning Using Interactive POMDPs. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 6062–6069.
- TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models’ Theory-of-Mind. arXiv preprint arXiv:2407.01455.
- Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models. arXiv preprint arXiv:2405.09605.
- Neural Amortized Inference for Nested Multi-agent Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 530–537.
- MMToM-QA: Multimodal Theory of Mind Question Answering. arXiv preprint arXiv:2401.08743.
- FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions. arXiv preprint arXiv:2310.15421.
- Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916.
- Kosinski, M. 2023. Theory of Mind May Have Spontaneously Emerged in Large Language Models. arXiv preprint arXiv:2302.02083.
- Revisiting the evaluation of theory of mind through question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5872–5877.
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125.
- M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387.
- An Infant-Cognition Inspired Machine Benchmark for Identifying Agency, Affiliation, Belief, and Intention. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46.
- Visual Instruction Tuning.
- Relational visual representations underlie human social interaction recognition. Nature Communications, 14(1): 7317.
- PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 845–853.
- OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
- NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models. arXiv preprint arXiv:2407.10380.
- Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36.
- VirtualHome: Simulating Household Activities via Programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8494–8502.
- Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration. arXiv preprint arXiv:2010.09890.
- NOPA: Neurally-guided Online Probabilistic Assistance for Building Socially Intelligent Home Assistants. arXiv preprint arXiv:2301.05223.
- Machine Theory of Mind. arXiv preprint arXiv:1802.07740.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
- EmoBench: Evaluating the Emotional Intelligence of Large Language Models. arXiv preprint arXiv:2402.12071.
- MultiVENT: Multilingual Videos of Events with Aligned Natural Text. arXiv preprint arXiv:2307.03153.
- Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 13960–13980. Toronto, Canada: Association for Computational Linguistics.
- AGENT: A benchmark for core psychological reasoning. In International Conference on Machine Learning, 9614–9625. PMLR.
- Adventures in Flatland: Perceiving Social Interactions Under Physical Dynamics. In CogSci.
- Views Are My Own, But Also Yours: Benchmarking Theory of Mind using Common Ground. arXiv preprint arXiv:2403.02451.
- A Bayesian theory of mind approach to modeling cooperation and communication. Wiley Interdisciplinary Reviews: Computational Statistics, 16(1): e1631.
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. arXiv preprint arXiv:2405.11985.
- Social interactions as recursive MDPs. In Conference on Robot Learning, 949–958. PMLR.
- Ullman, T. 2023. Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks. arXiv preprint arXiv:2302.08399.
- Help or hinder: Bayesian models of social goal inference. Advances in Neural Information Processing Systems, 22.
- Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion? In Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 36–45.
- Meta-analysis of theory-of-mind development: The truth about false belief. Child Development, 72(3): 655–684.
- Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities. arXiv preprint arXiv:2311.10227.
- Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Linguistics: EMNLP 2023, 10691–10706. Singapore: Association for Computational Linguistics.
- OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models. arXiv preprint arXiv:2402.06044.
- SEED-Story: Multimodal Long Story Generation with Large Language Model. arXiv preprint arXiv:2407.08683.
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. arXiv preprint arXiv:2404.16006.
- GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental Alignment. arXiv preprint arXiv:2403.11075.
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858.
- Task Me Anything. arXiv preprint arXiv:2406.11775.
- Online Bayesian goal inference for boundedly rational planning agents. Advances in Neural Information Processing Systems, 33: 19238–19250.