MMToM-QA: Multimodal Theory of Mind Question Answering (2401.08743v2)

Published 16 Jan 2024 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly LLMs, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by LLMs). BIP-ALM extracts unified representations from multimodal data and utilizes LLMs for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that LLMs and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and LLMs.

Introduction to Multimodal Theory of Mind Benchmarking

Theory of Mind (ToM) is the ability to attribute mental states to others, enabling individuals to predict and understand behavior. Efforts to build socially intelligent AI have therefore focused on evaluating machine ToM across a variety of benchmarks. Until now, these assessments have relied on unimodal datasets, restricted to either video or text. In real-world interactions, however, humans draw on both visual and linguistic information to infer others' mental states. To bridge this gap, the authors develop a comprehensive multimodal Theory of Mind question-answering benchmark, MMToM-QA.

Evaluating ToM in AI

The MMToM-QA benchmark evaluates machine ToM on both video and text modalities, probing human-like reasoning about another person's beliefs, goals, and plans within household scenarios. To tackle it, the authors combine Bayesian inverse planning, traditionally applied to video data, with LLMs to interpret and analyze multimodal input. The benchmark measures the ToM capabilities of machines against human performance, focusing in particular on multifaceted mental-state problems such as tracking beliefs over time and inferring goals under differing belief conditions.
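The core idea behind Bayesian inverse planning can be sketched in a few lines: maintain a prior over goal hypotheses and update it with the likelihood of each observed action under each goal. The sketch below is illustrative only; the goals, priors, and likelihood table are invented for exposition and are not the paper's BIP-ALM implementation, which derives likelihoods from a language model over symbolic state representations.

```python
# Minimal sketch of Bayesian inverse planning over goal hypotheses.
# All goals, priors, and likelihood values here are illustrative.

def posterior_over_goals(priors, likelihood, observed_actions):
    """P(goal | actions) proportional to P(goal) * prod_t P(action_t | goal)."""
    scores = {}
    for goal, prior in priors.items():
        p = prior
        for action in observed_actions:
            p *= likelihood(action, goal)
        scores[goal] = p
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# Hypothetical household scenario: which item is the person looking for?
priors = {"get_apple": 0.5, "get_plate": 0.5}

def likelihood(action, goal):
    # A real system would obtain this from a planner or, as in BIP-ALM,
    # from a language model scoring the action given the hypothesized goal.
    table = {
        ("walk_to_kitchen", "get_apple"): 0.5,
        ("walk_to_kitchen", "get_plate"): 0.5,
        ("open_fridge", "get_apple"): 0.8,
        ("open_fridge", "get_plate"): 0.2,
    }
    return table[(action, goal)]

post = posterior_over_goals(priors, likelihood, ["walk_to_kitchen", "open_fridge"])
print(post)  # "get_apple" now carries the higher posterior (0.8)
```

Opening the fridge is more consistent with fetching an apple than a plate, so the posterior shifts accordingly; belief inference in the paper follows the same logic with belief hypotheses in place of goals.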

The Multimodal Framework

MMToM-QA pairs videos and textual descriptions of activity in a domestic environment with questions about the mental states of the person in the scene; answering correctly requires integrating both modalities. A training set with ground-truth annotations supports model refinement and a detailed comparison between machine-generated and human responses. In addition, procedural generation of synthetic human activity data makes the benchmark scalable and keeps evaluation of AI models fast.
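The shape of a benchmark item can be made concrete with a small data structure. The field names and example below are assumptions for exposition, not the benchmark's actual schema:

```python
# Illustrative shape of one multimodal ToM question.
# Field names and the example content are assumptions, not MMToM-QA's schema.
from dataclasses import dataclass

@dataclass
class MMToMQuestion:
    video_frames: list       # paths to RGB frames of the household activity
    text_description: str    # textual narration of the same episode
    question: str            # a belief or goal query about the person
    choices: list            # candidate answers
    answer_index: int        # ground-truth label (training split only)

item = MMToMQuestion(
    video_frames=["frame_0001.png", "frame_0002.png"],
    text_description="Emma walked into the kitchen and opened the fridge.",
    question="If Emma has been trying to get an apple, which is more likely?",
    choices=[
        "She thinks there is an apple inside the fridge.",
        "She thinks the fridge does not contain an apple.",
    ],
    answer_index=0,
)
print(item.question)
```

Note that the question conditions a belief query on a hypothesized goal, which is why answering requires joint reasoning over what was seen and what was described rather than either modality alone.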

Insights and Implications

While established LLMs and multimodal models show limited ToM reasoning, BIP-ALM, the proposed multimodal ToM model, performs markedly better by pairing robust Bayesian inverse planning with the flexible reasoning abilities of LLMs. It not only interprets observed actions in light of hypothesized mental states but also approaches human judgment more closely than prior models. Together, MMToM-QA and BIP-ALM underscore the need for multimodal understanding in social intelligence and suggest that machine ToM benefits greatly from such a hybrid approach, pointing the way toward more socially aware artificial agents across a range of applications.

Authors (10)
  1. Chuanyang Jin
  2. Yutong Wu
  3. Jing Cao
  4. Jiannan Xiang
  5. Yen-Ling Kuo
  6. Zhiting Hu
  7. Tomer Ullman
  8. Antonio Torralba
  9. Joshua B. Tenenbaum
  10. Tianmin Shu