
Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models (2406.13763v1)

Published 19 Jun 2024 in cs.CV and cs.AI

Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in LLMs. LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.

Authors (7)
  1. Zhawnen Chen (1 paper)
  2. Tianchun Wang (19 papers)
  3. Yizhou Wang (162 papers)
  4. Michal Kosinski (14 papers)
  5. Xiang Zhang (395 papers)
  6. Yun Fu (131 papers)
  7. Sheng Li (217 papers)