Multimodal Dual Attention Memory for Video Story Question Answering (1809.07999v1)

Published 21 Sep 2018 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM applies a second attention over these latent concepts. Multimodal fusion is performed only after the dual attention processes (late fusion). Through this pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which provide large-scale QA annotations for cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results with significant margins over the runner-up models. Ablation studies confirm that the dual attention mechanism combined with late fusion yields the best performance. We also perform a qualitative analysis by visualizing the inference mechanisms of MDAM.
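The dual-attention-with-late-fusion idea can be summarized in a few lines of PyTorch. The sketch below is an illustrative reconstruction from the abstract alone, not the authors' implementation: the module names, feature dimensions, the use of nn.MultiheadAttention for both attention stages, the elementwise-product fusion, and the 5-way answer classifier are all assumptions.

```python
# Minimal sketch of dual attention + late fusion, assuming pre-extracted
# frame, caption, and question features of a shared dimension.
import torch
import torch.nn as nn

class DualAttentionLateFusion(nn.Module):
    def __init__(self, dim=512, heads=8, num_answers=5):
        super().__init__()
        # First attention stage: per-modality self-attention learns latent concepts.
        self.frame_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.caption_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Second attention stage: the question attends over each modality's concepts.
        self.frame_q_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.caption_q_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, frames, captions, question):
        # frames:   (B, T_f, dim) scene-frame features
        # captions: (B, T_c, dim) caption features
        # question: (B, 1, dim)   pooled question embedding
        f, _ = self.frame_self_attn(frames, frames, frames)
        c, _ = self.caption_self_attn(captions, captions, captions)
        # Question-guided attention over each modality's latent concepts.
        f_attended, _ = self.frame_q_attn(question, f, f)    # (B, 1, dim)
        c_attended, _ = self.caption_q_attn(question, c, c)  # (B, 1, dim)
        # Late fusion: modalities meet only after both attention stages.
        # Elementwise product is a simple stand-in for the paper's fusion module.
        fused = f_attended * c_attended
        return self.classifier(fused.squeeze(1))             # answer logits

model = DualAttentionLateFusion()
logits = model(torch.randn(2, 40, 512),   # 40 frames
               torch.randn(2, 20, 512),   # 20 caption tokens/sentences
               torch.randn(2, 1, 512))    # pooled question
```

The point of the structure is the ordering: each modality is first abstracted independently (self-attention, then question-guided attention), and only the resulting single vectors are fused, which is what the abstract's "late fusion" refers to.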

Authors (4)
  1. Kyung-Min Kim (25 papers)
  2. Seong-Ho Choi (2 papers)
  3. Jin-Hwa Kim (42 papers)
  4. Byoung-Tak Zhang (83 papers)
Citations (77)
