Multimodal Dual Attention Memory for Video Story Question Answering (1809.07999v1)

Published 21 Sep 2018 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM applies a second attention over these latent concepts. Multimodal fusion is performed only after the dual attention processes (late fusion). Through this pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which provide large-scale QA annotations for cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results with significant margins over the runner-up models. Ablation studies confirm that the dual attention mechanism combined with late fusion yields the best performance. We also perform a qualitative analysis by visualizing the inference mechanisms of MDAM.
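The dual-attention-with-late-fusion idea can be summarized in a few lines of PyTorch. The sketch below is an illustrative reconstruction from the abstract alone, not the authors' implementation: the module names, feature dimensions, the use of nn.MultiheadAttention for both attention stages, the elementwise-product fusion, and the 5-way answer classifier are all assumptions.

```python
# Minimal sketch of dual attention + late fusion, assuming pre-extracted
# frame, caption, and question features of a shared dimension.
import torch
import torch.nn as nn

class DualAttentionLateFusion(nn.Module):
    def __init__(self, dim=512, heads=8, num_answers=5):
        super().__init__()
        # First attention stage: per-modality self-attention learns latent concepts.
        self.frame_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.caption_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Second attention stage: the question attends over each modality's concepts.
        self.frame_q_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.caption_q_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, frames, captions, question):
        # frames:   (B, T_f, dim) scene-frame features
        # captions: (B, T_c, dim) caption features
        # question: (B, 1, dim)   pooled question embedding
        f, _ = self.frame_self_attn(frames, frames, frames)
        c, _ = self.caption_self_attn(captions, captions, captions)
        # Question-guided attention over each modality's latent concepts.
        f_attended, _ = self.frame_q_attn(question, f, f)    # (B, 1, dim)
        c_attended, _ = self.caption_q_attn(question, c, c)  # (B, 1, dim)
        # Late fusion: modalities meet only after both attention stages.
        # Elementwise product is a simple stand-in for the paper's fusion module.
        fused = f_attended * c_attended
        return self.classifier(fused.squeeze(1))             # answer logits

model = DualAttentionLateFusion()
logits = model(torch.randn(2, 40, 512),   # 40 frames
               torch.randn(2, 20, 512),   # 20 caption tokens/sentences
               torch.randn(2, 1, 512))    # pooled question
```

The point of the structure is the ordering: each modality is first abstracted independently (self-attention, then question-guided attention), and only the resulting single vectors are fused, which is what the abstract's "late fusion" refers to.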

Authors (4)
  1. Kyung-Min Kim (25 papers)
  2. Seong-Ho Choi (2 papers)
  3. Jin-Hwa Kim (42 papers)
  4. Byoung-Tak Zhang (83 papers)
Citations (77)
