Multimodal Semantic Attention Network for Video Captioning (1905.02963v1)
Abstract: Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating the task as a multi-label classification problem. We further add an auxiliary classification loss to our model so that it obtains more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices and employ an attention mechanism to attend to different attributes at each time step of the captioning process. We evaluate the algorithm on two popular public benchmarks, MSVD and MSR-VTT, achieving results competitive with the current state of the art across six evaluation metrics.
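The decoding idea described in the abstract, replacing each LSTM weight matrix with an attention-weighted ensemble of attribute-dependent matrices, can be illustrated with a minimal sketch. The code below is an assumption-laden reconstruction, not the authors' implementation: the class name `AttributeLSTMCell`, the dimensions, and the form of the attribute-attention scorer are all hypothetical, and it only shows how per-attribute weights might be mixed at each time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLSTMCell(nn.Module):
    """Sketch of an LSTM cell whose gate weights are an attention-weighted
    ensemble of K attribute-dependent weight matrices (hypothetical form,
    not the paper's exact formulation)."""

    def __init__(self, input_dim, hidden_dim, num_attributes):
        super().__init__()
        self.hidden_dim = hidden_dim
        # One set of gate parameters (i, f, g, o stacked as 4H) per attribute.
        self.W_x = nn.Parameter(0.01 * torch.randn(num_attributes, input_dim, 4 * hidden_dim))
        self.W_h = nn.Parameter(0.01 * torch.randn(num_attributes, hidden_dim, 4 * hidden_dim))
        self.bias = nn.Parameter(torch.zeros(4 * hidden_dim))
        # Attention scorer over attributes, conditioned on the previous hidden
        # state and the attribute distribution (assumed conditioning).
        self.attn = nn.Linear(hidden_dim + num_attributes, num_attributes)

    def forward(self, x_t, state, attr_probs):
        # x_t: (B, input_dim) word embedding at step t
        # state: (h, c), each (B, hidden_dim)
        # attr_probs: (B, K) multimodal semantic attribute distribution
        h, c = state
        # Attend over the K attributes at this time step.
        alpha = F.softmax(self.attn(torch.cat([h, attr_probs], dim=-1)), dim=-1)  # (B, K)
        # Mix the attribute-dependent weight matrices with the attention weights.
        W_x = torch.einsum('bk,kiz->biz', alpha, self.W_x)   # (B, input_dim, 4H)
        W_h = torch.einsum('bk,khz->bhz', alpha, self.W_h)   # (B, hidden_dim, 4H)
        gates = (torch.einsum('bi,biz->bz', x_t, W_x)
                 + torch.einsum('bh,bhz->bz', h, W_h) + self.bias)
        i, f, g, o = gates.chunk(4, dim=-1)
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new
```

In this sketch, `attr_probs` would come from the encoder's multi-label attribute classifier, and the cell would be unrolled over caption tokens in place of a standard `nn.LSTMCell`.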
- Liang Sun (124 papers)
- Bing Li (374 papers)
- Chunfeng Yuan (35 papers)
- Zhengjun Zha (24 papers)
- Weiming Hu (91 papers)