Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning (1905.01077v1)

Published 3 May 2019 in cs.CV

Abstract: Video captioning is widely regarded as a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach maps an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, RNN training still suffers to some degree from the vanishing/exploding gradient problem, which makes optimization difficult. Moreover, the inherently recurrent dependency in an RNN prevents parallelization within a sequence during training and therefore limits computational efficiency. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), that fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states of a fixed number of inputs, and we stack several blocks to capture long-term relationships. The encoder structure is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported when compared with conventional RNN-based encoder-decoder techniques. Most remarkably, TDConvED increases CIDEr-D performance from 58.8% to 67.2% on MSVD.
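
The idea of "free-form deformation of temporal sampling" can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes a single scalar feature per frame, offsets passed in as an argument (in a trained network they would be predicted from the input), linear interpolation between neighbouring frames, and zero padding outside the clip. The names sample_linear and temporal_deformable_conv are hypothetical.

```python
import math


def sample_linear(seq, pos):
    """Linearly interpolate seq at a fractional position; zero outside the clip."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        # Zero padding (an assumption of this sketch) for out-of-range frames.
        return seq[i] if 0 <= i < len(seq) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def temporal_deformable_conv(seq, weights, offsets):
    """1-D convolution over time whose taps are shifted by per-step offsets.

    seq:     list of per-frame scalar features, length T
    weights: kernel weights, odd length k
    offsets: T x k nested list of fractional shifts, one per output step and tap
    """
    k = len(weights)
    r = k // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for n in range(k):
            # Regular grid position (t + n - r) plus the extra offset for this tap.
            pos = t + (n - r) + offsets[t][n]
            acc += weights[n] * sample_linear(seq, pos)
        out.append(acc)
    return out


frames = [1.0, 2.0, 4.0, 8.0]
kernel = [0.0, 1.0, 0.0]  # identity kernel, so the effect of offsets is easy to see

no_shift = [[0.0, 0.0, 0.0] for _ in frames]
half_shift = [[0.0, 0.5, 0.0] for _ in frames]

regular = temporal_deformable_conv(frames, kernel, no_shift)
deformed = temporal_deformable_conv(frames, kernel, half_shift)
```

With all offsets at zero the operation reduces to an ordinary temporal convolution; a non-zero offset moves a tap between frames, so each output step can read from whichever part of the clip is most informative.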

Authors (6)

Jingwen Chen
Yingwei Pan
Yehao Li
Ting Yao
Hongyang Chao
Tao Mei

Citations (98)
