End-to-End Dense Video Captioning with Masked Transformer (1804.00819v1)

Published 3 Apr 2018 in cs.CV

Abstract: Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description to the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding feature. This masking network converts the event proposal to a differentiable mask, which ensures the consistency between the proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on ActivityNet Captions and YouCookII datasets, where we achieved 10.12 and 6.58 METEOR score, respectively.

End-to-End Dense Video Captioning with Masked Transformer

The paper "End-to-End Dense Video Captioning with Masked Transformer" addresses the complex task of generating textual descriptions for multiple events within untrimmed videos. Unlike traditional approaches that involve separate models for event detection and captioning, this paper proposes an integrated transformer model that facilitates direct interactions between visual and language components. This end-to-end approach inherently allows for more coherent and contextually accurate descriptions.

Methodology

The core architecture of the proposed model comprises a video encoder and two decoders—one for event proposals and the other for captioning. Self-attention mechanisms are employed within the transformer architecture to efficiently model long-range dependencies without the need for recurrent computations.

  • Video Encoder: Utilizes self-attention layers to derive contextual representations of video frames, enabling effective learning of frame-level temporal dependencies.
  • Proposal Decoder: Generates event proposals using temporal convolutional networks, leveraging explicit anchors and a mask prediction network that ensures coherence between detected proposals and language information.
  • Captioning Decoder: Employs a masked transformer network that constrains the attention mechanism to event-specific regions, guided by differentiable masks. This setup ensures that only relevant portions of the video influence the generated descriptions (a minimal sketch of the masking mechanism follows this list).

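The core of this design is converting each event proposal into a differentiable attention mask. The PyTorch snippet below is a minimal sketch of that idea, assuming a sigmoid-gated soft window parameterized by predicted start and end boundaries; the function names, the gating form, and the renormalized multiplicative masking are illustrative simplifications rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_proposal_mask(start, end, num_frames, sharpness=10.0):
    """Turn predicted (start, end) proposal boundaries, given as fractions of
    the video length, into a differentiable mask over frame positions:
    near 1 inside the proposal, near 0 outside."""
    pos = torch.linspace(0.0, 1.0, num_frames, device=start.device)  # (T,)
    # Two sigmoid gates form a soft window that stays differentiable
    # with respect to the predicted boundaries.
    left = torch.sigmoid(sharpness * (pos - start.unsqueeze(-1)))
    right = torch.sigmoid(sharpness * (end.unsqueeze(-1) - pos))
    return left * right  # (batch, T)

def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention in which the soft proposal mask restricts
    which encoded frames the captioning decoder can attend to."""
    d_k = keys.size(-1)
    scores = torch.matmul(query, keys.transpose(-2, -1)) / d_k ** 0.5  # (B, Q, T)
    weights = F.softmax(scores, dim=-1) * mask.unsqueeze(1)
    weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
    return torch.matmul(weights, values)  # (B, Q, D)

# Toy usage: 2 videos, 100 encoded frames, 256-dim features, one decoder query.
enc = torch.randn(2, 100, 256)
start = torch.tensor([0.2, 0.5])  # predicted proposal starts (fractions of video length)
end = torch.tensor([0.4, 0.9])    # predicted proposal ends
mask = soft_proposal_mask(start, end, num_frames=100)
query = torch.randn(2, 1, 256)
context = masked_attention(query, enc, enc, mask)  # (2, 1, 256)
```

Because the mask is a smooth function of the predicted boundaries, the captioning loss can propagate gradients back into the proposal decoder, which is what keeps the proposal and captioning stages consistent during end-to-end training.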
Strong Numerical Results

The effectiveness of the proposed method is demonstrated on two benchmark datasets: ActivityNet Captions and YouCookII. The model achieves METEOR scores of 10.12 and 6.58, respectively, indicating that the end-to-end design aligns language and visual information effectively across diverse video content.
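As a point of reference, the sketch below shows how a single hypothesis caption can be scored against a reference with NLTK's METEOR implementation. The paper's reported numbers come from the standard dense video captioning evaluation protocol, which additionally matches predicted proposals to ground-truth events at several temporal IoU thresholds before averaging; the example sentences here are invented purely for illustration.

```python
# Sentence-level METEOR with NLTK (illustrative only; the paper's scores use the
# official evaluation toolkit, which also handles proposal-to-event matching).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet for synonym matching
nltk.download("omw-1.4", quiet=True)

reference = "a person slices a tomato on a cutting board".split()
hypothesis = "someone cuts a tomato on a board".split()

# meteor_score expects a list of tokenized references and a tokenized hypothesis.
print(f"METEOR: {meteor_score([reference], hypothesis):.4f}")
```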

Implications and Future Directions

From a practical standpoint, integrating the proposal and captioning stages allows for more robust and scalable deployment in environments where video data is abundant and continuously generated, such as surveillance or educational platforms. On the theoretical side, the work motivates tighter fusion of language models with visual encoders to further improve both event understanding and description generation.

Future research could extend this work by incorporating more fine-grained object detection techniques within the video encoder, potentially improving descriptions in contexts where small and ambiguous objects are prevalent, as seen in datasets like YouCookII. Moreover, exploring variations in transformer architecture depth might unlock further performance gains without compromising computational efficiency.

In conclusion, this paper presents a comprehensive end-to-end framework that effectively bridges the gap between event detection and description in untrimmed videos, promising advancements in the domain of dense video captioning.

Authors (5)
  1. Luowei Zhou (31 papers)
  2. Yingbo Zhou (81 papers)
  3. Jason J. Corso (71 papers)
  4. Richard Socher (115 papers)
  5. Caiming Xiong (337 papers)
Citations (504)