Learning Video Representations using Contrastive Bidirectional Transformer (1906.05743v2)

Published 13 Jun 2019 in cs.LG, cs.CV, and stat.ML

Abstract: This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods. Our method extends the BERT model for text sequences to the case of sequences of real-valued feature vectors, by replacing the softmax loss with noise contrastive estimation (NCE). We also show how to learn representations from sequences of visual features and sequences of words derived from ASR (automatic speech recognition), and show that such cross-modal training (when possible) helps even more.

Citations (231)

Summary

  • The paper proposes a novel self-supervised method that leverages contrastive bidirectional transformers to learn robust video representations.
  • The method achieves significant improvements in video classification and segmentation, notably boosting accuracy on UCF101 and HMDB51 datasets.
  • It incorporates cross-modal training by fusing visual and ASR-generated textual features, advancing multi-modal video analysis.

Learning Video Representations with Contrastive Bidirectional Transformers

The paper "Learning Video Representations using Contrastive Bidirectional Transformer" introduces a novel approach for self-supervised learning of video representations, specifically focusing on improving performance in downstream tasks such as video classification, segmentation, and captioning. The methodology leverages a contrastive adaptation of the BERT model for sequence learning, which typically applies to text, and extends it to video sequences by employing noise contrastive estimation (NCE) rather than traditional softmax loss.

Method Overview

This work builds on the BERT model, known for its effectiveness in textual representation learning via masked language modeling (MLM). To adapt it to video, the authors circumvent BERT's requirement for discrete inputs by operating directly on real-valued video frame features. The proposed model, termed the Contrastive Bidirectional Transformer (CBT), uses NCE to maximize mutual information between frames in a video sequence, enabling robust representation learning without dependence on pre-trained visual encoders or vector quantization, which would otherwise discard fine-grained information.
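
To make the objective concrete, the sketch below illustrates a masked-frame NCE loss in the spirit described above. It is a minimal, hypothetical PyTorch implementation, not the authors' code: the transformer configuration, feature dimension, and the use of other in-batch frames as negatives are illustrative assumptions.

```python
# Minimal sketch of a masked-frame NCE objective in the spirit of CBT.
# Hypothetical and simplified: the encoder configuration, feature dimension,
# and in-batch negative sampling are assumptions, not the authors' setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFrameNCE(nn.Module):
    def __init__(self, feature_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        # Learnable placeholder inserted at the masked position.
        self.mask_token = nn.Parameter(torch.zeros(feature_dim))

    def forward(self, frame_feats, mask_idx):
        """frame_feats: (batch, seq_len, dim) real-valued visual features.
        mask_idx: (batch,) position of the masked frame in each sequence."""
        batch = frame_feats.size(0)
        rows = torch.arange(batch, device=frame_feats.device)

        targets = frame_feats[rows, mask_idx]        # true masked frames
        masked = frame_feats.clone()
        masked[rows, mask_idx] = self.mask_token     # hide the target frame
        context = self.transformer(masked)           # bidirectional context
        preds = context[rows, mask_idx]              # predicted embeddings

        # NCE as a softmax over candidates: the true frame is the positive,
        # masked frames from other sequences in the batch act as negatives.
        logits = preds @ targets.t()                 # (batch, batch)
        return F.cross_entropy(logits, rows)

# Example: feats = torch.randn(8, 16, 512); pos = torch.randint(0, 16, (8,))
# loss = MaskedFrameNCE()(feats, pos)
```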

Additionally, the model incorporates cross-modal training. It learns joint representations by fusing visual features from video frame sequences with textual features derived from ASR-generated captions, thereby increasing mutual information across modalities. This dual-modality learning is lightweight and adaptable, tolerating misalignment between the visual and spoken content of a video without requiring precise synchrony.
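
A similarly hedged sketch of a cross-modal contrastive term is shown below. It pairs pooled video embeddings with pooled ASR-text embeddings and treats mismatched pairs within a batch as negatives; the pooling, normalization, temperature, and symmetric form are common choices assumed here for illustration and may differ from the paper's exact formulation.

```python
# Sketch of a cross-modal NCE term pairing pooled video embeddings with
# pooled ASR-text embeddings. Illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) clip and ASR-sentence embeddings.
    Matching (video, text) pairs are positives; every other pairing in the
    batch serves as a negative."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (batch, batch) scores
    labels = torch.arange(v.size(0), device=v.device)
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Example: loss = cross_modal_nce(torch.randn(8, 512), torch.randn(8, 512))
```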

Experimental Results and Insights

The paper presents comprehensive experimental evaluations demonstrating the superior efficacy of the CBT model in learning both short-term and long-term video representations. Notable improvements in performance over preceding self-supervised video representation methodologies are observed:

  • Video Classification: The model outperforms prior state-of-the-art self-supervised methods, improving accuracy from 75.7% to 79.5% on UCF101 and from 35.7% to 44.6% on HMDB51.
  • Action Anticipation, Segmentation, and Captioning: Learning temporal representations from large unlabeled video datasets such as HowTo100M allows the model to excel in these tasks. It proves robust across diverse settings, producing well-aligned action segments and captions and outperforming competing methods by large margins.

Ablation Studies

Through careful ablation studies, the authors examine the impact of factors such as model depth, number of attention heads, cross-modal training, and pretraining dataset size. Cross-modal training, for instance, markedly improves performance, particularly on action anticipation. The studies also offer insight into scalability and parameter tuning, suggesting the approach remains effective as the volume of unlabeled video data grows.

Contributions to the Field

The implications of this research are multi-faceted. Practically, the approach enables efficient learning of video representations without labeled data or explicit video quantization, reducing annotation costs and broadening applicability to domains where labeled data is scarce. Theoretically, the work bridges a crucial gap between modalities in video representation learning, potentially opening new avenues in multi-modal learning.

Looking forward, the methodology and insights presented in this paper are a foundational step towards more versatile and scalable self-supervised models, in video and beyond, with the potential to parallel the advances of language models in NLP by eventually surpassing traditional supervised pretraining.

In summary, this paper delivers a significant contribution to video representation learning, propelling the flexibility and capability of self-supervised models to new heights, and offering an efficient framework adaptable to expansive, uncurated video datasets.