- The paper introduces DPC, a self-supervised method that learns dense spatio-temporal embeddings by predicting future representations.
- It uses a curriculum training approach to gradually extend prediction horizons and reduce reliance on optical flow cues.
- Pretrained on Kinetics-400, DPC reaches state-of-the-art self-supervised performance on UCF101 (75.7% top-1) and HMDB51 (35.7% top-1) action recognition.
Video Representation Learning by Dense Predictive Coding
The paper "Video Representation Learning by Dense Predictive Coding" proposes a novel approach for self-supervised learning in videos that is particularly suitable for human action recognition tasks. The proposed methodology, Dense Predictive Coding (DPC), introduces a framework to learn dense spatio-temporal embeddings by predicting future representations from past video sequences.
Contributions
The authors make several key contributions in this work:
- Dense Predictive Coding Framework: The DPC framework performs self-supervised learning without labeled data. It encodes spatio-temporal blocks, aggregates them with a recurrent function, and predicts the embeddings of future blocks. This dense encoding supports robust spatio-temporal reasoning and captures semantic actions rather than low-level details.
- Curriculum Training: A curriculum training scheme has the model gradually learn to predict further into the future from progressively less temporal context. This strategy prevents the model from relying on shortcuts such as low-level optical flow and encourages it to encode slowly varying spatio-temporal signals, yielding semantic representations (a toy schedule is sketched after this list).
- State-of-the-Art Self-Supervised Performance: The effectiveness of DPC is demonstrated through extensive experiments. When pretrained on the Kinetics-400 dataset, DPC achieves significant improvements on action recognition for UCF101 and HMDB51, outperforming all prior self-supervised methods and approaching the performance of models pretrained on large labeled datasets such as ImageNet.
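As a rough illustration of the curriculum idea, the schedule below extends the prediction horizon while shrinking the temporal context as training progresses. The block counts and switch-over epochs here are hypothetical, not the paper's actual settings:

```python
def curriculum_schedule(epoch):
    """Hypothetical curriculum for DPC-style training: predict more future
    blocks from fewer context blocks as training progresses (illustrative
    values only; not the paper's actual schedule)."""
    if epoch < 30:
        return {"context_blocks": 5, "pred_blocks": 1}  # easy: long context, short horizon
    elif epoch < 60:
        return {"context_blocks": 4, "pred_blocks": 2}
    return {"context_blocks": 3, "pred_blocks": 3}      # hard: short context, long horizon
```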
Methodology
The core of DPC lies in predicting the embeddings of future video blocks rather than the frames themselves. This is achieved with a contrastive loss akin to noise contrastive estimation, which distinguishes the correct future representation from multiple distractors. Specifically (a minimal code sketch follows the list below):
- Temporal Aggregation: A ConvGRU aggregates the features of past blocks, allowing the network to maintain temporal coherence and predict future representations without reconstructing exact appearances, which accommodates the non-deterministic nature of future states.
- Sequential Prediction with Dense Mapping: By predicting future representations sequentially and preserving the spatial layout of the feature maps, the network avoids trivial solutions and learns semantic features that generalize to downstream tasks such as action recognition.
- Regularization and Augmentation: Frame-wise random augmentations disrupt low-level optical-flow shortcuts, further encouraging high-level semantic understanding.
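The pieces above can be combined into a compact sketch. The code below is a minimal PyTorch illustration of a DPC-style training step, not the authors' implementation: the encoder is a toy stand-in for the paper's 3D-CNN backbone, a per-location GRUCell stands in for the ConvGRU, and `phi` stands in for the prediction head. Only the overall structure (recurrent aggregation, sequential dense prediction, NCE-style loss) reflects the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCSketch(nn.Module):
    """Minimal sketch of Dense Predictive Coding (illustrative, not the
    authors' code). Assumptions: `encoder` maps one video block to a dense
    feature map (B, C, H, W); a GRUCell applied independently at each of
    the H*W spatial locations stands in for the paper's ConvGRU; a linear
    layer stands in for the prediction head phi."""

    def __init__(self, encoder, channels=64, pred_steps=3):
        super().__init__()
        self.encoder = encoder
        self.gru = nn.GRUCell(channels, channels)  # per-location stand-in for ConvGRU
        self.phi = nn.Linear(channels, channels)   # prediction head
        self.pred_steps = pred_steps

    def forward(self, blocks):
        # blocks: list of video blocks, each of shape (B, 3, T, H0, W0)
        feats = [self.encoder(b) for b in blocks]                         # (B, C, H, W) each
        B, C, H, W = feats[0].shape
        flat = [f.permute(0, 2, 3, 1).reshape(B * H * W, C) for f in feats]

        n_ctx = len(flat) - self.pred_steps
        h = torch.zeros_like(flat[0])
        for z in flat[:n_ctx]:                     # aggregate the context blocks
            h = self.gru(z, h)

        preds, targets = [], flat[n_ctx:]
        for _ in range(self.pred_steps):           # sequential prediction: predict the next
            z_hat = self.phi(h)                    # block's features, then feed the prediction
            preds.append(z_hat)                    # back into the aggregator
            h = self.gru(z_hat, h)
        return preds, targets


def dense_nce_loss(preds, targets):
    """NCE-style loss over dense predictions: for each predicted feature
    vector, the matching (sample, space, time) target is the positive and
    every other target vector in the batch acts as a distractor."""
    z_hat = torch.cat(preds)                       # (N, C), N = steps * B * H * W
    z = torch.cat(targets)                         # (N, C); row i is the positive for row i
    logits = z_hat @ z.t()                         # all prediction-target similarities
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)


class ToyEncoder(nn.Module):
    """Toy stand-in for the paper's 3D-CNN backbone: one 3D conv, time pooled away."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv3d(3, channels, kernel_size=3, padding=1)
    def forward(self, x):                          # x: (B, 3, T, H0, W0)
        return self.conv(x).mean(dim=2)            # -> (B, C, H0, W0)


# Toy usage: 8 blocks of 5 frames; the first 5 are context, the last 3 are targets.
model = DPCSketch(ToyEncoder(), channels=64, pred_steps=3)
blocks = [torch.randn(2, 3, 5, 4, 4) for _ in range(8)]
preds, targets = model(blocks)
loss = dense_nce_loss(preds, targets)
```

Scoring every prediction against every target in one matrix mirrors the paper's observation that distractors come cheaply from other spatial positions, other time steps, and other samples in the batch.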
Experimental Results
DPC scales well with data: pretrained on Kinetics-400, it achieves 75.7% top-1 accuracy on UCF101 and 35.7% on HMDB51. The curriculum learning strategy enhances the model's predictive capacity by introducing difficulty progressively. Ablation studies confirm the necessity of dense predictions and show that higher self-supervised prediction accuracy correlates with better downstream task performance.
Implications and Future Directions
The paper's insights have significant implications for video representation learning, particularly where labeled data is scarce. DPC could lead to more effective unsupervised and semi-supervised frameworks for video understanding. Future work might integrate DPC with other modalities, such as optical flow or audio, to enrich the learned representations; investigate alternatives to the ConvGRU for temporal aggregation; and further extend the pretraining data scale to build more robust video understanding systems.
The Dense Predictive Coding framework challenges conventional approaches to video representation learning by leaning heavily on abstraction and semantic understanding, holding promise for more sophisticated future developments in AI.