Self-supervised Video Representation Learning Using Inter-Intra Contrastive Framework
This paper proposes the Inter-Intra Contrastive (IIC) framework for self-supervised video representation learning. The approach combines inter-sample and intra-sample learning to capture both the spatial and temporal structure of video data. Its goal is to train the discriminative feature extractors that video understanding tasks require directly from unlabeled video, avoiding costly manual annotation.
Core Methodology
The IIC framework builds on the contrastive learning paradigm, whose core idea is to pull positive data pairs together and push negatives apart under a similarity measure. Conventional methods draw negatives only from other samples in the dataset. Here, the authors introduce intra-negative samples, generated by disrupting the temporal continuity within the same video; these enlarge the negative set and sharpen temporal feature learning.
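To make the objective concrete, here is a minimal InfoNCE-style sketch in PyTorch in which intra-negatives simply enlarge the negative set. The in-batch formulation, tensor shapes, and function name are illustrative assumptions rather than the paper's exact objective, which follows standard noise-contrastive estimation.

```python
import torch
import torch.nn.functional as F

def iic_contrastive_loss(anchor, positive, inter_negs, intra_negs, temperature=0.07):
    """InfoNCE-style loss where intra-negatives join the negative set.

    anchor:     (B, D)    one view of each clip (e.g. RGB)
    positive:   (B, D)    the other view of the same clip (e.g. residual)
    inter_negs: (B, K, D) clips sampled from other videos
    intra_negs: (B, M, D) temporally broken clips from the same video
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(torch.cat([inter_negs, intra_negs], dim=1), dim=-1)

    pos = (anchor * positive).sum(-1, keepdim=True) / temperature       # (B, 1)
    neg = torch.einsum('bd,bkd->bk', anchor, negatives) / temperature   # (B, K+M)

    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)  # positive pair sits at index 0
```

The only change relative to a standard contrastive loss is the concatenation of intra_negs into the negative pool.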
The framework involves:
- Multiple View Integration: Employing two views of each video, typically the original RGB frames plus an alternative modality such as optical flow or residual frames. These perspectives are considered jointly for representation learning (a residual-frame sketch follows this list).
- Intra-negative Sample Creation: Two methods are proposed: frame repeating, which fills a clip with copies of a single frame, and temporal shuffling, which randomly rearranges frame order. Both break the temporal sequence while preserving appearance, producing negatives drawn from the same source video (see the sketch after this list).
- Contrastive Learning with Intra-Negatives: Using a 3D convolutional network backbone, the framework optimizes a contrastive loss that mixes these intra-negatives with conventional inter-sample negatives (as in the loss sketch above), pushing the network to capture fine-grained temporal information.
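As a simple illustration of the second-view idea, residual frames can be computed as differences of consecutive frames. This is a minimal sketch assuming clips stored as (C, T, H, W) tensors:

```python
import torch

def residual_frames(clip: torch.Tensor) -> torch.Tensor:
    """Residual view: difference of consecutive frames, (C, T, H, W) -> (C, T-1, H, W).

    Residuals suppress static appearance and highlight motion, giving a cheap
    alternative to optical flow as the second view.
    """
    return clip[:, 1:] - clip[:, :-1]
```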
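The two intra-negative generators can likewise be sketched in a few lines; the function names and the (C, T, H, W) clip layout are assumptions for illustration:

```python
import torch

def frame_repeating(clip: torch.Tensor) -> torch.Tensor:
    """Intra-negative 1: fill the clip with copies of one randomly chosen frame,
    removing all temporal variation while keeping appearance intact."""
    t = int(torch.randint(clip.size(1), (1,)))
    return clip[:, t:t + 1].expand(-1, clip.size(1), -1, -1).contiguous()

def temporal_shuffling(clip: torch.Tensor) -> torch.Tensor:
    """Intra-negative 2: randomly permute the frame order, breaking temporal
    continuity while preserving per-frame content."""
    return clip[:, torch.randperm(clip.size(1))]
```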
Evaluation and Results
Extensive experiments show that the IIC framework outperforms prior state-of-the-art self-supervised methods on video retrieval and recognition tasks. On video retrieval, top-1 accuracy improves by up to 16.7% on UCF101 and 9.5% on HMDB51. These results underline the framework's ability to extract discriminative video features, validated through both retrieval and recognition evaluations.
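For reference, the retrieval protocol behind such numbers typically extracts features for every clip with the frozen backbone and checks whether any of a test clip's k nearest training clips shares its class label. A brief sketch, with illustrative names and shapes, follows.

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(query_feats, query_labels, gallery_feats, gallery_labels,
                            ks=(1, 5, 10, 20, 50)):
    """Nearest-neighbour retrieval: a test clip counts as correct at rank k if
    any of its k most similar training clips has the same class label.

    query_feats:   (Nq, D) test-split features     query_labels:   (Nq,)
    gallery_feats: (Ng, D) train-split features    gallery_labels: (Ng,)
    """
    sims = F.normalize(query_feats, dim=1) @ F.normalize(gallery_feats, dim=1).t()
    nearest = sims.topk(max(ks), dim=1).indices              # (Nq, max_k)
    hits = gallery_labels[nearest] == query_labels[:, None]  # (Nq, max_k) bool
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```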
Implications and Future Directions
The proposed IIC framework represents a substantial theoretical advance in self-supervised learning by blending intra-sample and inter-sample methodologies. Practically, it reduces dependence on labeled data, making it a scalable option for video understanding tasks that must exploit the large volumes of unlabeled video found in real-world applications.
The authors point to several future directions, including richer multi-modal view combinations and alternative network architectures that better exploit diverse input modalities. Extending the framework to larger datasets and more complex video classification tasks could further broaden its impact on video understanding.
In conclusion, the paper presents a compelling framework for video representation learning, with strong empirical results and a solid foundation for future research on contrastive self-supervised learning.