Self-supervised Co-training for Video Representation Learning (2010.09709v2)

Published 19 Oct 2020 in cs.CV

Abstract: The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.

Citations (387)

Summary

  • The paper introduces CoCLR, a co-training method that integrates semantic-class positives into instance-based contrastive training for enhanced video representations.
  • It leverages complementary RGB and optical flow streams to mine positive pairs across modalities, improving sample selection compared to traditional InfoNCE.
  • Evaluations on UCF101 and HMDB51 demonstrate that CoCLR achieves state-of-the-art performance in action recognition and video retrieval with less training data.

Self-supervised Co-training for Video Representation Learning

This paper presents a novel approach to self-supervised video representation learning that exploits complementary multimodal views. The key contributions are the incorporation of semantic-class positives into instance-based InfoNCE training, a novel self-supervised co-training scheme called CoCLR (Co-training Contrastive Learning of visual Representation), and a thorough evaluation on two downstream tasks.

The proposed self-supervised co-training method leverages the complementary RGB and optical-flow streams of video data to enhance feature learning: each view is used to mine positive class samples for the other. This marks a significant departure from traditional instance-discrimination approaches such as InfoNCE, in that it systematically improves the sampling of positives rather than proposing a new loss function. A minimal sketch of the mining step appears below.
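
To make the mining step concrete, here is a minimal PyTorch sketch of one CoCLR update for the RGB network; the symmetric update swaps the roles of the two views, and the two networks are trained in alternation. The multi-instance InfoNCE form, the memory-bank tensors keys_rgb and keys_flow, and the values of k and tau are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def coclr_rgb_step(q_rgb, keys_rgb, q_flow, keys_flow, k=5, tau=0.07):
    """Sketch of a multi-instance InfoNCE step for the RGB network.
    Positives for each RGB query are the clips whose flow embeddings are
    the top-k nearest neighbours of the query's own flow embedding.
    Shapes: q_* are (B, D) queries; keys_* are (N, D) memory banks.
    All shapes, k, and tau are illustrative assumptions."""
    q_rgb, keys_rgb = F.normalize(q_rgb, dim=1), F.normalize(keys_rgb, dim=1)
    q_flow, keys_flow = F.normalize(q_flow, dim=1), F.normalize(keys_flow, dim=1)

    logits = q_rgb @ keys_rgb.t() / tau      # (B, N) scores in the RGB space
    flow_sim = q_flow @ keys_flow.t()        # (B, N) similarities in the flow space
    topk = flow_sim.topk(k, dim=1).indices   # indices of mined positives

    pos_mask = torch.zeros_like(logits, dtype=torch.bool)
    pos_mask.scatter_(1, topk, True)

    # Multi-instance InfoNCE: all mined positives share the numerator.
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    return -log_prob.masked_fill(~pos_mask, float("-inf")).logsumexp(dim=1).mean()
```

Pooling several mined positives in the numerator makes the objective tolerant of a few incorrectly mined neighbours, which is precisely why mining in the complementary view helps.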

Contributions

  1. Incorporation of Semantic-Class Positives: The research demonstrates that integrating semantic-class positives into instance-based training yields notable representation improvements. Through an oracle experiment, it quantifies the performance gap that can be closed when positives are selected using ground-truth semantic labels, a variant termed UberNCE, compared with standard instance-based learning (see the sketch after this list).
  2. Self-supervised Co-training Scheme: CoCLR mines positive samples from different views of the same data, exploiting RGB and optical flow to close the gap between instance-based and supervised contrastive learning. By constructing positive pairs from semantically similar samples across views, as sketched above, CoCLR improves on the InfoNCE representation and approaches the performance of UberNCE.
  3. Downstream Task Evaluation: The paper thoroughly evaluates the learned representations on action recognition and video retrieval using the UCF101 and HMDB51 datasets. CoCLR achieves state-of-the-art or comparable performance relative to existing self-supervised approaches while requiring less pre-training data.
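
The oracle UberNCE objective is essentially supervised contrastive learning. The sketch below treats every same-class sample in the batch as a positive; averaging the log-probabilities over each anchor's positive set follows the common supervised-contrastive formulation and is an assumption here, as are the batch-level setup and the value of tau.

```python
import torch
import torch.nn.functional as F

def ubernce_loss(z, labels, tau=0.07):
    """Oracle contrastive loss sketch: every same-class sample in the
    batch is a positive. Batch-level positives, the averaging over the
    positive set, and tau are illustrative assumptions.
    z: (B, D) embeddings; labels: (B,) ground-truth class ids."""
    z = F.normalize(z, dim=1)
    logits = z @ z.t() / tau
    logits.fill_diagonal_(float("-inf"))       # exclude self-similarity

    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask.fill_diagonal_(False)

    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positive set; anchors
    # whose class appears only once in the batch contribute zero loss.
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```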

Results and Implications

The experimental results demonstrate that self-supervised co-training with CoCLR brings significant efficiency gains: models pre-trained with CoCLR reach higher downstream accuracy with less training data. Notably, CoCLR performs strongly both under linear-probe evaluation for action recognition and on video retrieval; a sketch of the linear-probe protocol follows.
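
For reference, the linear-probe protocol freezes the pretrained backbone and fits only a linear classifier on its features. The sketch below assumes a backbone that maps a batch of clips to 512-dimensional features and uses illustrative optimiser settings; both are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes, epochs=10, device="cuda"):
    """Train a linear classifier on frozen backbone features.
    The feature dimension (512) and optimiser settings are illustrative."""
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad = False              # freeze the pretrained encoder

    clf = nn.Linear(512, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in train_loader:
            with torch.no_grad():
                feats = backbone(clips.to(device))   # frozen features
            loss = criterion(clf(feats), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```

Because only the linear head is trained, probe accuracy directly reflects how linearly separable the frozen representation already is.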

Future Directions and Implications

The methodologies presented in this paper pave the way for further exploration of multi-view representation learning. While the focus here is on video modalities, the approach could plausibly extend to other data types, including images and multi-modal settings such as audio-visual or text-visual learning. This opens opportunities for more efficient learning systems that are less dependent on extensive labeled datasets.

In terms of potential applications, the improved efficiency and accuracy of video representations can contribute significantly to video analysis and understanding, with impact on real-world applications such as video surveillance, sports analytics, and human-computer interaction.

Conclusion

The paper's exploration of co-training complementary video views, and its systematic improvement of positive sampling for contrastive learning, mark a significant step in self-supervised learning research. The implications are promising: more data-efficient training regimens that achieve high performance on challenging tasks. CoCLR stands as an exemplary technique within the landscape of self-supervised video representation learning, with the potential to influence both theoretical research and practical applications.