
Self-Supervised Learning for Semi-Supervised Temporal Action Proposal (2104.03214v1)

Published 7 Apr 2021 in cs.CV

Abstract: Self-supervised learning presents a remarkable performance to utilize unlabeled data for various video tasks. In this paper, we focus on applying the power of self-supervised methods to improve semi-supervised action proposal generation. Particularly, we design an effective Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) framework. The SSTAP contains two crucial branches, i.e., temporal-aware semi-supervised branch and relation-aware self-supervised branch. The semi-supervised branch improves the proposal model by introducing two temporal perturbations, i.e., temporal feature shift and temporal feature flip, in the mean teacher framework. The self-supervised branch defines two pretext tasks, including masked feature reconstruction and clip-order prediction, to learn the relation of temporal clues. By this means, SSTAP can better explore unlabeled videos, and improve the discriminative abilities of learned action features. We extensively evaluate the proposed SSTAP on THUMOS14 and ActivityNet v1.3 datasets. The experimental results demonstrate that SSTAP significantly outperforms state-of-the-art semi-supervised methods and even matches fully-supervised methods. Code is available at https://github.com/wangxiang1230/SSTAP.

Citations (63)

Summary

  • The paper presents SSTAP, a framework that integrates temporal-aware perturbations and relation-aware pretext tasks to enhance action proposal generation.
  • It employs the mean teacher model to merge self-supervised and semi-supervised strategies, significantly reducing the need for extensive labeled data.
  • Evaluated on THUMOS14 and ActivityNet v1.3, SSTAP achieves performance comparable to fully supervised methods even with only 60% of the labeled training data.

Self-Supervised Learning for Semi-Supervised Temporal Action Proposal

The paper "Self-Supervised Learning for Semi-Supervised Temporal Action Proposal" by Xiang Wang et al., presents a novel framework called Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) designed to enhance the process of generating temporal action proposals in videos. The authors introduce this framework to leverage the advantages of self-supervised learning (SSL) and semi-supervised learning (SSL) in minimizing the dependency on large amounts of labeled data, which is often costly and labor-intensive to obtain.

Technical Contributions

  1. Framework Design: The SSTAP framework integrates two complementary branches:
    • A temporal-aware semi-supervised branch incorporating temporal feature shift and temporal feature flip perturbations. These perturbation techniques aim to increase the robustness and generalization of action proposal models (see the first sketch after this list).
    • A relation-aware self-supervised branch that uses two pretext tasks, masked feature reconstruction and clip-order prediction, to exploit temporal clues and sharpen the discriminative power of action features (see the second sketch after this list).
  2. Methodology Improvements: By employing the mean teacher model, this work extends semi-supervised learning paradigms to temporal action proposal tasks. The proposed temporal perturbations are simple yet effective ways to model temporal dynamics, improving on existing perturbation-based semi-supervised learning methods.
  3. Self-Supervised Pretext Tasks: The adoption of masked feature reconstruction and clip-order prediction tasks emphasizes distilling meaningful representations from unlabeled data by focusing on temporal relationships and sequence understanding.
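
To make the semi-supervised branch concrete, the PyTorch-style sketch below illustrates how temporal feature shift and temporal feature flip could be applied to clip-level features inside a mean teacher loop. The function names, the 25% shift ratio, the zero-padding at sequence boundaries, and the MSE consistency objective are illustrative assumptions; the authors' released code at the linked repository may differ in detail.

```python
import torch
import torch.nn.functional as F

def temporal_feature_shift(feats, shift_ratio=0.25):
    """Shift a fraction of feature channels forward/backward along time.

    feats: (batch, channels, time) clip-level video features.
    The channel split and boundary padding are illustrative choices.
    """
    b, c, t = feats.shape
    n = int(c * shift_ratio) // 2
    out = feats.clone()
    out[:, :n, 1:] = feats[:, :n, :-1]            # shift these channels forward in time
    out[:, :n, 0] = 0
    out[:, n:2 * n, :-1] = feats[:, n:2 * n, 1:]  # shift these channels backward in time
    out[:, n:2 * n, -1] = 0
    return out

def temporal_feature_flip(feats):
    """Reverse the feature sequence along the temporal axis."""
    return torch.flip(feats, dims=[2])

def consistency_loss(student, teacher, feats):
    """Mean-teacher consistency: the student sees perturbed features,
    the teacher (an EMA copy of the student) sees the original ones."""
    perturbed = temporal_feature_shift(feats)     # or temporal_feature_flip(feats);
                                                  # flipped inputs would also require
                                                  # flipping the teacher's temporal output
    with torch.no_grad():
        target = teacher(feats)
    return F.mse_loss(student(perturbed), target)

def update_teacher(student, teacher, ema_decay=0.999):
    """Exponential-moving-average update of the teacher weights."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(ema_decay).add_(s_p.data, alpha=1 - ema_decay)
```

As in standard mean teacher training, labeled clips would contribute a supervised proposal loss while both labeled and unlabeled clips contribute the consistency term.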

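The self-supervised branch can likewise be sketched with two lightweight objectives. In the illustration below, masked feature reconstruction zeroes out random temporal positions and regresses them from context, and clip-order prediction is reduced to a binary original-vs-shuffled classification. The mask ratio, the binary formulation of the ordering task, and the `encoder`/`decoder` modules (standing in for whichever proposal backbone and reconstruction head are used) are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_feature_reconstruction_loss(encoder, decoder, feats, mask_ratio=0.15):
    """Mask random temporal positions and reconstruct them from context.

    feats: (batch, channels, time) clip-level video features.
    The mask ratio and L2 objective are illustrative choices.
    """
    b, c, t = feats.shape
    mask = (torch.rand(b, 1, t, device=feats.device) < mask_ratio).float()
    masked = feats * (1.0 - mask)                 # zero out masked positions
    recon = decoder(encoder(masked))              # reconstruct the full sequence
    # penalize reconstruction error only at the masked positions
    return F.mse_loss(recon * mask, feats * mask)

class ClipOrderHead(nn.Module):
    """Tiny classifier that predicts whether a feature sequence is in order."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 2)          # 2 classes: original vs. shuffled

    def forward(self, feats):                     # feats: (batch, channels, time)
        pooled = feats.mean(dim=-1)               # global temporal pooling
        return self.fc(pooled)

def clip_order_loss(encoder, head, feats):
    """Scramble the temporal order for part of the batch and classify it."""
    b, c, t = feats.shape
    labels = torch.randint(0, 2, (b,), device=feats.device)
    perm = torch.randperm(t, device=feats.device)
    shuffled = feats.clone()
    shuffled[labels == 1] = feats[labels == 1][:, :, perm]  # scramble the time axis
    logits = head(encoder(shuffled))
    return F.cross_entropy(logits, labels)
```

In practice, both pretext losses would be added with small weights to the semi-supervised objective, so that unlabeled videos also shape the shared feature encoder.
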
Results and Implications

The framework was extensively evaluated on the THUMOS14 and ActivityNet v1.3 datasets. The experiments show that SSTAP surpasses state-of-the-art semi-supervised methods and even attains performance comparable to fully supervised methods. Measured by Average Recall (AR) across a range of temporal Intersection over Union (IoU) thresholds, SSTAP consistently delivered high-quality action proposals. Using only 60% of the labeled training data, SSTAP matched or outperformed some fully supervised counterparts, underscoring its efficacy in label-scarce settings.
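
For readers less familiar with these metrics, the minimal sketch below shows how temporal IoU between a proposal and a ground-truth segment is computed and how recall can be averaged over a set of IoU thresholds. The specific threshold range and the convention of reporting AR at fixed proposal counts (AR@AN) follow common benchmark practice and are not taken verbatim from the paper.

```python
import numpy as np

def temporal_iou(prop, gt):
    """Temporal IoU between two [start, end] intervals (seconds or frames)."""
    inter = max(0.0, min(prop[1], gt[1]) - max(prop[0], gt[0]))
    union = (prop[1] - prop[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, ground_truths, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Fraction of ground-truth segments matched by at least one proposal,
    averaged over a range of IoU thresholds."""
    recalls = []
    for thr in thresholds:
        hit = sum(
            any(temporal_iou(p, g) >= thr for p in proposals)
            for g in ground_truths
        )
        recalls.append(hit / max(len(ground_truths), 1))
    return float(np.mean(recalls))
```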

Practical and Theoretical Implications

This research elucidates the potential for further reducing dependency on labeled data in temporal action recognition tasks, with practical applicability in domains such as surveillance and sports analytics, where video data is abundant but annotation is costly. Theoretically, it exemplifies a successful integration of self-supervised learning into semi-supervised pipelines, extending our understanding of how unsupervised exploration of video structure can enhance label-efficient learning. The perturbation techniques and pretext task design offer a blueprint for similar endeavors in other temporal sequence modeling problems.

Future Directions

Potential avenues for future work include:

  • Generalization Across Models: As demonstrated by applying SSTAP to G-TAD, the framework has cross-model applicability. Future research could explore its integration with a broader range of temporal action proposal methods or even other sequence-based tasks.
  • Enhanced Pretext Tasks: Investigating additional or more complex pretext tasks that capture richer semantic content or address finer temporal granularity could further enhance feature learning in this domain.
  • Transfer Learning: The use of self-supervised pretext tasks for transfer to other video understanding tasks could be an area for continued exploration, offering the possibility of using SSTAP-learned features in other contexts.

In conclusion, the SSTAP framework makes a significant contribution to the field by demonstrating that self-supervised methodologies can be effectively intertwined with semi-supervised learning for temporal action proposal tasks, setting a robust foundation for further explorations in this domain.