
SF-Net: Single-Frame Supervision for Temporal Action Localization (2003.06845v6)

Published 15 Mar 2020 in cs.CV, cs.LG, and eess.IV

Abstract: In this paper, we study an intermediate form of supervision, i.e., single-frame supervision, for temporal action localization (TAL). To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action. This can significantly reduce the labor cost of obtaining full supervision which requires annotating the action boundary. Compared to the weak supervision that only annotates the video-level label, the single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. To make full use of such single-frame supervision, we propose a unified system called SF-Net. First, we propose to predict an actionness score for each video frame. Along with a typical category score, the actionness score can provide comprehensive information about the occurrence of a potential action and aid the temporal boundary refinement during inference. Second, we mine pseudo action and background frames based on the single-frame annotations. We identify pseudo action frames by adaptively expanding each annotated single frame to its nearby, contextual frames and we mine pseudo background frames from all the unannotated frames across multiple videos. Together with the ground-truth labeled frames, these pseudo-labeled frames are further used for training the classifier. In extensive experiments on THUMOS14, GTEA, and BEOID, SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization. Notably, SF-Net achieves comparable results to its fully-supervised counterpart which requires much more resource intensive annotations. The code is available at https://github.com/Flowerfan/SF-Net.

Authors (7)
  1. Fan Ma (26 papers)
  2. Linchao Zhu (78 papers)
  3. Yi Yang (856 papers)
  4. Shengxin Zha (6 papers)
  5. Gourab Kundu (4 papers)
  6. Matt Feiszli (30 papers)
  7. Zheng Shou (16 papers)
Citations (129)

Summary

  • The paper introduces single-frame supervision to significantly reduce annotation effort in temporal action localization.
  • The methodology employs actionness score prediction and pseudo label generation to effectively distinguish between action and background frames.
  • Experimental evaluations on datasets like THUMOS14 demonstrate that SF-Net approaches the performance of fully supervised methods.

Overview of SF-Net: Single-Frame Supervision for Temporal Action Localization

Temporal Action Localization (TAL) has emerged as a significant challenge in computer vision, bridging the gap between temporal proposals and accurate action classification in untrimmed videos. The SF-Net paper addresses TAL by introducing an intermediate form of supervision termed single-frame supervision. This approach requires annotators to identify only a single frame within the temporal span of an action, enabling a substantial reduction in annotation labor compared to full supervision, which demands precise boundary annotations.
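To make the comparison concrete, the three supervision levels can be pictured as annotation records like the following. This is a hypothetical schema for illustration only; the field names are assumptions, not the actual format of THUMOS14 or the other datasets.

```python
# Hypothetical annotation records; all field names are illustrative assumptions.
full_supervision = {"video": "v_0001", "label": "HighJump",
                    "start_sec": 12.4, "end_sec": 17.9}       # exact boundaries
single_frame = {"video": "v_0001", "label": "HighJump",
                "frame_sec": 14.0}                            # one frame inside the action
weak_supervision = {"video": "v_0001", "label": "HighJump"}   # video-level label only
```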

Methodology

The novelty of SF-Net lies in leveraging single-frame supervision to build a system that substantially improves TAL performance at a fraction of the annotation cost. In contrast to fully supervised TAL, where complete temporal annotations are provided, SF-Net uses a minimal single-frame annotation scheme, offering a middle ground between weakly supervised and fully supervised methods. The system consists of three components:

  1. Actionness Score Prediction: SF-Net assigns each video frame an actionness score that reflects the likelihood an action is present and helps refine temporal boundaries during inference (see the first sketch after this list).
  2. Pseudo Label Generation: The system mines pseudo action frames by adaptively expanding outward from each annotated frame, and pseudo background frames from unannotated frames across multiple videos (see the second sketch after this list). This enriches the training signal without additional annotation cost.
  3. Classification and Actionness Modules: SF-Net couples a classification module that outputs per-frame category scores with an actionness module that predicts the probability a frame belongs to an action. Jointly optimizing the two lets SF-Net distinguish action from background frames and approach the performance of fully supervised methods.
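
To ground items 1 and 3, the sketch below shows one way the two heads could be wired: a shared frame embedding feeding a per-frame classifier and a class-agnostic actionness head, with candidate segments proposed at inference by thresholding the actionness scores. This is a minimal illustration, not the official SF-Net code; the feature dimension, layer sizes, and threshold are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SFNetSketch(nn.Module):
    """Minimal two-head frame scorer (illustrative, not the official SF-Net)."""

    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.classifier = nn.Linear(512, num_classes + 1)  # +1 for background
        self.actionness = nn.Linear(512, 1)  # class-agnostic action likelihood

    def forward(self, feats):  # feats: (T, feat_dim) pre-extracted frame features
        h = self.embed(feats)
        cls_scores = self.classifier(h)              # (T, num_classes + 1)
        act_scores = self.actionness(h).squeeze(-1)  # (T,)
        return cls_scores, act_scores

def propose_segments(act_scores, thresh=0.5):
    """Group consecutive frames whose actionness exceeds `thresh` into
    candidate segments; a simplified stand-in for boundary refinement."""
    keep = torch.sigmoid(act_scores) > thresh
    segments, start = [], None
    for t, on in enumerate(keep.tolist()):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(keep) - 1))
    return segments

# Usage on a random 100-frame clip:
model = SFNetSketch()
cls_scores, act_scores = model(torch.randn(100, 2048))
print(propose_segments(act_scores))
```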
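Item 2's expansion step can be pictured as below: starting from one annotated frame, neighboring frames are absorbed while the classifier still assigns the annotated class a high probability, and unannotated frames left outside all expanded segments would serve as pseudo-background candidates. The fixed threshold here is a deliberate simplification of the paper's adaptive expansion rule.

```python
import torch

def expand_pseudo_action_frames(cls_scores, anchor_idx, label, thresh=0.5):
    """Grow a pseudo-action segment outward from one annotated frame
    (`anchor_idx`) while the predicted probability of `label` stays above
    `thresh`. A simplified stand-in for SF-Net's adaptive expansion."""
    probs = cls_scores.softmax(dim=-1)[:, label]
    left = right = anchor_idx
    while left - 1 >= 0 and probs[left - 1] > thresh:
        left -= 1
    while right + 1 < len(probs) and probs[right + 1] > thresh:
        right += 1
    return list(range(left, right + 1))

# Usage with random scores for a 100-frame clip (21 classes incl. background)
# and a single annotation at frame 40 for class 3:
cls_scores = torch.randn(100, 21)
pseudo_frames = expand_pseudo_action_frames(cls_scores, anchor_idx=40, label=3)
```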

Experimental Evaluation

SF-Net has been evaluated extensively on THUMOS14, GTEA, and BEOID. The results show that SF-Net not only surpasses state-of-the-art weakly supervised methods but also reaches performance comparable to fully supervised systems under certain conditions. The model's ability to localize actions without full boundary annotations for each instance marks a significant advance in TAL research.

Implications and Future Directions

The practical implications of SF-Net are notable, especially in scenarios where precise action boundaries are less critical. For example, in surveillance applications, where detecting that an event occurred matters more than marking its exact duration, SF-Net offers an efficient, far less annotation-intensive solution.

Theoretically, the approach opens new avenues for exploring intermediate forms of supervision in TAL and potentially other computer vision tasks. Such hybrid supervision paradigms hold considerable potential for easing the annotation bottleneck while maintaining model performance.

Looking forward, integrating more advanced feature representations and dynamic thresholding strategies could enhance SF-Net's adaptability to varied applications and datasets. Extending the strategy to other domains where exhaustive annotation is impractical, such as reinforcement learning and natural language processing, is a promising direction for future investigation.
