Weakly Supervised Action Localization by Sparse Temporal Pooling Network (1712.05080v2)

Published 14 Dec 2017 in cs.CV

Abstract: We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.

Citations (341)

Summary

  • The paper introduces a sparse temporal pooling network that combines an attention mechanism with adaptive temporal pooling to localize actions without detailed annotations.
  • It develops Temporal Class Activation Maps (T-CAMs) to generate class-specific activations over time, improving detection precision in untrimmed videos.
  • The approach achieves state-of-the-art performance on THUMOS14 and strong results on ActivityNet1.3, reducing reliance on extensive temporal annotations.

Weakly Supervised Action Localization by Sparse Temporal Pooling Network

The paper "Weakly Supervised Action Localization by Sparse Temporal Pooling Network" addresses the problem of action localization in untrimmed videos under weak supervision. This work leverages convolutional neural networks to temporally localize human actions without requiring detailed temporal annotations, using only video-level class labels during training. The approach uses a novel network architecture that combines an attention module with adaptive temporal pooling, aiming to identify a sparse selection of key video segments pertinent to the target actions.

Key Contributions

The primary contributions of the paper are noteworthy in several respects:

  1. Network Architecture: The authors introduce a deep neural network that localizes actions in an untrimmed video by focusing on a sparse set of representative segments. Sparsity is enforced through a loss function that combines a video-level classification term with a sparsity penalty on the segment-selection weights.
  2. Temporal Class Activation Maps (T-CAMs): The paper presents a method for generating class-specific activation maps in the temporal domain, which identify time intervals corresponding to target actions (see the sketch after this list). This departs from relying solely on attention mechanisms, since T-CAMs provide class-specific localization information.
  3. State-of-the-art Performance: The proposed method achieves state-of-the-art results on the THUMOS14 dataset and strong performance on ActivityNet1.3, despite the challenges posed by weak supervision.
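
Continuing the sketch above, a T-CAM applies the classifier weights to each segment feature rather than to the pooled feature, giving per-class activations over time; weighting these by the class-agnostic attention (here, after a softmax over classes, one common formulation) yields a signal for scoring temporal proposals at inference. Function and variable names are illustrative, not from the paper's code.

```python
import torch

def temporal_cam(model, x):
    # x: (batch, T, feat_dim) segment features.
    # Applying the video-level classifier per segment gives class-specific
    # activations over time: tcam[:, t, c] = w_c^T x_t + b_c.
    tcam = model.classifier(x)                  # (batch, T, num_classes)
    attn = model.attention(x)                   # (batch, T, 1)
    # Modulate by the class-agnostic attention to score temporal proposals.
    weighted_tcam = attn * torch.softmax(tcam, dim=-1)
    return tcam, weighted_tcam
```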

Numerical Results

The experimental results highlight the effectiveness of the proposed approach. On the THUMOS14 dataset, the method outperforms existing weakly supervised techniques and competes favorably with several fully supervised methods, achieving a mean Average Precision (mAP) of 35.5% at an Intersection over Union (IoU) threshold of 0.3, surpassing previous weakly supervised benchmarks. On ActivityNet1.3, the model records an mAP of 20.07% on the testing set, establishing a baseline for future work on weakly supervised action localization for this dataset.
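
For reference, the IoU criterion behind these numbers measures the overlap between a predicted temporal interval and a ground-truth interval; a detection at threshold 0.3 counts as correct when this ratio is at least 0.3. A minimal implementation:

```python
def temporal_iou(pred, gt):
    """IoU of two temporal intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: overlap of 5s over a 13s union -> IoU ~= 0.385, a hit at 0.3.
print(temporal_iou((12.0, 20.0), (15.0, 25.0)))
```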

Implications and Future Work

The implications of this research are twofold:

  • Practical Implications: The reduction in the need for extensive manual annotations can greatly enhance the scalability of action localization systems, making them more feasible for real-world applications where labeling extensive video data is costly and time-consuming.
  • Theoretical Implications: The introduction of a sparsity-enforced attention mechanism coupled with temporal class activation mapping enriches the toolkit available for exploring weakly supervised learning paradigms in temporal sequence data.

Looking forward, several avenues for development are evident. One direction is the integration of finer-grained attention mechanisms that may further improve the precision of segment selection. Additionally, extending the network to handle multi-modal inputs beyond RGB and optical flow could enhance performance by capturing richer temporal dynamics.

In conclusion, the proposed Sparse Temporal Pooling Network demonstrates a robust method for action localization under weak supervision, combining innovative network design with strong empirical results. Its reduced dependence on temporal annotations marks a significant step toward efficient, scalable video understanding systems.