Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature (2303.12332v3)
Abstract: Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously, using only video-level labels as supervision. Pseudo-label generation is a promising strategy for this challenging problem, but current methods ignore the natural temporal structure of the video, which can provide rich information to assist the generation process. In this paper, we propose a novel weakly-supervised temporal action localization method that infers salient snippet-features. First, we design a saliency inference module that exploits the variation relationship between temporally neighboring snippets to discover salient snippet-features, which reflect significant dynamic changes in the video. Second, we introduce a boundary refinement module that enhances salient snippet-features through an information interaction unit. Third, a discrimination enhancement module strengthens the discriminative nature of snippet-features. Finally, we adopt the refined snippet-features to produce high-fidelity pseudo labels, which can be used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, THUMOS14 and ActivityNet v1.3, demonstrate that our proposed method achieves significant improvements over state-of-the-art methods.
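The idea behind the saliency inference step, scoring each snippet by how much its feature differs from its temporal neighbors, can be illustrated with a small sketch. This is not the authors' implementation: the function names `neighbor_variation` and `salient_indices`, the L1 difference, and the top-k selection rule are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def neighbor_variation(features: np.ndarray) -> np.ndarray:
    """Score each snippet by its feature change relative to both neighbors.

    features: array of shape (T, D), one D-dimensional feature per snippet.
    Returns an array of shape (T,) with one variation score per snippet.
    (Hypothetical helper; the L1 difference is an illustrative assumption.)
    """
    # Repeat the first and last snippet so every snippet has two neighbors.
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    # L1 distance to the previous and to the next snippet.
    prev_diff = np.abs(padded[1:-1] - padded[:-2]).sum(axis=1)
    next_diff = np.abs(padded[2:] - padded[1:-1]).sum(axis=1)
    return prev_diff + next_diff


def salient_indices(features: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k snippets with the largest variation score.

    (Hypothetical helper; top-k selection is an illustrative assumption.)
    """
    scores = neighbor_variation(features)
    # Stable descending sort so ties keep their temporal order.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:k])


if __name__ == "__main__":
    # Toy video: 6 snippets, 2-dim features, with a change between snippets 2 and 3.
    feats = np.zeros((6, 2))
    feats[3:] = 1.0
    print(neighbor_variation(feats))
    print(salient_indices(feats, k=2))
```

In this toy input the feature changes only between snippets 2 and 3, so those two snippets receive the highest scores. In a full pipeline, snippets selected this way would be candidates for further refinement before pseudo labels are produced.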