Proposal-based Temporal Action Localization with Point-level Supervision (2310.05511v1)
Abstract: Point-level supervised temporal action localization (PTAL) aims to recognize and localize actions in untrimmed videos where only a single point (frame) within each action instance is annotated in the training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, in which the input video is divided into non-overlapping short snippets and action classification is performed independently on each snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on isolated short snippets that contain limited temporal information; as a result, the classifier focuses on a few easily distinguishable snippets rather than discovering the whole action instance without missing relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration, which incorporate more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of the pseudo labels. Experiments show that the proposed method achieves performance competitive with or superior to state-of-the-art methods, and even some fully supervised methods, on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
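To make the proposal-based idea concrete, here is a minimal Python sketch of how flexible-duration proposals could be enumerated around an annotated point and ranked by an outer-inner confidence contrast. The function names, the candidate durations, and the boundary-margin heuristic are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def generate_proposals(point_idx, num_snippets, durations=(4, 8, 16, 32)):
    """Enumerate candidate proposals (start, end) of flexible duration that
    all contain the annotated point. Durations are in snippets and are an
    illustrative choice, not taken from the paper."""
    proposals = []
    for d in durations:
        for offset in range(d):
            start = point_idx - offset
            end = start + d  # end is exclusive
            if start >= 0 and end <= num_snippets:
                proposals.append((start, end))
    return proposals

def score_proposal(class_scores, start, end, margin=2):
    """Score a proposal as its mean inside-confidence minus the mean
    confidence of short regions just outside its boundaries (a common
    outer-inner contrast heuristic; the paper's evaluation module may
    differ in detail)."""
    inner = class_scores[start:end].mean()
    outer = np.concatenate([class_scores[max(0, start - margin):start],
                            class_scores[end:end + margin]])
    return inner - (outer.mean() if outer.size else 0.0)

# Toy usage: pick the best-scoring proposal around a labeled point.
snippet_scores = np.random.rand(100)   # per-snippet confidence for one class
candidates = generate_proposals(point_idx=40, num_snippets=100)
start, end = max(candidates, key=lambda p: score_proposal(snippet_scores, *p))
```

Because every candidate in this sketch is constrained to contain the annotated point, the search evaluates whole intervals rather than classifying snippets in isolation, which is the contrast with MIL-style methods that the abstract draws.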