Semi-supervised Active Learning for Video Action Detection (2312.07169v3)
Abstract: In this work, we focus on label-efficient learning for video action detection. We develop a novel semi-supervised active learning approach that uses both labeled and unlabeled data, together with informative sample selection, for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges both for informative sample selection in active learning and for pseudo-label generation in semi-supervised learning (SSL). First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering that enables effective use of pseudo labels for SSL in video action detection by emphasizing the relevant activity regions within a video. We evaluate the proposed approach on three benchmark datasets: UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection, where it outperforms prior work in semi-supervised and weakly-supervised learning, along with several baseline approaches, on both UCF101-24 and JHMDB-21. Next, we show its effectiveness on YouTube-VOS for video object segmentation, demonstrating that it generalizes to other dense prediction tasks in videos. The code and models are publicly available at https://github.com/AKASH2907/semi-sup-active-learning.
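To make the idea of high-pass filtering concrete, the following is a minimal NumPy sketch of a generic frequency-domain high-pass filter applied to a single grayscale frame. It is not the authors' fft-attention implementation, which the abstract does not specify; the function name `high_pass_response` and the `cutoff` parameter are hypothetical and chosen only for illustration. The sketch shows how suppressing low spatial frequencies leaves a response that is weak over a flat background and stronger around a localized region.

```python
import numpy as np


def high_pass_response(frame: np.ndarray, cutoff: int = 4) -> np.ndarray:
    """Suppress low spatial frequencies of a 2D frame (illustrative helper).

    `cutoff` is an assumed radius, in frequency bins, of the low-frequency
    disc that is zeroed out around the DC component.
    """
    height, width = frame.shape
    # Move the zero-frequency (DC) component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    cy, cx = height // 2, width // 2
    yy, xx = np.ogrid[:height, :width]
    low_freq_mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= cutoff ** 2
    # Zero the low-frequency disc; only higher frequencies remain.
    spectrum[low_freq_mask] = 0
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum))
    # Use the magnitude as a non-negative response map.
    return np.abs(filtered)


# Toy frame: flat background with a small bright square standing in for
# a localized region of interest.
frame = np.zeros((64, 64))
frame[28:36, 28:36] = 1.0
response = high_pass_response(frame)

region = np.zeros_like(frame, dtype=bool)
region[28:36, 28:36] = True
inside_mean = response[region].mean()
outside_mean = response[~region].mean()
```

In this toy setting `inside_mean` is larger than `outside_mean`, and a perfectly constant frame yields a response that is numerically zero, since only the DC component carries its energy. How such a response would be turned into an attention signal for pseudo labels is left to the paper itself.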
- Aayush J Rana
- Akash Kumar
- Shruti Vyas
- Yogesh Singh Rawat
- Ayush Singh