Semi-supervised Active Learning for Video Action Detection (2312.07169v3)

Published 12 Dec 2023 in cs.CV

Abstract: In this work, we focus on label-efficient learning for video action detection. We develop a novel semi-supervised active learning approach that uses both labeled and unlabeled data, together with informative sample selection, for action detection. Video action detection requires spatio-temporal localization along with classification, which poses challenges both for selecting informative samples in active learning and for generating pseudo labels in semi-supervised learning (SSL). First, we propose NoiseAug, a simple augmentation strategy that effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering that enables effective use of pseudo labels for SSL in video action detection by emphasizing the relevant activity regions within a video. We evaluate the proposed approach on three benchmark datasets: UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection, where it outperforms prior work in semi-supervised and weakly-supervised learning, along with several baseline approaches, on both UCF101-24 and JHMDB-21. Next, we show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its ability to generalize to other dense prediction tasks in videos. The code and models are publicly available at https://github.com/AKASH2907/semi-sup-active-learning
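
To make the idea of high-pass filtering concrete, here is a minimal NumPy sketch of a generic frequency-domain high-pass filter applied to a single 2D map. It is not the authors' fft-attention implementation, which the abstract does not specify; the function name `high_pass_filter`, the `cutoff` parameter, and the circular mask are assumptions chosen for illustration only.

```python
import numpy as np

def high_pass_filter(frame, cutoff):
    """Suppress low spatial frequencies of a 2D array (illustrative only).

    `cutoff` is a hypothetical radius, in frequency bins, around the
    zero-frequency component; everything inside it is zeroed out.
    """
    h, w = frame.shape
    # Move the zero-frequency (DC) component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    # Distance of every frequency bin from the spectrum centre.
    ys, xs = np.ogrid[:h, :w]
    dist = np.sqrt((ys - h // 2) ** 2 + (xs - w // 2) ** 2)
    # Zero the low-frequency bins, keeping only fine detail such as edges.
    spectrum[dist <= cutoff] = 0
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum))
    return np.abs(filtered)

# A flat map carries no high-frequency content; a map with an edge does.
flat = np.ones((8, 8))
step = np.zeros((8, 8))
step[:, 4:] = 1.0
flat_response = high_pass_filter(flat, cutoff=1)
step_response = high_pass_filter(step, cutoff=1)
```

In this sketch the flat map produces an essentially zero response while the map containing an edge does not, which is the general intuition behind using a high-pass response to highlight regions with strong local variation.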

Authors (5)
  1. Aayush J Rana (4 papers)
  2. Akash Kumar (87 papers)
  3. Shruti Vyas (14 papers)
  4. Yogesh Singh Rawat (14 papers)
  5. Ayush Singh (24 papers)