Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition (2311.17118v2)
Abstract: Developing end-to-end action recognition models for long videos is fundamental to long-video action understanding. Because end-to-end training on entire long videos is prohibitively expensive, existing works generally train models on short clips trimmed from long videos. However, this ``trimming-then-training'' practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into which clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work builds a weakly supervised end-to-end framework for training recognition models on long videos with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, namely AdaptFocus, estimates where actions are likely to occur, and how likely, in order to adaptively focus on informative action clips for end-to-end training. The effectiveness of the proposed AdaptFocus framework is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, AdaptFocus provides a weakly supervised feature extraction pipeline that yields more robust long-video features, significantly advancing state-of-the-art methods on those tasks. We will release the code and models.
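To make the adaptive-focus idea concrete, below is a minimal, hypothetical sketch in PyTorch. It is not the authors' released code; the function names (`estimate_clip_scores`, `sample_informative_clips`) and the scoring rule are illustrative assumptions. The sketch shows one plausible reading of the abstract: per-clip scores derived from video-level labels alone define a sampling distribution over clips of a long video, and end-to-end training then draws the clips most likely to contain the annotated actions.

```python
import torch


def estimate_clip_scores(clip_logits: torch.Tensor,
                         video_labels: torch.Tensor) -> torch.Tensor:
    """Score each clip by how strongly it supports the video-level labels.

    clip_logits:  (num_clips, num_classes) raw class scores per clip,
                  e.g. from a lightweight pass over the long video.
    video_labels: (num_classes,) multi-hot video-level action labels.
    Returns:      (num_clips,) normalized likelihood that each clip
                  contains one of the annotated actions.
    """
    clip_probs = clip_logits.sigmoid()                 # per-clip class probabilities
    support = (clip_probs * video_labels).sum(dim=1)   # probability mass on labeled actions
    return support / support.sum().clamp(min=1e-8)    # normalize to a distribution


def sample_informative_clips(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Draw k clip indices with probability proportional to their scores,
    so end-to-end training focuses on clips likely to contain the actions."""
    return torch.multinomial(scores, num_samples=k, replacement=False)


# Toy usage: a 20-clip long video with 5 possible action classes.
torch.manual_seed(0)
clip_logits = torch.randn(20, 5)
video_labels = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])  # video-level labels only

scores = estimate_clip_scores(clip_logits, video_labels)
chosen = sample_informative_clips(scores, k=4)
print("clips selected for end-to-end training:", chosen.tolist())
```

In this reading, only the sampled clips are decoded and backpropagated through, which is what keeps end-to-end training affordable on long videos; the paper's actual estimation and focusing mechanism may differ.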
Authors: Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang