Unsupervised Temporal Action Localization via Self-paced Incremental Learning (2312.07384v1)
Abstract: Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised and weakly supervised methods depend heavily on extensive labeled temporal boundaries and action categories, whose annotation is labor-intensive and time-consuming. Although some unsupervised methods have adopted the "iterative clustering and localization" paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudo-labels for model training. To address these limitations, we present a novel self-paced incremental learning model that enhances clustering and localization training simultaneously, thereby enabling more effective unsupervised TAL. Concretely, we improve clustering confidence by exploring contextual, feature-robust visual information. We then design two incremental instance learning strategies (constant-speed and variable-speed) for easy-to-hard model training, thus ensuring the reliability of the video pseudo-labels and further improving overall localization performance. Extensive experiments on two public datasets substantiate the superiority of our model over several state-of-the-art competitors.
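To make the "iterative clustering and localization" loop with an easy-to-hard schedule concrete, below is a minimal, self-contained sketch. It is an illustrative assumption rather than the paper's released implementation: scikit-learn KMeans stands in for the clustering step, the confidence proxy (distance to the assigned cluster center) and the constant-/variable-speed fractions are plausible placeholders, and the localizer update is stubbed out.

```python
# Sketch of self-paced incremental instance selection for unsupervised TAL.
# Assumptions (not from the paper): KMeans clustering, distance-based confidence,
# and the specific constant-/variable-speed schedules below.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))    # toy video-level features
num_actions, total_rounds = 10, 5

def fraction(r, mode):
    """Fraction of pseudo-labeled videos admitted at round r (0-indexed)."""
    if mode == "constant":                # constant-speed: equal step each round
        return (r + 1) / total_rounds
    return ((r + 1) / total_rounds) ** 2  # variable-speed: slow start, accelerating later

for r in range(total_rounds):
    # Clustering step: assign pseudo-labels to all videos
    km = KMeans(n_clusters=num_actions, n_init=10, random_state=0).fit(features)
    pseudo_labels = km.labels_
    # Confidence proxy: negative distance to the assigned cluster center
    confidence = -np.min(km.transform(features), axis=1)
    order = np.argsort(-confidence)       # most confident ("easy") videos first
    k = max(1, int(fraction(r, "constant") * len(order)))
    selected = order[:k]                  # reliable subset used for this round's training
    # train_localizer(features[selected], pseudo_labels[selected])  # localization update (omitted)
    print(f"round {r}: training on {k} pseudo-labeled videos")
```

In each round, only the most confident pseudo-labeled videos are admitted, and the admitted fraction grows according to the chosen schedule, so the localizer is trained from easy to hard instances as clustering quality improves.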