STMixer: A One-Stage Sparse Action Detector (2404.09842v1)
Abstract: Traditional video action detectors typically adopt a two-stage pipeline: a person detector first generates actor boxes, and then 3D RoIAlign extracts actor-specific features for classification. This paradigm requires multi-stage training and inference, and its feature sampling is constrained to the inside of each box, failing to effectively leverage the richer context information outside it. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and thus suffer from inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which gives the detector the flexibility to mine a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines: STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
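The two core designs in the abstract, query-based adaptive spatio-temporal sampling and decoupled spatial/temporal feature mixing, can be illustrated with a minimal PyTorch sketch. The module name `DecoupledSamplingMixer`, the point layout (4 temporal x 8 spatial points per query), and the way the mixing weights are generated from queries are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledSamplingMixer(nn.Module):
    """One decoding step: adaptive point sampling + decoupled space/time mixing.
    A sketch of the ideas in the abstract; names and shapes are assumptions."""

    def __init__(self, dim=256, t_points=4, s_points=8):
        super().__init__()
        self.t_points, self.s_points = t_points, s_points
        n = t_points * s_points
        # Each query regresses normalized (x, y, t) locations for its n points,
        # so sampling is free to roam the whole spatio-temporal feature volume.
        self.loc_head = nn.Linear(dim, n * 3)
        # Each query also generates its own mixing matrices (dynamic weights),
        # one for the spatial point axis and one for the temporal point axis.
        self.spatial_mix = nn.Linear(dim, s_points * s_points)
        self.temporal_mix = nn.Linear(dim, t_points * t_points)
        # Fold the mixed point features back into the query vector.
        self.out_proj = nn.Linear(n * dim, dim)

    def forward(self, feats, queries):
        # feats:   (B, C, T, H, W) backbone feature volume
        # queries: (B, N, C) learnable action queries
        B, C, T, H, W = feats.shape
        N = queries.shape[1]
        n = self.t_points * self.s_points

        # 1) Query-based adaptive sampling over the entire clip.
        #    tanh squashes locations to [-1, 1], the range grid_sample expects.
        locs = torch.tanh(self.loc_head(queries)).view(B, N, n, 1, 3)
        sampled = F.grid_sample(feats, locs, align_corners=False)  # (B, C, N, n, 1)
        x = sampled.squeeze(-1).permute(0, 2, 3, 1)                # (B, N, n, C)
        x = x.reshape(B, N, self.t_points, self.s_points, C)

        # 2) Decoupled mixing: first across spatial points, then across frames.
        ws = self.spatial_mix(queries).view(B, N, self.s_points, self.s_points)
        x = torch.einsum('bntsc,bnsk->bntkc', x, ws)
        wt = self.temporal_mix(queries).view(B, N, self.t_points, self.t_points)
        x = torch.einsum('bntsc,bntk->bnksc', x, wt)

        # 3) Update the queries with the mixed features (residual connection).
        return queries + self.out_proj(F.relu(x.reshape(B, N, -1)))


# Toy usage: 2 clips, 256-d features over 8 frames at 14x14, 100 action queries.
mixer = DecoupledSamplingMixer()
feats = torch.randn(2, 256, 8, 14, 14)
queries = torch.randn(2, 100, 256)
print(mixer(feats, queries).shape)  # torch.Size([2, 100, 256])
```

The sketch shows why the design is "sparse": each query touches only a handful of sampled points rather than a dense RoI grid, and the sampling locations are not constrained to a box.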
Authors: Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang