Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition (2404.11903v1)
Abstract: The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected with a pretrained detector and then fed to an action recognition model that extracts video features and learns object relations for action recognition. However, since the action prior is unknown at the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Specifically, after extracting video features with a base network, we introduce three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Finally, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, thus avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and our approach outperforms state-of-the-art methods on both datasets.
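To make the one-stage design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a DETR-style decoder turns video patch tokens into object proposals, an interaction-aware module re-weights and aggregates them into a video-level feature, and a relation encoder models object relations before classification. All module names, dimensions, query counts, and heads here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a one-stage detection + interaction-reasoning recognizer.
# Hyperparameters (dim, query count, class count) are illustrative assumptions.
import torch
import torch.nn as nn


class PatchObjectDecoder(nn.Module):
    """Decodes a fixed set of object proposals from video patch tokens (DETR-style)."""
    def __init__(self, dim, num_queries=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.box_head = nn.Linear(dim, 4)    # proposal boxes (cx, cy, w, h)
        self.score_head = nn.Linear(dim, 1)  # proposal confidence

    def forward(self, patch_tokens):  # patch_tokens: (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        obj = self.decoder(q, patch_tokens)  # (B, Q, dim) object tokens
        return obj, self.box_head(obj).sigmoid(), self.score_head(obj).sigmoid()


class InteractiveRefineAggregate(nn.Module):
    """Re-weights proposals by interaction relevance and pools them into one video feature."""
    def __init__(self, dim):
        super().__init__()
        self.relevance = nn.Linear(dim, 1)

    def forward(self, obj_tokens, scores):  # scores: (B, Q, 1)
        w = torch.softmax(self.relevance(obj_tokens) + scores, dim=1)
        return (w * obj_tokens).sum(dim=1)  # (B, dim)


class ObjectRelationModel(nn.Module):
    """Encodes pairwise object relations via self-attention over object tokens."""
    def __init__(self, dim, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, obj_tokens):
        return self.encoder(obj_tokens).mean(dim=1)  # (B, dim)


class OneStageActionRecognizer(nn.Module):
    def __init__(self, dim=256, num_classes=174):
        super().__init__()
        self.decoder = PatchObjectDecoder(dim)
        self.refine = InteractiveRefineAggregate(dim)
        self.relation = ObjectRelationModel(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, patch_tokens):  # patch tokens from any video backbone
        obj, boxes, scores = self.decoder(patch_tokens)
        video_feat = self.refine(obj, scores)
        relation_feat = self.relation(obj)
        logits = self.classifier(torch.cat([video_feat, relation_feat], dim=-1))
        return logits, boxes, scores  # boxes/scores permit a joint detection loss


# Example: 2 clips, 8 frames x 196 patches of 256-dim backbone tokens.
tokens = torch.randn(2, 8 * 196, 256)
logits, boxes, scores = OneStageActionRecognizer()(tokens)
```

In the actual framework, the proposal boxes and scores would also be supervised with a detection objective alongside the action classification loss, which is what allows the model to be trained jointly without an off-the-shelf object detector.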
- Xunsong Li
- Pengzhan Sun
- Yangcen Liu
- Lixin Duan
- Wen Li