A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition (2401.00409v1)

Published 31 Dec 2023 in cs.CV and cs.AI

Abstract: Human Interaction Recognition is the task of identifying interactive actions between multiple participants in a given scene, with the aim of recognising both the interactions between entities and their meaning. Single Convolutional Neural Networks (CNNs) have known issues, such as an inability to capture global instance-interaction features or difficulty in training, which leads to ambiguity in action semantics. Transformers, in turn, carry non-trivial computational complexity and are poor at capturing local information and motion features in the image. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNNs and models global dependencies through the Transformer; the two streams respectively model the entity, temporal, and spatial relationships between interacting entities. Specifically, the Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations, while a new multi-branch CNN framework in the CNN-based stream automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layers independently learn the local features of each joint's neighborhood and then aggregate the features of all joints, and the raw skeleton coordinates together with their temporal differences are integrated in a dual-branch paradigm to fuse the skeleton's motion features. A residual structure is also added to speed up training convergence. Finally, the recognition results of the two streams are fused by parallel splicing. Experimental results on diverse and challenging datasets demonstrate that the proposed method better comprehends and infers the meaning and context of various actions, outperforming state-of-the-art methods.
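
The two-stream design described in the abstract lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming skeleton input of shape (batch, 3 coordinates, T frames, V joints): a dual-branch CNN stream over raw coordinates and their temporal differences with a residual shortcut, a Transformer stream with multi-head self-attention over per-frame tokens (a plain linear embedding stands in for the paper's 3D-convolutional embedding), and fusion by concatenation. All module names, layer sizes, and the class count are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNStream(nn.Module):
    # Dual-branch CNN stream: one branch sees raw joint coordinates, the other
    # their frame-to-frame differences (motion). A 1x1 residual shortcut is
    # added to each branch; features are aggregated over time and joints.
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.pose_conv = nn.Conv2d(in_channels, hidden, kernel_size=(3, 1), padding=(1, 0))
        self.motion_conv = nn.Conv2d(in_channels, hidden, kernel_size=(3, 1), padding=(1, 0))
        self.res_proj = nn.Conv2d(in_channels, hidden, kernel_size=1)

    def forward(self, x):
        # x: (batch, 3, T, V)
        motion = F.pad(x[:, :, 1:] - x[:, :, :-1], (0, 0, 1, 0))  # temporal difference, re-padded to T frames
        p = F.relu(self.pose_conv(x) + self.res_proj(x))          # residual shortcut speeds convergence
        m = F.relu(self.motion_conv(motion) + self.res_proj(motion))
        return torch.cat([p, m], dim=1).mean(dim=(2, 3))          # fuse pose + motion, pool over time/joints

class TransformerStream(nn.Module):
    # Transformer stream: per-frame tokens with multi-head self-attention to
    # model global dependencies (a linear embedding stands in for the 3D conv).
    def __init__(self, in_channels=3, joints=25, hidden=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(in_channels * joints, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        b, c, t, v = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(b, t, c * v)  # one token per frame
        return self.encoder(self.embed(tokens)).mean(dim=1)  # pool over time

class THCTSketch(nn.Module):
    # Fuse the two streams by concatenation ("parallel splicing") and classify.
    def __init__(self, num_classes=26, hidden=64):
        super().__init__()
        self.cnn = CNNStream(hidden=hidden)
        self.transformer = TransformerStream(hidden=hidden)
        self.head = nn.Linear(3 * hidden, num_classes)  # 2*hidden (CNN) + hidden (Transformer)

    def forward(self, x):
        return self.head(torch.cat([self.cnn(x), self.transformer(x)], dim=1))

# Example: a batch of 8 clips with 3D coordinates, 32 frames, 25 joints.
logits = THCTSketch()(torch.randn(8, 3, 32, 25))
print(logits.shape)  # torch.Size([8, 26])

Concatenating the stream outputs before the classifier is one reading of "parallel splicing"; the abstract does not specify the exact fusion point, so the head above should be treated as a placeholder.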
