Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition (2404.02624v1)
Abstract: Skeleton-based gesture recognition methods have achieved considerable success using Graph Convolutional Networks (GCNs). In addition, context-dependent adaptive topologies, which supply neighborhood vertex information, and attention mechanisms help a model better represent actions. In this paper, we propose a hybrid self-attention GCN model, the Multi-Scale Spatial-Temporal self-attention GCN (MSST-GCN), to improve modeling ability and achieve state-of-the-art results on several datasets. We use a spatial self-attention module with adaptive topology to capture intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations between frames for each node. These two modules are followed by a multi-scale convolution network with dilations, which captures not only the long-range temporal dependencies of joints but also the long-range spatial dependencies (i.e., long-distance dependencies) of node temporal behaviors. The resulting features are combined into high-level spatial-temporal representations, and a softmax classifier outputs the predicted action.
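The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the adaptive topology enters the attention map, whether the two attention modules run in series or in parallel, or how the multi-scale branches are merged. The sketch assumes the topology is a learnable joint-by-joint matrix added to the spatial attention weights, that spatial attention precedes temporal attention, and that the dilated branches are concatenated along the channel axis. All function names, parameter names, and tensor sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(x, Wq, Wk, Wv, topology):
    """Attention among the V joints of each frame. x has shape (T, V, C)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                          # each (T, V, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (T, V, V)
    # Assumption: the adaptive topology is a (V, V) matrix added to the attention map.
    attn = softmax(scores, axis=-1) + topology
    return attn @ v                                            # (T, V, D)


def temporal_self_attention(x, Wq, Wk, Wv):
    """Attention among the T frames of each joint. x has shape (T, V, C)."""
    xt = x.transpose(1, 0, 2)                                  # (V, T, C)
    q, k, v = xt @ Wq, xt @ Wk, xt @ Wv                        # each (V, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (V, T, T)
    out = softmax(scores, axis=-1) @ v                         # (V, T, D)
    return out.transpose(1, 0, 2)                              # (T, V, D)


def dilated_temporal_conv(x, kernel, dilation):
    """1D convolution along time with zero 'same' padding.

    x: (T, V, C); kernel: (K, C, C_out) with odd K.
    """
    K = kernel.shape[0]
    T = x.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros((T, x.shape[1], kernel.shape[2]))
    for i in range(K):
        # Tap i reads the input shifted by i * dilation frames.
        out += xp[i * dilation : i * dilation + T] @ kernel[i]
    return out


def msst_forward(x, params):
    """Spatial attention, then temporal attention, then multi-scale dilated convs."""
    h = spatial_self_attention(x, *params["spatial"], params["topology"])
    h = temporal_self_attention(h, *params["temporal"])
    branches = [
        dilated_temporal_conv(h, kern, d)
        for kern, d in zip(params["kernels"], params["dilations"])
    ]
    feats = np.concatenate(branches, axis=-1)    # (T, V, sum of branch channels)
    pooled = feats.mean(axis=(0, 1))             # global average over frames and joints
    return softmax(pooled @ params["classifier"])  # class probabilities


def init_params(rng, V=25, C=3, D=8, branch_out=4, dilations=(1, 2, 3), n_classes=5):
    """Random parameters for the sketch; all sizes are illustrative."""
    return {
        "spatial": [0.1 * rng.standard_normal((C, D)) for _ in range(3)],
        "temporal": [0.1 * rng.standard_normal((D, D)) for _ in range(3)],
        "topology": 0.01 * rng.standard_normal((V, V)),
        "kernels": [0.1 * rng.standard_normal((3, D, branch_out)) for _ in dilations],
        "dilations": list(dilations),
        "classifier": 0.1 * rng.standard_normal((branch_out * len(dilations), n_classes)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skeleton = rng.standard_normal((16, 25, 3))   # 16 frames, 25 joints, 3D coordinates
    probs = msst_forward(skeleton, init_params(rng))
    print(probs.shape, probs.sum())
```

Larger dilations widen the temporal receptive field of a branch without adding parameters, which is one common way to reach long-range temporal dependencies; a trained model would learn all of the matrices above rather than sample them at random.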