
SiT-MLP: A Simple MLP with Point-wise Topology Feature Learning for Skeleton-based Action Recognition

Published 30 Aug 2023 in cs.CV | (2308.16018v4)

Abstract: Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, previous GCN-based methods rely excessively on elaborate human priors and construct complex feature aggregation mechanisms, which limits the generalizability and effectiveness of the networks. To address these problems, we propose a novel Spatial Topology Gating Unit (STGU), an MLP-based variant without extra priors, to capture the co-occurrence topology features that encode the spatial dependency across all joints. To learn point-wise topology features, STGU introduces a new gate-based feature interaction mechanism that activates the features point-to-point using an attention map generated from the input sample. Based on the STGU, we propose SiT-MLP, the first MLP-based model for skeleton-based action recognition. Compared with previous methods on three large-scale datasets, SiT-MLP achieves competitive performance. In addition, SiT-MLP significantly reduces the number of parameters while maintaining favorable results. The code will be available at https://github.com/BUPTSJZhang/SiT-MLP.
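
As a rough illustration of the gating idea described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single skeleton frame of shape (joints, channels), builds a sample-dependent joint-to-joint attention map, and uses it to gate a projected copy of the features element-wise. The function name `gating_unit_sketch`, the weight names, the softmax normalization, and all dimensions are hypothetical choices made for this example; the actual STGU design may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gating_unit_sketch(x, w_q, w_k, w_gate, w_value):
    """Hypothetical gate-based feature interaction over skeleton joints.

    x        : (V, C) features for V joints with C channels.
    w_q, w_k : (C, D) projections used to build the attention map (assumed).
    w_gate   : (C, C_out) projection for the gating branch (assumed).
    w_value  : (C, C_out) projection for the value branch (assumed).
    """
    q = x @ w_q
    k = x @ w_k
    # Sample-dependent attention map over all joint pairs, shape (V, V).
    # No predefined skeleton adjacency is used here.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Aggregate features across joints to form the gate, shape (V, C_out).
    gate = attn @ (x @ w_gate)
    # Point-wise (element-wise) activation of the value branch by the gate.
    value = x @ w_value
    return value * gate, attn


# Toy usage with random data; 25 joints matches the NTU RGB+D skeleton layout.
rng = np.random.default_rng(0)
V, C, D = 25, 16, 8
x = rng.standard_normal((V, C))
w_q = rng.standard_normal((C, D))
w_k = rng.standard_normal((C, D))
w_gate = rng.standard_normal((C, C))
w_value = rng.standard_normal((C, C))

out, attn = gating_unit_sketch(x, w_q, w_k, w_gate, w_value)
print(out.shape, attn.shape)
```

Because the attention map is computed from the input itself, each sample gets its own joint-to-joint weighting, which is the property the abstract contrasts with hand-crafted topology priors.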
