
Real-time 3D human action recognition based on Hyperpoint sequence (2111.08492v3)

Published 16 Nov 2021 in cs.CV

Abstract: Real-time 3D human action recognition has broad industrial applications, such as surveillance, human-computer interaction, and healthcare monitoring. By relying on complex spatio-temporal local encoding, most existing point cloud sequence networks capture spatio-temporal local structures to recognize 3D human actions. To simplify the point cloud sequence modeling task, we propose a lightweight and effective point cloud sequence network referred to as SequentialPointNet for real-time 3D action recognition. Instead of capturing spatio-temporal local structures, SequentialPointNet encodes the temporal evolution of static appearances to recognize human actions. Firstly, we define a novel type of point data, Hyperpoint, to better describe the temporally changing human appearances. A theoretical foundation is provided to clarify the information equivalence property for converting point cloud sequences into Hyperpoint sequences. Secondly, the point cloud sequence modeling task is decomposed into a Hyperpoint embedding task and a Hyperpoint sequence modeling task. Specifically, for Hyperpoint embedding, static point cloud techniques are employed to convert point cloud sequences into Hyperpoint sequences, which introduces inherent frame-level parallelism; for Hyperpoint sequence modeling, a Hyperpoint-Mixer module is designed as the basic building block to learn the spatio-temporal features of human actions. Extensive experiments on three widely used 3D action recognition datasets demonstrate that the proposed SequentialPointNet achieves competitive classification performance while running up to 10X faster than existing approaches.
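The two-stage decomposition described in the abstract, a frame-level Hyperpoint embedding followed by Hyperpoint sequence modeling, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: the module names, layer widths, frame count, and classifier head are assumptions. It only mirrors the overall structure of a PointNet-style per-frame encoder (the source of the frame-level parallelism mentioned above) feeding a Mixer-style block that mixes features across frames and channels.

```python
# Hedged sketch of SequentialPointNet's two-stage decomposition.
# All layer sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class HyperpointEmbedding(nn.Module):
    """PointNet-style encoder applied to each frame independently,
    so every frame of the point cloud sequence can be embedded in parallel."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, x):                            # x: (B, T, N, 3) point cloud sequence
        B, T, N, C = x.shape
        x = x.view(B * T, N, C).transpose(1, 2)      # (B*T, 3, N)
        feat = self.mlp(x).max(dim=-1).values        # symmetric max pool over points
        return feat.view(B, T, -1)                   # Hyperpoint sequence (B, T, feat_dim)

class HyperpointMixer(nn.Module):
    """Mixer-style block that alternates mixing along the temporal (frame)
    axis and the channel axis of the Hyperpoint sequence."""
    def __init__(self, num_frames, feat_dim, hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(feat_dim)
        self.frame_mix = nn.Sequential(
            nn.Linear(num_frames, hidden), nn.GELU(), nn.Linear(hidden, num_frames))
        self.norm2 = nn.LayerNorm(feat_dim)
        self.channel_mix = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, feat_dim))

    def forward(self, h):                            # h: (B, T, feat_dim)
        h = h + self.frame_mix(self.norm1(h).transpose(1, 2)).transpose(1, 2)
        h = h + self.channel_mix(self.norm2(h))
        return h

class SequentialPointNetSketch(nn.Module):
    def __init__(self, num_frames=20, num_classes=60):
        super().__init__()
        self.embed = HyperpointEmbedding()
        self.mixer = HyperpointMixer(num_frames, 256)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):                            # x: (B, T, N, 3)
        h = self.mixer(self.embed(x))                # (B, T, 256)
        return self.head(h.max(dim=1).values)        # pool over frames, then classify

logits = SequentialPointNetSketch()(torch.randn(2, 20, 512, 3))  # -> (2, 60)
```

Because the per-frame encoder uses only a shared pointwise MLP and a symmetric max pooling, each frame can be embedded independently, which is the property the abstract credits for the method's real-time speed; all temporal reasoning is deferred to the lightweight mixing stage over the Hyperpoint sequence.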
