Real-time 3D human action recognition based on Hyperpoint sequence (2111.08492v3)
Abstract: Real-time 3D human action recognition has broad industrial applications, such as surveillance, human-computer interaction, and healthcare monitoring. Most existing point cloud sequence networks recognize 3D human actions by capturing spatio-temporal local structures through complex spatio-temporal local encoding. To simplify the point cloud sequence modeling task, we propose a lightweight and effective point cloud sequence network, referred to as SequentialPointNet, for real-time 3D action recognition. Instead of capturing spatio-temporal local structures, SequentialPointNet encodes the temporal evolution of static appearances to recognize human actions. First, we define a novel type of point data, the Hyperpoint, to better describe temporally changing human appearances, and we provide a theoretical foundation clarifying the information-equivalence property of converting point cloud sequences into Hyperpoint sequences. Second, we decompose the point cloud sequence modeling task into a Hyperpoint embedding task and a Hyperpoint sequence modeling task. Specifically, for Hyperpoint embedding, static point cloud techniques are employed to convert point cloud sequences into Hyperpoint sequences, which introduces inherent frame-level parallelism; for Hyperpoint sequence modeling, a Hyperpoint-Mixer module is designed as the basic building block for learning the spatio-temporal features of human actions. Extensive experiments on three widely used 3D action recognition datasets demonstrate that SequentialPointNet achieves competitive classification performance while running up to 10× faster than existing approaches.
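The two-stage design described in the abstract lends itself to a compact sketch: a frame-level, PointNet-style encoder that maps each point cloud frame to a single Hyperpoint, followed by an MLP-Mixer-style module applied across the resulting Hyperpoint sequence. The PyTorch sketch below is an illustration under stated assumptions, not the paper's implementation: the class names (`HyperpointEmbedding`, `HyperpointMixer`, `SequentialPointNetSketch`), layer widths, sequence length, and the exact mixing scheme are hypothetical; only the overall decomposition (per-frame embedding with frame-level parallelism, then sequence-level mixing, then classification) follows the abstract.

```python
# Minimal sketch of the two-stage design from the abstract. All names and
# hyperparameters are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class HyperpointEmbedding(nn.Module):
    """PointNet-style frame encoder: each point cloud frame -> one Hyperpoint.
    Frames are processed independently, giving frame-level parallelism."""

    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, frames):
        # frames: (B, T, N, 3) -> fold time into the batch axis so all
        # frames are encoded in parallel
        B, T, N, C = frames.shape
        x = frames.reshape(B * T, N, C).transpose(1, 2)  # (B*T, 3, N)
        x = self.mlp(x)                                  # (B*T, feat_dim, N)
        x = x.max(dim=2).values                          # symmetric max-pool over points
        return x.reshape(B, T, -1)                       # Hyperpoint sequence (B, T, feat_dim)


class HyperpointMixer(nn.Module):
    """MLP-Mixer-style block (an assumed stand-in for the Hyperpoint-Mixer):
    a temporal-mixing MLP across frames, then a channel-mixing MLP, each residual."""

    def __init__(self, seq_len, feat_dim, hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(feat_dim)
        self.temporal = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))
        self.norm2 = nn.LayerNorm(feat_dim)
        self.channel = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, feat_dim))

    def forward(self, x):                                # x: (B, T, feat_dim)
        y = self.norm1(x).transpose(1, 2)                # mix across the T dimension
        x = x + self.temporal(y).transpose(1, 2)
        x = x + self.channel(self.norm2(x))              # mix across channels
        return x


class SequentialPointNetSketch(nn.Module):
    def __init__(self, seq_len=24, feat_dim=256, num_classes=60):
        super().__init__()
        self.embed = HyperpointEmbedding(feat_dim=feat_dim)
        self.mixer = HyperpointMixer(seq_len, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):                           # frames: (B, T, N, 3)
        h = self.embed(frames)                           # Hyperpoint sequence
        h = self.mixer(h)
        return self.head(h.mean(dim=1))                  # temporal average + classifier


# Usage: a batch of 2 sequences, each 24 frames of 512 points.
model = SequentialPointNetSketch()
logits = model(torch.randn(2, 24, 512, 3))               # -> (2, 60)
```

Note how the per-frame encoder folds the time axis into the batch axis, which is where the frame-level parallelism claimed in the abstract comes from, and how the symmetric max-pool makes each Hyperpoint invariant to point ordering within a frame.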