
SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network (2306.17574v2)

Published 30 Jun 2023 in cs.CV

Abstract: Recent technological advancements have significantly expanded the potential of human action recognition by harnessing 3D data. This data provides a richer understanding of actions, including depth information that enables more accurate analysis of spatial and temporal characteristics. In this context, we study the challenge of 3D human action recognition. Unlike prior methods that rely on sampling 2D depth images, skeleton points, or point clouds, which often demand substantial memory and can handle only short sequences, we introduce a novel approach for 3D human action recognition, denoted SpATr (Spiral Auto-encoder and Transformer Network), specifically designed for fixed-topology mesh sequences. The SpATr model disentangles space and time in the mesh sequences. A lightweight auto-encoder, based on spiral convolutions, is employed to extract spatial geometrical features from each 3D mesh. These convolutions are lightweight and specifically designed for fixed-topology mesh data. Subsequently, a temporal transformer, based on self-attention, captures the temporal context within the feature sequence. The self-attention mechanism captures long-range dependencies and allows parallel processing, ensuring scalability for long sequences. The proposed method is evaluated on three prominent 3D human action datasets: Babel, MoVi, and BMLrub, from the Archive of Motion Capture As Surface Shapes (AMASS). Our analysis of the results demonstrates the competitive performance of the SpATr model in 3D human action recognition while maintaining efficient memory usage. The code and the training results will soon be made publicly available at https://github.com/h-bouzid/spatr.
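
The space/time split described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it shows only an encoder-like spatial step (no decoder, no training), uses placeholder spiral index lists rather than spirals derived from real mesh connectivity, and applies single-head self-attention with identity query/key/value projections. All function names, sizes, and weights below are hypothetical.

```python
import math
import random

def spiral_conv(verts, spirals, weight):
    """Per-vertex features: concatenate the features along each vertex's
    precomputed spiral, then apply one shared linear map."""
    out = []
    for spiral in spirals:  # one fixed index list per vertex (fixed topology)
        gathered = [x for idx in spiral for x in verts[idx]]
        out.append([sum(w * g for w, g in zip(row, gathered)) for row in weight])
    return out

def encode_frame(verts, spirals, weight):
    """Spatial step: spiral convolution, then mean pooling over vertices."""
    feats = spiral_conv(verts, spirals, weight)
    n = len(feats)
    return [sum(f[d] for f in feats) / n for d in range(len(feats[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Temporal step: scaled dot-product self-attention over frame features.
    Assumption: identity Q/K/V projections, single head, for brevity."""
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        out.append([sum(w * v[d] for w, v in zip(weights, seq))
                    for d in range(len(q))])
    return out

def classify(mesh_sequence, spirals, weight, class_weight):
    """Encode each mesh independently, attend across time, pool, classify."""
    frame_feats = [encode_frame(v, spirals, weight) for v in mesh_sequence]
    attended = self_attention(frame_feats)
    t = len(attended)
    pooled = [sum(f[d] for f in attended) / t for d in range(len(attended[0]))]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in class_weight]
    return softmax(logits)

# Toy data: every size and value here is made up for illustration.
random.seed(0)
NUM_VERTS, SPIRAL_LEN, FEAT_DIM, NUM_FRAMES, NUM_CLASSES = 6, 3, 4, 5, 3
# Placeholder spirals; a real spiral follows the mesh's vertex adjacency.
spirals = [[(v + k) % NUM_VERTS for k in range(SPIRAL_LEN)]
           for v in range(NUM_VERTS)]
weight = [[random.uniform(-1, 1) for _ in range(SPIRAL_LEN * 3)]
          for _ in range(FEAT_DIM)]
class_weight = [[random.uniform(-1, 1) for _ in range(FEAT_DIM)]
                for _ in range(NUM_CLASSES)]
mesh_sequence = [[[random.uniform(-1, 1) for _ in range(3)]
                  for _ in range(NUM_VERTS)]
                 for _ in range(NUM_FRAMES)]
probs = classify(mesh_sequence, spirals, weight, class_weight)
```

Because the same spiral index lists are reused for every frame, the spatial step depends on the mesh topology staying fixed across the sequence, and each frame is reduced to a short feature vector before the temporal step ever sees it.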
