Learning Scene Flow With Skeleton Guidance For 3D Action Recognition

Published 23 Jun 2023 in cs.CV and eess.IV | (2306.13285v1)

Abstract: Among the existing modalities for 3D action recognition, 3D flow has received little attention, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise makes it difficult to handle and complicates learning within deep models. This work demonstrates the use of 3D flow sequences with a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, that emphasizes motion features close to the body joint areas according to their informativeness. To this end, an extended deep skeleton model is also introduced to learn the most discriminative action motion dynamics and thereby estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models to learn the high-level cross-modal correlations. Experimental results on NTU RGB+D, currently the largest and most challenging dataset, demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art results.
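The idea of skeleton-guided spatial attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the Gaussian weighting, the `sigma` parameter, the residual-style `1 + mask` scaling, and the function names `joint_attention_mask` and `apply_attention` are all assumptions made for illustration. The sketch assumes joints have already been projected onto the flow grid as pixel coordinates and that each joint has a non-negative informativeness score.

```python
import numpy as np


def joint_attention_mask(joints_xy, scores, height, width, sigma=2.0):
    """Build a spatial attention map from 2D joint positions (illustrative).

    joints_xy: (J, 2) array of (x, y) pixel coordinates of the joints,
               assumed to be already projected onto the flow grid.
    scores:    (J,) non-negative informativeness score per joint.
    sigma:     hypothetical spread of the Gaussian around each joint.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float64)
    for (x, y), s in zip(joints_xy, scores):
        # Level 1: proximity to the joint (Gaussian falloff).
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        # Level 2: scale that proximity by the joint's informativeness.
        mask = np.maximum(mask, s * g)
    return mask


def apply_attention(flow, mask):
    """Emphasize flow vectors near informative joints.

    flow: (H, W, 3) array holding one 3D flow vector per pixel.
    The 1 + mask form keeps the original signal and only amplifies it.
    """
    return flow * (1.0 + mask[..., None])


# Usage with made-up values: two joints on a 16x16 grid.
joints = np.array([[4.0, 5.0], [10.0, 12.0]])
scores = np.array([0.9, 0.3])
flow = np.random.default_rng(0).normal(size=(16, 16, 3))
attended = apply_attention(flow, joint_attention_mask(joints, scores, 16, 16))
```

Taking the maximum over joints, rather than the sum, keeps the mask bounded by the largest score when joints overlap; either choice would serve the illustration.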
