Action Segmentation Using 2D Skeleton Heatmaps and Multi-Modality Fusion (2309.06462v3)

Published 12 Sep 2023 in cs.CV

Abstract: This paper presents a 2D skeleton-based action segmentation method with applications in fine-grained human activity recognition. In contrast with state-of-the-art methods, which directly take sequences of 3D skeleton coordinates as inputs and apply Graph Convolutional Networks (GCNs) for spatiotemporal feature learning, our main idea is to use sequences of 2D skeleton heatmaps as inputs and employ Temporal Convolutional Networks (TCNs) to extract spatiotemporal features. Despite lacking 3D information, our approach yields comparable or superior performance and better robustness against missing keypoints than previous methods on action segmentation datasets. Moreover, we further improve performance by using both 2D skeleton heatmaps and RGB videos as inputs. To the best of our knowledge, this is the first work to utilize 2D skeleton heatmap inputs and the first to explore 2D skeleton+RGB fusion for action segmentation.
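The abstract describes two main ideas: rendering 2D keypoints as heatmap sequences instead of feeding raw coordinates, and modeling time with a TCN, optionally fused with RGB features. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration, under assumed shapes and hyperparameters, of how per-joint Gaussian heatmaps could be generated and how a single-stage dilated TCN could produce frame-wise action logits from pooled heatmap features concatenated with placeholder RGB features. The pooling step, feature dimensions, and module names are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (not the authors' code) of 2D keypoint heatmaps + a dilated TCN
# for frame-wise action segmentation, with simple late fusion of RGB features.
import torch
import torch.nn as nn


def keypoints_to_heatmaps(kpts, h=64, w=64, sigma=2.0):
    """kpts: (T, J, 2) keypoint coordinates normalized to [0, 1].
    Returns (T, J, h, w) Gaussian heatmaps, one channel per joint."""
    T, J, _ = kpts.shape
    ys = torch.arange(h).view(1, 1, h, 1).float()
    xs = torch.arange(w).view(1, 1, 1, w).float()
    cx = (kpts[..., 0] * (w - 1)).view(T, J, 1, 1)
    cy = (kpts[..., 1] * (h - 1)).view(T, J, 1, 1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


class DilatedTCN(nn.Module):
    """Single-stage TCN: stacked dilated 1D convolutions over time,
    followed by a per-frame action classifier."""

    def __init__(self, in_dim, num_classes, hidden=64, layers=8):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, hidden, 1)
        self.blocks = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
             for i in range(layers)]
        )
        self.head = nn.Conv1d(hidden, num_classes, 1)

    def forward(self, x):                     # x: (B, in_dim, T)
        x = self.proj(x)
        for conv in self.blocks:
            x = x + torch.relu(conv(x))       # residual dilated block
        return self.head(x)                   # (B, num_classes, T) frame-wise logits


# Toy usage: heatmap features (here simply spatially pooled per joint) are
# concatenated with hypothetical precomputed RGB features before the TCN.
T, J, num_classes = 100, 17, 10
heatmaps = keypoints_to_heatmaps(torch.rand(T, J, 2))        # (T, J, 64, 64)
skel_feat = heatmaps.flatten(2).mean(-1).t().unsqueeze(0)    # (1, J, T)
rgb_feat = torch.randn(1, 2048, T)                           # placeholder RGB features
fused = torch.cat([skel_feat, rgb_feat], dim=1)              # (1, J + 2048, T)
logits = DilatedTCN(in_dim=J + 2048, num_classes=num_classes)(fused)
print(logits.shape)                                          # torch.Size([1, 10, 100])
```

In practice, the per-frame features would come from a learned spatial encoder over the heatmaps rather than naive pooling; the sketch only illustrates the heatmap-input and TCN-over-time structure the abstract refers to.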
