A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation (2312.05830v1)

Published 10 Dec 2023 in cs.CV

Abstract: Effectively modeling discriminative spatio-temporal information is essential for segmenting activities in long action sequences. However, we observe that existing methods are limited by weak spatio-temporal modeling capability due to two forms of coupled modeling: (i) cascaded interaction tightly couples spatial and temporal modeling, which over-smooths motion information over long sequences, and (ii) joint-shared temporal modeling applies the same weights to every joint, ignoring the distinct motion patterns of different joints. We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues. First, we decouple the cascaded spatio-temporal interaction to avoid stacking multiple spatio-temporal blocks while still achieving sufficient interaction. Specifically, DeST performs unified spatial modeling once and divides the spatial features into groups of sub-features, which then adaptively interact with temporal features from different layers. Since the sub-features carry distinct spatial semantics, the model can learn the optimal interaction pattern at each layer. Meanwhile, motivated by the fact that different joints move at different speeds, we propose joint-decoupled temporal modeling, which employs independent trainable weights to capture the distinctive temporal features of each joint. On four large-scale benchmarks covering different scenes, DeST significantly outperforms current state-of-the-art methods with lower computational complexity.
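To make the joint-decoupled temporal modeling idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) showing one way to give each joint its own temporal filters: a grouped 1D convolution with one group per joint, so no weights are shared across joints. The module name, kernel size, and tensor layout (batch, channels, frames, joints) are illustrative assumptions.

```python
# Illustrative sketch only: per-joint temporal convolution via grouped Conv1d.
# Not the DeST reference code; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class JointDecoupledTemporalConv(nn.Module):
    """Temporal convolution with independent weights for every joint.

    Expects input of shape (N, C, T, V): batch, channels, frames, joints.
    Folding joints into the channel axis and setting groups=V gives each
    joint its own set of temporal filters (no weight sharing across joints).
    """

    def __init__(self, channels: int, joints: int, kernel_size: int = 9):
        super().__init__()
        self.joints = joints
        self.conv = nn.Conv1d(
            in_channels=channels * joints,
            out_channels=channels * joints,
            kernel_size=kernel_size,
            padding=kernel_size // 2,   # keep the temporal length unchanged
            groups=joints,              # one independent filter bank per joint
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, v = x.shape
        # (N, C, T, V) -> (N, V*C, T): joint i occupies one contiguous group
        x = x.permute(0, 3, 1, 2).reshape(n, v * c, t)
        x = self.conv(x)
        # Restore (N, C, T, V)
        return x.reshape(n, v, c, t).permute(0, 2, 3, 1)


if __name__ == "__main__":
    # Hypothetical 25-joint skeleton sequence (e.g., NTU-style layout)
    model = JointDecoupledTemporalConv(channels=64, joints=25)
    dummy = torch.randn(2, 64, 100, 25)
    print(model(dummy).shape)  # torch.Size([2, 64, 100, 25])
```

In contrast, a joint-shared design would use a single Conv1d (groups=1 over the joint axis) so every joint is filtered with the same temporal weights; the grouped variant above is the decoupled alternative the abstract argues for.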

Authors (6)
  1. Yunheng Li (9 papers)
  2. Zhongyu Li (72 papers)
  3. Shanghua Gao (20 papers)
  4. Qilong Wang (34 papers)
  5. Qibin Hou (82 papers)
  6. Ming-Ming Cheng (185 papers)
Citations (1)
