
Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence (2401.00921v1)

Published 1 Jan 2024 in cs.CV

Abstract: Self-supervised pre-training paradigms have been extensively explored in the field of skeleton-based action recognition. In particular, methods based on masked prediction have pushed the performance of pre-training to a new height. However, these methods take low-level features, such as raw joint coordinates or temporal motion, as prediction targets for the masked regions, which is suboptimal. In this paper, we show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework, which uses a transformer-based teacher encoder that takes unmasked training samples as input and produces latent contextualized representations as prediction targets. Benefiting from the self-attention mechanism, the latent representations generated by the teacher encoder incorporate the global context of the entire training sample, leading to a richer training task. Additionally, considering the high temporal correlations in skeleton sequences, we propose a motion-aware tube masking strategy that divides the skeleton sequence into several tubes and performs persistent masking within each tube based on motion priors, forcing the model to build long-range spatio-temporal connections and to focus on regions richer in action semantics. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets demonstrate that our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
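The teacher–student target scheme the abstract describes (in the spirit of data2vec) can be sketched in miniature. The code below is an illustrative toy, not the paper's implementation: a single linear map stands in for the transformer encoders, a learnable mask token replaces masked frames, the loss is a plain MSE rather than the smooth L1 such methods often use, and every name (`student_W`, `step`, etc.) is invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a single linear map replaces the transformer encoders,
# and a learnable "mask token" replaces masked skeleton frames.
DIM_IN, DIM_OUT = 9, 16          # e.g. 3 joints x (x, y, z) per frame
student_W = rng.normal(size=(DIM_IN, DIM_OUT)) * 0.1
teacher_W = student_W.copy()     # teacher initialized from the student
mask_token = rng.normal(size=DIM_IN) * 0.1

def ema_update(tw, sw, tau=0.999):
    # Teacher weights track the student via an exponential moving average.
    return tau * tw + (1.0 - tau) * sw

def step(x, mask, lr=0.1):
    global student_W, teacher_W
    # 1. Teacher encodes the full, unmasked sequence -> contextualized targets.
    targets = x @ teacher_W
    # 2. Student sees the sequence with masked frames replaced by the token.
    x_in = np.where(mask[:, None], mask_token, x)
    preds = x_in @ student_W
    # 3. Regress the teacher's latents at the masked positions only.
    diff = (preds - targets)[mask]
    loss = float(np.mean(diff ** 2))
    # Manual MSE gradient w.r.t. student_W (valid for this linear toy).
    grad = 2.0 * x_in[mask].T @ diff / diff.size
    student_W -= lr * grad
    teacher_W = ema_update(teacher_W, student_W)
    return loss

T = 32                            # frames in the toy "skeleton sequence"
x = rng.normal(size=(T, DIM_IN))
mask = np.zeros(T, dtype=bool)
mask[rng.choice(T, size=T // 2, replace=False)] = True  # mask half the frames
losses = [step(x, mask) for _ in range(300)]
```

The key structural points survive even in this toy: the teacher never sees masked input, the student is trained only where information was removed, and the teacher is updated by EMA rather than by gradients.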
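The motion-aware tube masking strategy can likewise be sketched: split the sequence into fixed-length temporal tubes and, within each tube, persistently mask the same set of joints, sampled with probability proportional to their motion magnitude. This is a plausible reading of the abstract, not the authors' exact algorithm; `motion_aware_tube_mask` and its parameters are hypothetical names for the sketch.

```python
import numpy as np

def motion_aware_tube_mask(seq, tube_len=8, mask_ratio=0.5, rng=None):
    """Persistently mask joints within temporal tubes, biased toward
    high-motion joints. seq: (T, J, C) joint coordinates."""
    rng = np.random.default_rng(rng)
    T, J, _ = seq.shape
    mask = np.zeros((T, J), dtype=bool)
    # Motion prior: per-joint magnitude of frame-to-frame displacement.
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1)  # (T-1, J)
    motion = np.concatenate([motion[:1], motion], axis=0)   # pad to (T, J)
    n_mask = int(round(mask_ratio * J))
    for start in range(0, T, tube_len):
        end = min(start + tube_len, T)
        # Sampling probability proportional to mean motion inside this tube.
        w = motion[start:end].mean(axis=0) + 1e-6
        joints = rng.choice(J, size=n_mask, replace=False, p=w / w.sum())
        # Persistent masking: the same joints stay masked for every frame
        # in the tube, so nearby frames cannot leak the answer.
        mask[start:end, joints] = True
    return mask
```

Because the mask is constant across each tube, the model cannot reconstruct a masked joint by copying it from an adjacent frame, which is the point of tube masking for temporally redundant skeleton data.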

Authors (5)
  1. Ruizhuo Xu (3 papers)
  2. Linzhi Huang (7 papers)
  3. Mei Wang (41 papers)
  4. Jiani Hu (13 papers)
  5. Weihong Deng (71 papers)
Citations (1)
