Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence (2401.00921v1)
Abstract: Self-supervised pre-training paradigms have been extensively explored in skeleton-based action recognition. In particular, methods based on masked prediction have pushed pre-training performance to new heights. However, these methods take low-level features, such as raw joint coordinates or temporal motion, as prediction targets for the masked regions, which is suboptimal. In this paper, we show that using high-level contextualized features as prediction targets achieves superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework that uses a transformer-based teacher encoder, fed unmasked training samples, to produce latent contextualized representations as prediction targets. Thanks to the self-attention mechanism, the latent representations generated by the teacher encoder incorporate the global context of each training sample, yielding a richer training task. Additionally, given the strong temporal correlations in skeleton sequences, we propose a motion-aware tube masking strategy that divides the skeleton sequence into several tubes and performs persistent masking within each tube based on motion priors, forcing the model to build long-range spatio-temporal connections and to focus on regions richer in action semantics. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets demonstrate that Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
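The two core ingredients described in the abstract, motion-aware tube masking and contextualized teacher targets, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, tube length, mask ratio, and the motion/uniform blend weight `alpha` are all illustrative assumptions, and the target construction follows the general data2vec-style recipe (EMA teacher, normalized average of the top-K layer outputs) that the paper builds on.

```python
import numpy as np

def motion_aware_tube_mask(seq, tube_len=4, mask_ratio=0.75, alpha=0.8, rng=None):
    """Sketch of motion-aware tube masking (hyperparameters are assumptions).

    seq: skeleton sequence of shape (T, V, C) -- T frames, V joints, C coords.
    Returns a boolean mask of shape (T, V), True = masked. The sequence is
    split into tubes of `tube_len` frames; within each tube the same joint
    set is masked, sampled with probability biased toward high-motion joints.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, V, _ = seq.shape
    # Per-joint motion prior: mean L2 norm of frame-to-frame differences.
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1).mean(axis=0)  # (V,)
    probs = motion / (motion.sum() + 1e-8)
    # Blend the motion prior with a uniform distribution for stability.
    probs = alpha * probs + (1.0 - alpha) / V
    n_mask = int(round(mask_ratio * V))
    mask = np.zeros((T, V), dtype=bool)
    for start in range(0, T, tube_len):
        # Persistent masking: one joint set per tube, shared by all its frames.
        idx = rng.choice(V, size=n_mask, replace=False, p=probs)
        mask[start:start + tube_len, idx] = True
    return mask

def ema_update(teacher_params, student_params, tau=0.999):
    """EMA teacher update (data2vec-style; tau is a typical, assumed value)."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

def contextualized_targets(layer_outputs, top_k=3):
    """Build prediction targets from the teacher's top_k layer outputs,
    each normalized over the time axis, then averaged (data2vec recipe)."""
    feats = []
    for h in layer_outputs[-top_k:]:          # each h: (T, D)
        mu = h.mean(axis=0, keepdims=True)
        sd = h.std(axis=0, keepdims=True) + 1e-6
        feats.append((h - mu) / sd)
    return np.mean(feats, axis=0)             # (T, D) regression targets
```

In this sketch the student would encode the masked sequence and regress, at masked positions only, onto the targets produced by the EMA teacher from the unmasked input.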
Authors: Ruizhuo Xu, Linzhi Huang, Mei Wang, Jiani Hu, Weihong Deng