Overview of "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training"
The paper presents SkeletonMAE, a novel framework for pre-training skeleton sequence learning models, particularly valuable for action recognition from human skeletons. The authors aim to address the limitations of current skeleton-based action recognition methods, which typically rely on large amounts of labeled data and are computationally intensive. Moreover, these existing methods often overlook the fine-grained dependencies between skeleton joints, which are crucial for learning representations that transfer across datasets.
Core Contributions
The proposed framework, Skeleton Sequence Learning (SSL), incorporates an asymmetric graph-based encoder-decoder architecture named SkeletonMAE. This approach leverages the structure of skeleton joints by embedding skeleton sequences with Graph Convolutional Networks (GCNs) and reconstructing masked joints and edges using prior knowledge of human body topology. The graph-based embeddings capture more nuanced spatial and temporal dynamics, which are essential for accurately modeling human actions.
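To make the graph-based embedding concrete, the sketch below shows a single GCN propagation step over a toy skeleton graph. The joint layout, edges, and features here are illustrative assumptions, not the paper's actual graph or implementation; the point is only that each joint's new feature aggregates its anatomical neighbors through a normalized adjacency matrix.

```python
# Minimal sketch of one GCN propagation step over a skeleton graph.
# The 5-joint "skeleton" and its features are toy values, not the paper's setup.

# 0=torso, 1=left arm, 2=right arm, 3=left leg, 4=right leg
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
num_joints = 5

# Adjacency with self-loops (A + I), as is standard for GCNs
adj = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1.0

# Row-normalize (D^-1 (A + I)) so each joint averages over its neighborhood
for row in adj:
    deg = sum(row)
    for j in range(num_joints):
        row[j] /= deg

# Per-joint features, e.g. 2-D coordinates at a single time step
feats = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.5, -1.0], [-0.5, -1.0]]

# Propagation: each joint's new feature is a degree-weighted neighbor average
new_feats = [
    [sum(adj[i][k] * feats[k][d] for k in range(num_joints)) for d in range(2)]
    for i in range(num_joints)
]
print(new_feats[0])  # the torso aggregates all four limb roots
```

Stacking such layers (with learned weight matrices and nonlinearities between them) and applying them per frame is the usual way GCN-based skeleton models build up joint representations before temporal modeling.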
SkeletonMAE serves as a pre-training mechanism where its encoder, once trained, is integrated with a Spatial-Temporal Representation Learning (STRL) module, forming the foundation of the SSL framework. Experimental evaluations show that SSL not only generalizes well across different datasets but also surpasses state-of-the-art self-supervised skeleton-based action recognition methods on benchmarks such as FineGym, Diving48, NTU 60, and NTU 120.
Technical Approach
The authors present several technical innovations:
- Asymmetric Graph-based Encoder-Decoder: The asymmetric design in SkeletonMAE enables efficient pre-training by using a deeper encoder to learn a rich representation while maintaining a lightweight decoder for reconstruction. This design is tailored to reconstruct masked portions of skeleton data informed by human anatomical knowledge.
- Action-sensitive Masking Strategy: Unlike traditional methods that use random masking, SkeletonMAE employs a strategy sensitive to actions, reconstructing limbs or body parts essential for differentiating given action classes. This nuanced masking yields models capable of capturing fine-grained dependencies between joint movements.
- Integration with STRL: By merging the pre-trained SkeletonMAE encoder with STRL, the framework can exploit spatial-temporal dependencies more effectively, yielding a compact and discriminative feature representation for downstream tasks.
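The body-part-level masking idea can be caricatured in a few lines. The part grouping and the hard-coded part choice below are illustrative assumptions, not the paper's actual action-sensitive selection rule; they only show the mechanics of masking a limb across all frames and recording which joints the decoder must reconstruct.

```python
# Toy sketch of body-part-level masking (illustrative only; the paper's
# action-sensitive strategy selects parts informed by anatomical priors,
# rather than a fixed, hard-coded choice as here).

# Hypothetical grouping of 10 joints into anatomical parts
body_parts = {
    "left_arm": [1, 2],
    "right_arm": [3, 4],
    "left_leg": [5, 6],
    "right_leg": [7, 8],
    "torso": [0, 9],
}

def mask_part(sequence, part):
    """Zero out every joint of one body part across all frames.

    sequence: list of frames; each frame is a list of per-joint feature vectors.
    Returns (masked_sequence, masked_joint_indices); the autoencoder's target
    is to reconstruct the original features at those indices.
    """
    joints = set(body_parts[part])
    masked = [
        [[0.0] * len(f) if j in joints else list(f) for j, f in enumerate(frame)]
        for frame in sequence
    ]
    return masked, sorted(joints)

# Example: 2 frames, 10 joints, 3-D coordinates each
seq = [[[float(j), 0.0, 0.0] for j in range(10)] for _ in range(2)]
masked_seq, targets = mask_part(seq, "left_arm")
print(targets)           # [1, 2]
print(masked_seq[0][1])  # [0.0, 0.0, 0.0]
```

Masking a whole limb rather than random joints forces the encoder to infer the missing part from the rest of the body, which is exactly the kind of fine-grained inter-joint dependency the pre-training stage is meant to capture.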
Experimental Results
The experimental evaluation across multiple benchmark datasets demonstrates the efficacy of SkeletonMAE. The proposed framework significantly outperforms existing methods in several scenarios:
- Evaluations on FineGym and Diving48 show improved accuracy over state-of-the-art methods, indicating that the framework is robust to fine-grained actions and highly variable motion dynamics.
- On NTU 60 and NTU 120 datasets, the SSL framework surpasses other self-supervised approaches, highlighting its superior generalization capacity across both cross-subject and cross-view settings.
Implications and Future Directions
The introduction of SkeletonMAE marks a substantive advance in self-supervised learning for skeleton sequences. By focusing on graph-based representations and action-specific masking, this work paves the way for more generalized and transferable models in action recognition. Future research could explore extending this approach to accommodate other modalities or integrating SkeletonMAE into multi-view learning frameworks to further enhance robustness and accuracy in various application domains. Additionally, exploring different architectural designs for the encoder-decoder and improving the computational efficiency of such models could broaden their applicability in real-world scenarios where computational resources may be limited.