- The paper introduces AS-GCN, which infers action-specific dependencies via an encoder-decoder A-link inference module, capturing latent joint correlations.
- It leverages generalized skeleton graphs with higher-order structural links to effectively model long-range dependencies in 3D joint movements.
- Extensive experiments on NTU-RGB+D and Kinetics demonstrate significant accuracy improvements over prior skeleton-based action recognition methods.
Overview of Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition
Skeleton-based action recognition presents a significant challenge in computer vision, requiring the model to interpret dynamic 3D joint positions effectively. Previous studies often employed fixed skeleton graphs that only capture local physical dependencies among joints, neglecting important latent correlations between physically unconnected joints. The paper introduces Actional-Structural Graph Convolutional Networks (AS-GCN), which combine an actional links (A-links) inference module with generalized skeleton graphs to capture both local and long-range dependencies for action recognition.
Methodology
The methodology presented in this paper hinges on two core components:
- A-link Inference Module (AIM): This module employs an encoder-decoder architecture to infer action-specific dependencies (A-links) directly from skeleton data, capturing implicit joint correlations beyond the physical bone connections. The encoder iteratively propagates information between joints and links to discover these latent dependencies, and the inferred link probabilities are regularized toward sparsity so that only significant dependencies are retained (a minimal sketch follows this list).
- Generalized Skeleton Graphs: The paper extends traditional skeleton graphs to represent higher-order dependencies using structural links (S-links). By considering polynomial orders of the adjacency matrix, the S-links enable the model to capture long-range dependencies that single-hop neighbors might miss.
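To make the A-link inference concrete, below is a minimal, simplified sketch of an encoder that scores pairwise joint dependencies. The class name `ALinkInference`, the layer sizes, and the single-pass pairwise MLP are illustrative assumptions; the paper's AIM alternates several joint-to-link and link-to-joint propagation rounds and samples discrete link types (e.g., with a Gumbel-softmax) under a sparsity-promoting prior.

```python
# Hedged sketch of an A-link inference encoder over per-joint features.
# Shapes, layer sizes, and names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALinkInference(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64, num_link_types=3):
        super().__init__()
        # Embed each ordered pair of joint features (joint-to-link step).
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # Score each pair over a small set of candidate A-link types.
        self.link_head = nn.Linear(hidden_dim, num_link_types)

    def forward(self, joint_feats):            # joint_feats: (V, feat_dim)
        V = joint_feats.size(0)
        src = joint_feats.unsqueeze(1).expand(V, V, -1)
        dst = joint_feats.unsqueeze(0).expand(V, V, -1)
        pair = torch.cat([src, dst], dim=-1)   # (V, V, 2 * feat_dim)
        logits = self.link_head(self.pair_mlp(pair))
        # Soft link probabilities; the paper instead samples link types and
        # trains them jointly with a decoder that reconstructs future poses.
        return F.softmax(logits, dim=-1)       # (V, V, num_link_types)
```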
The AS-GCN architecture integrates these two components through the Actional-Structural Graph Convolution (ASGC), which combines the A-links and S-links to learn more comprehensive spatial features. Temporal features are captured with temporal convolutions applied along the frame axis in each AS-GCN block, ensuring robust spatio-temporal feature learning.
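As a rough illustration of how the generalized graphs enter the convolution, the sketch below builds row-normalized multi-hop structural adjacencies and combines them with an inferred actional graph in a single convolution step. The function names, the normalization scheme, and the weighting factor `lam` are assumptions rather than the paper's exact formulation; the full network additionally stacks temporal convolutions and recognition/prediction heads.

```python
# Hedged sketch of structural-link construction and one actional-structural
# graph convolution step. A: binary skeleton adjacency (V x V);
# X: per-frame joint features (V x C). Names and normalization are assumed.
import numpy as np

def normalized_power(A, p):
    """p-hop structural adjacency: reachability within p hops, row-normalized."""
    Ap = np.linalg.matrix_power(A + np.eye(len(A)), p)
    Ap = (Ap > 0).astype(float)               # keep reachability within p hops
    D = Ap.sum(axis=1, keepdims=True)
    return Ap / np.maximum(D, 1e-6)           # row-normalized propagation

def asgc_step(X, A, A_act, W_struct, W_act, lam=0.5, max_hop=3):
    """One actional-structural graph convolution: structural terms over
    1..max_hop hops plus a weighted actional term from the inferred A-links."""
    out = sum(normalized_power(A, p) @ X @ W_struct[p - 1]
              for p in range(1, max_hop + 1))
    out += lam * (A_act @ X @ W_act)           # A_act: inferred A-link graph
    return out
```

The key design choice this reflects is that multi-hop S-links let a joint aggregate information from distant joints (e.g., hand and foot) in one layer, while the actional term injects dependencies that the fixed skeleton cannot encode at all.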
Experimental Results
The effectiveness of AS-GCN is validated on two large-scale datasets: NTU-RGB+D and Kinetics.
- NTU-RGB+D: On the Cross-Subject benchmark, AS-GCN achieves an accuracy of 86.8%, demonstrating a notable improvement over previous methods such as ST-GCN and SR-TSL. On the Cross-View benchmark, AS-GCN attains 94.2%, again outperforming other state-of-the-art approaches.
- Kinetics: For the Kinetics dataset, the model achieves a top-1 accuracy of 34.8% and a top-5 accuracy of 56.5%, indicating substantial gains compared to existing models like ST-GCN.
Ablation Studies
Several ablation studies are conducted to evaluate the impact of various components. The results indicate that both A-links and higher-order S-links significantly contribute to the performance improvement. The number and prior probabilities of A-links are optimized to ensure the model captures the most relevant action-specific dependencies. Additionally, the inclusion of a future pose prediction head enhances performance by approximately 1%, showcasing the benefits of self-supervision in capturing detailed action patterns.
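One way to picture the self-supervised setup behind this ablation is a weighted sum of the recognition loss and a future-pose regression loss; the weighting factor `beta` and the use of a plain MSE term are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch of joint training with an auxiliary future-pose prediction
# head. `beta` and the MSE form are assumptions, not the paper's exact loss.
import torch.nn.functional as F

def as_gcn_loss(class_logits, labels, predicted_poses, target_poses, beta=0.1):
    recognition = F.cross_entropy(class_logits, labels)      # action classification
    prediction = F.mse_loss(predicted_poses, target_poses)   # future pose regression
    return recognition + beta * prediction
```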
Implications and Future Work
The proposed AS-GCN presents several important implications:
- Improved Action Recognition: By capturing richer dependencies through A-links and S-links, AS-GCN provides a more sophisticated understanding of action-specific patterns, thereby improving recognition accuracy.
- Generalizability to Various Actions: The ability to infer and adapt to various action-specific dependencies makes AS-GCN highly adaptable for different types of actions and datasets.
- Future Pose Prediction: The inclusion of a prediction head not only enhances recognition performance but also paves the way for real-time action anticipation applications.
Future developments could focus on extending the model to more complex scenarios involving multi-human interactions or incorporating additional sensory data to further improve action recognition capabilities. Additionally, integrating AS-GCN with real-time systems could significantly enhance applications in video surveillance, human-machine interaction, and virtual reality.
In conclusion, this paper presents a comprehensive and effective approach to skeleton-based action recognition by combining actional and structural dependencies into a unified model. The AS-GCN framework sets a new benchmark in the field, providing a robust foundation for future advancements in spatio-temporal action recognition.