- The paper introduces an efficient baseline that leverages Multiple Input Branches, a Residual GCN with a bottleneck structure, and a Part-wise Attention block.
- It significantly reduces parameters while achieving state-of-the-art accuracy on NTU RGB+D datasets.
- The approach improves model explainability by capturing spatial dependencies among skeletal parts, enabling practical applications like surveillance and HCI.
An Analysis of "Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition"
Yi-Fan Song and colleagues present a significant contribution to skeleton-based action recognition through an innovative use of Graph Convolutional Networks (GCNs). The paper highlights a central challenge in this task: state-of-the-art models tend to be sophisticated and over-parameterized, making them inefficient to train and run, especially on large-scale action datasets.
The authors introduce a potent baseline model characterized by three key improvements: Multiple Input Branches (MIB), Residual GCN (ResGCN) with a bottleneck structure, and a Part-wise Attention (PartAtt) block. These components collectively enhance the model's performance while significantly reducing parameter requirements.
Methodology
- Multiple Input Branches (MIB): This architecture performs early fusion of input features from multiple branches. These branches extract features such as joint positions, bone features, and motion velocities from skeleton data, allowing the model to maintain compact yet informative representations. The approach mitigates the complexity inherent to multi-stream GCN models.
- Residual GCN with Bottleneck Structure: Drawing inspiration from the ResNet architecture, the ResGCN utilizes residual connections to streamline model training and incorporate a bottleneck design to limit computational overhead. This architecture substantially lowers parameter count and expedites convergence relative to other methodologies.
- Part-wise Attention (PartAtt) Block: This novel attention mechanism discerns essential body parts across entire action sequences. Unlike prior methods that focus on joint-wise attention, the PartAtt block capitalizes on the spatial dependencies among skeletal parts, leading to more explainable model outputs.
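The MIB idea described above can be sketched in a few lines: starting from raw joint coordinates, the bone and velocity branches are simple differences along the skeleton's edges and along time, and early fusion concatenates them channel-wise. The array shapes and edge list below are illustrative assumptions, not the NTU RGB+D topology.

```python
import numpy as np

# Hypothetical skeleton: T frames, V joints, 3-D coordinates.
# The edge list is illustrative, not the actual NTU RGB+D topology.
T, V = 4, 5
edges = [(1, 0), (2, 1), (3, 1), (4, 0)]  # (child, parent) pairs

joints = np.random.rand(T, V, 3)          # branch 1: raw joint positions

# branch 2: bone vectors (child joint minus its parent joint)
bones = np.zeros_like(joints)
for child, parent in edges:
    bones[:, child] = joints[:, child] - joints[:, parent]

# branch 3: motion velocities (frame-to-frame displacement)
velocities = np.zeros_like(joints)
velocities[1:] = joints[1:] - joints[:-1]

# early fusion: concatenate the branch features along the channel axis
fused = np.concatenate([joints, bones, velocities], axis=-1)  # (T, V, 9)
```

Because fusion happens at the input stage, a single network consumes the fused tensor, avoiding the cost of running one full GCN stream per modality.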
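The bottleneck residual design can be illustrated with a minimal numpy sketch: channels are reduced before the graph convolution, expanded afterwards, and a skip connection adds the input back. The weight names and reduction ratio are assumptions for illustration; the paper's actual blocks also operate along the temporal dimension.

```python
import numpy as np

def bottleneck_resgcn_block(x, A, W_in, W_graph, W_out):
    """Sketch of a residual graph-conv block with a bottleneck.

    x: (V, C) node features; A: (V, V) normalized adjacency.
    W_in reduces the channel width, W_graph mixes features after
    graph aggregation, W_out restores the original width.
    """
    h = np.maximum(x @ W_in, 0)         # 1) reduce: C -> C // r
    h = np.maximum(A @ h @ W_graph, 0)  # 2) graph convolution at low width
    h = h @ W_out                       # 3) expand: C // r -> C
    return np.maximum(x + h, 0)         # 4) residual connection + ReLU

V, C, r = 5, 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((V, C))
A = np.eye(V)  # identity adjacency keeps the sketch self-contained
out = bottleneck_resgcn_block(
    x, A,
    rng.standard_normal((C, C // r)),
    rng.standard_normal((C // r, C // r)),
    rng.standard_normal((C // r, C)),
)
```

The parameter saving comes from step 2: the expensive graph convolution operates on `C // r` channels instead of `C`, while the residual path keeps gradients flowing at full width.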
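A rough sketch of part-wise attention, under simplifying assumptions: joints are grouped into body parts, each part's features are pooled over the whole sequence, scored, and normalized with a softmax so the model weights whole parts rather than individual joints. The part grouping and the scoring vector here are hypothetical; the paper partitions the NTU skeleton into its own set of parts and learns the scoring.

```python
import numpy as np

# Hypothetical grouping of V = 5 joints into body parts; the paper
# partitions the NTU skeleton differently (torso, arms, legs, ...).
parts = {"torso": [0, 1], "left": [2, 3], "right": [4]}

T, V, C = 4, 5, 8
rng = np.random.default_rng(1)
feats = rng.standard_normal((T, V, C))
w = rng.standard_normal(C)  # illustrative (normally learned) scoring vector

# Pool each part over the whole sequence, score it, softmax across parts.
scores = np.array(
    [feats[:, idx].mean(axis=(0, 1)) @ w for idx in parts.values()]
)
attn = np.exp(scores - scores.max())
attn /= attn.sum()

# Reweight every joint by the attention weight of the part it belongs to.
out = feats.copy()
for a, idx in zip(attn, parts.values()):
    out[:, idx] *= a
```

Because the attention weights live at the part level and span the entire sequence, inspecting them directly shows which body parts the model deemed important for an action, which is the source of the explainability claim.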
Empirical Evaluation
The proposed model is evaluated extensively on the NTU RGB+D 60 and 120 datasets. The results show accuracy comparable to or better than existing methods with notably fewer parameters: the baseline ResGCN uses up to 34 times fewer parameters than the Directed Graph Neural Network (DGNN), a substantial gain in model efficiency.
Interestingly, on the NTU RGB+D 120 dataset, the PA-ResGCN baseline achieves state-of-the-art results, surpassing heavier models in accuracy while maintaining competitive inference speed.
Implications and Future Directions
This research substantially contributes to computer vision, setting a precedent for efficient, explainable models applicable to real-world scenarios such as surveillance and human-computer interaction. The PartAtt mechanism offers new avenues for enhancing model interpretability in GCN frameworks. Future work could extend the model with object appearance information, potentially improving its ability to differentiate highly similar actions.
Given the methodologies and results presented by Song et al., this work lays a foundation that future studies could leverage to further refine and augment skeleton-based action recognition systems, especially in scenarios demanding rapid and resource-efficient processing.