- The paper introduces EfficientGCN, a streamlined architecture that significantly reduces computational complexity for skeleton-based action recognition.
- It integrates separable convolutions within a multiple input branch design along with compound scaling to optimize performance.
- Empirical results show EfficientGCN-B4 achieves 92.1% accuracy on NTU RGB+D 60 while being over 5 times faster and smaller than MS-G3D.
Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition
The paper "Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition" introduces an efficient approach to improve the performance of Graph Convolutional Networks (GCNs) in the context of skeleton-based human action recognition. This paper addresses the substantial complexity found in many state-of-the-art (SOTA) models that, while effective, tend to be over-parameterized and inefficient, particularly when applied to large-scale datasets like NTU RGB+D 60 and 120.
The primary contribution of the paper is the proposed EfficientGCN family of architectures, developed by applying modern techniques such as separable convolutions, multiple input branches (MIB), and the compound scaling strategy first exemplified by EfficientNet. EfficientGCN prioritizes reduced computational cost and model size while maintaining high accuracy, making it suitable for practical scenarios that require real-time performance.
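To make the separable-convolution idea concrete, the snippet below is a minimal PyTorch sketch of a depthwise-separable temporal convolution, the kind of layer used in place of a standard temporal convolution; the class name, kernel size, and layer ordering are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparableTemporalConv(nn.Module):
    """Depthwise-separable temporal convolution for skeleton sequences.

    Splits a standard (kernel_size x 1) temporal convolution into a
    depthwise pass (one filter per channel) and a 1x1 pointwise pass,
    cutting parameters roughly by a factor of the kernel size.
    Illustrative sketch, not the paper's exact layer.
    """

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        pad = (kernel_size - 1) // 2
        # Depthwise: convolve each channel independently along time.
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=(kernel_size, 1),
            padding=(pad, 0), groups=channels, bias=False,
        )
        # Pointwise: mix channels with a 1x1 convolution.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# Quick shape check on an NTU-style batch: 64 channels, 300 frames, 25 joints.
x = torch.randn(2, 64, 300, 25)
print(SeparableTemporalConv(64)(x).shape)  # torch.Size([2, 64, 300, 25])
```

Splitting a (9 × 1) temporal convolution this way is where most of the savings come from: roughly C² + 9C weights instead of 9C² for a standard convolution with C channels.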
The authors detail the development of EfficientGCN by embedding separable convolutional layers within an MIB structure. The MIB design fuses the multiple input features early in the network, which substantially lowers the number of trainable parameters compared with architectures that fuse late. Furthermore, the authors introduce a compound scaling method that adjusts the model's width and depth in tandem, yielding a family of networks termed EfficientGCN-Bx; this approach effectively balances accuracy and efficiency.
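As a rough illustration of compound scaling, the sketch below derives width and depth together from a single coefficient φ, in the spirit of EfficientNet; the base values and the α/β coefficients here are placeholders, not the values tuned in the paper.

```python
def compound_scale(base_channels: int, base_layers: int, phi: int,
                   alpha: float = 1.2, beta: float = 1.35):
    """EfficientNet-style compound scaling restricted to depth and width.

    phi is the compound coefficient selecting a model in the Bx family;
    alpha and beta are illustrative placeholders, not the paper's values.
    """
    depth = int(round(base_layers * alpha ** phi))    # more layers
    width = int(round(base_channels * beta ** phi))   # wider channels
    return width, depth

for phi in range(5):  # hypothetical B0..B4 variants
    width, depth = compound_scale(64, 4, phi)
    print(f"B{phi}: channels={width}, layers={depth}")
```

The appeal of this scheme is that fixing α and β once and then sweeping φ yields a whole family of models from a single baseline, instead of hand-designing each size.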
Key empirical results demonstrate the advantages of the proposed EfficientGCN family over existing methods. Notably, the EfficientGCN-B4 model achieves 92.1% accuracy on the NTU RGB+D 60 cross-subject benchmark, a marked improvement over previous models, while being approximately 5.82× smaller and 5.85× faster than MS-G3D, a recognized strong baseline in the field.
Further analysis includes ablation studies that show how architectural choices, such as different convolutional layers and attention mechanisms, affect performance. The paper also introduces a new attention module, Spatial Temporal Joint Attention (ST-JointAtt), which efficiently identifies the most informative joints and frames within each action sequence.
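The exact layout of ST-JointAtt is not reproduced here; the following is a simplified sketch of the underlying idea only: pool the feature map along joints to describe frames and along frames to describe joints, score both axes, and reweight the input by the outer product of the two attention maps. The class name and the shared-bottleneck structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STJointAttention(nn.Module):
    """Simplified spatial-temporal joint attention (illustrative sketch).

    Pools features over joints to describe frames and over frames to
    describe joints, scores both with a shared bottleneck, and rescales
    the input by the outer product of the two attention maps.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        inner = channels // reduction
        self.shared = nn.Sequential(
            nn.Conv2d(channels, inner, kernel_size=1),
            nn.BatchNorm2d(inner),
            nn.Hardswish(),
        )
        self.temporal_score = nn.Conv2d(inner, channels, kernel_size=1)
        self.spatial_score = nn.Conv2d(inner, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        frame_desc = x.mean(dim=3, keepdim=True)  # (N, C, T, 1): per-frame summary
        joint_desc = x.mean(dim=2, keepdim=True)  # (N, C, 1, V): per-joint summary
        # A shared bottleneck scores both descriptors.
        t_att = torch.sigmoid(self.temporal_score(self.shared(frame_desc)))
        s_att = torch.sigmoid(self.spatial_score(self.shared(joint_desc)))
        # Broadcasting over (T, V) highlights specific joints at specific times.
        return x * t_att * s_att

x = torch.randn(2, 64, 300, 25)
print(STJointAttention(64)(x).shape)  # torch.Size([2, 64, 300, 25])
```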
The implications of this work are significant for both practical applications and future research directions. Practically, models that maintain accuracy while reducing computational costs can be deployed on devices with restricted resources, including mobile and embedded systems. Theoretically, the methodology demonstrated for reducing over-parameterization could impact future design strategies for neural networks, encouraging more computationally efficient approaches.
In conclusion, EfficientGCN establishes a strong baseline for further advances in skeleton-based action recognition, offering an efficient alternative that does not compromise recognition accuracy. The work contributes to a growing body of research on balancing model simplicity against performance, pointing toward more efficient AI models for computer vision tasks.