- The paper introduces EfficientGCN, a streamlined architecture that significantly reduces computational complexity for skeleton-based action recognition.
- It integrates separable convolutions within a multiple input branch design along with compound scaling to optimize performance.
- Empirical results show EfficientGCN-B4 achieves 92.1% accuracy on NTU RGB+D 60 while being over 5 times faster and smaller than MS-G3D.
Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition
The paper "Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition" introduces an efficient approach to improve the performance of Graph Convolutional Networks (GCNs) in the context of skeleton-based human action recognition. This paper addresses the substantial complexity found in many state-of-the-art (SOTA) models that, while effective, tend to be over-parameterized and inefficient, particularly when applied to large-scale datasets like NTU RGB+D 60 and 120.
The primary contribution of the paper is the proposed EfficientGCN family of architectures, developed by applying modern techniques such as separable convolutions, multiple input branches (MIB), and the compound scaling strategy first exemplified by EfficientNet. EfficientGCN prioritizes reduced computational cost and model size while maintaining high accuracy, making it suitable for practical scenarios that require real-time performance.
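To make the separable-convolution idea concrete, the snippet below is a minimal PyTorch sketch of a depthwise-separable temporal convolution, the kind of layer used in place of a standard temporal convolution; the class name, kernel size, and layer ordering are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparableTemporalConv(nn.Module):
    """Depthwise-separable temporal convolution for skeleton sequences.

    Splits a standard (kernel_size x 1) temporal convolution into a
    depthwise pass (one filter per channel) and a 1x1 pointwise pass,
    cutting parameters roughly by a factor of the kernel size.
    Illustrative sketch, not the paper's exact layer.
    """

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        pad = (kernel_size - 1) // 2
        # Depthwise: convolve each channel independently along time.
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=(kernel_size, 1),
            padding=(pad, 0), groups=channels, bias=False,
        )
        # Pointwise: mix channels with a 1x1 convolution.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# Quick shape check on an NTU-style batch: 64 channels, 300 frames, 25 joints.
x = torch.randn(2, 64, 300, 25)
print(SeparableTemporalConv(64)(x).shape)  # torch.Size([2, 64, 300, 25])
```

Splitting a (9 × 1) temporal convolution this way is where most of the savings come from: roughly C² + 9C weights instead of 9C² for a standard convolution with C channels.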
The authors detail the development of EfficientGCN by embedding separable convolutional layers within an MIB structure. The MIB design fuses the multiple input features early in the network, which substantially lowers the number of trainable parameters compared with architectures that fuse late. Furthermore, the authors introduce a compound scaling method that adjusts the model's width and depth in tandem, yielding a family of networks termed EfficientGCN-Bx; this approach effectively balances accuracy and efficiency.
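As a rough illustration of compound scaling, the sketch below derives width and depth together from a single coefficient φ, in the spirit of EfficientNet; the base values and the α/β coefficients here are placeholders, not the values tuned in the paper.

```python
def compound_scale(base_channels: int, base_layers: int, phi: int,
                   alpha: float = 1.2, beta: float = 1.35):
    """EfficientNet-style compound scaling restricted to depth and width.

    phi is the compound coefficient selecting a model in the Bx family;
    alpha and beta are illustrative placeholders, not the paper's values.
    """
    depth = int(round(base_layers * alpha ** phi))    # more layers
    width = int(round(base_channels * beta ** phi))   # wider channels
    return width, depth

for phi in range(5):  # hypothetical B0..B4 variants
    width, depth = compound_scale(64, 4, phi)
    print(f"B{phi}: channels={width}, layers={depth}")
```

The appeal of this scheme is that fixing α and β once and then sweeping φ yields a whole family of models from a single baseline, instead of hand-designing each size.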
Key empirical results demonstrate the advantages of the proposed EfficientGCN family over existing methods. Notably, the EfficientGCN-B4 model achieves 92.1% accuracy on the NTU RGB+D 60 cross-subject benchmark, a marked improvement over previous models, while being approximately 5.82× smaller and 5.85× faster than MS-G3D, a recognized strong baseline in the field.
Further analysis includes ablation studies that show how architectural choices, such as different convolutional layers and attention mechanisms, affect performance. The paper also introduces a new attention module, Spatial Temporal Joint Attention (ST-JointAtt), which efficiently identifies the most informative joints and frames within each action sequence.
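The exact layout of ST-JointAtt is not reproduced here; the following is a simplified sketch of the underlying idea only: pool the feature map along joints to describe frames and along frames to describe joints, score both axes, and reweight the input by the outer product of the two attention maps. The class name and the shared-bottleneck structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STJointAttention(nn.Module):
    """Simplified spatial-temporal joint attention (illustrative sketch).

    Pools features over joints to describe frames and over frames to
    describe joints, scores both with a shared bottleneck, and rescales
    the input by the outer product of the two attention maps.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        inner = channels // reduction
        self.shared = nn.Sequential(
            nn.Conv2d(channels, inner, kernel_size=1),
            nn.BatchNorm2d(inner),
            nn.Hardswish(),
        )
        self.temporal_score = nn.Conv2d(inner, channels, kernel_size=1)
        self.spatial_score = nn.Conv2d(inner, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        frame_desc = x.mean(dim=3, keepdim=True)  # (N, C, T, 1): per-frame summary
        joint_desc = x.mean(dim=2, keepdim=True)  # (N, C, 1, V): per-joint summary
        # A shared bottleneck scores both descriptors.
        t_att = torch.sigmoid(self.temporal_score(self.shared(frame_desc)))
        s_att = torch.sigmoid(self.spatial_score(self.shared(joint_desc)))
        # Broadcasting over (T, V) highlights specific joints at specific times.
        return x * t_att * s_att

x = torch.randn(2, 64, 300, 25)
print(STJointAttention(64)(x).shape)  # torch.Size([2, 64, 300, 25])
```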
The implications of this work are significant for both practical applications and future research directions. Practically, models that maintain accuracy while reducing computational costs can be deployed on devices with restricted resources, including mobile and embedded systems. Theoretically, the methodology demonstrated for reducing over-parameterization could impact future design strategies for neural networks, encouraging more computationally efficient approaches.
In conclusion, EfficientGCN establishes a strong baseline for further advances in skeleton-based action recognition, offering an efficient alternative that does not compromise recognition accuracy. The work contributes to a growing body of research on balancing model simplicity against performance, pointing toward more efficient AI models for computer vision tasks.