Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks (1912.06971v1)

Published 15 Dec 2019 in cs.CV

Abstract: Graph convolutional networks (GCNs), which generalize CNNs to more generic non-Euclidean structures, have achieved remarkable performance for skeleton-based action recognition. However, there still exist several issues in the previous GCN-based models. First, the topology of the graph is set heuristically and fixed over all the model layers and input data. This may not be suitable for the hierarchy of the GCN model and the diversity of the data in action recognition tasks. Second, the second-order information of the skeleton data, i.e., the length and orientation of the bones, is rarely investigated, which is naturally more informative and discriminative for the human action recognition. In this work, we propose a novel multi-stream attention-enhanced adaptive graph convolutional neural network (MS-AAGCN) for skeleton-based action recognition. The graph topology in our model can be either uniformly or individually learned based on the input data in an end-to-end manner. This data-driven approach increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Besides, the proposed adaptive graph convolutional layer is further enhanced by a spatial-temporal-channel attention module, which helps the model pay more attention to important joints, frames and features. Moreover, the information of both the joints and bones, together with their motion information, are simultaneously modeled in a multi-stream framework, which shows notable improvement for the recognition accuracy. Extensive experiments on the two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art with a significant margin.

Citations (384)

Summary

  • The paper presents a novel adaptive graph learning approach that automatically optimizes graph topologies per layer and sample.
  • It introduces a Spatial-Temporal-Channel attention module to highlight key joints, frames, and features for improved accuracy.
  • The multi-stream framework integrates joint, bone, and motion data, yielding significant performance gains on benchmark datasets.

Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks

The paper titled "Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks" presents a sophisticated enhancement to the use of Graph Convolutional Networks (GCNs) for skeleton-based action recognition. The authors address several limitations present in earlier GCN models and propose a novel approach, the Multi-Stream Attention-Enhanced Adaptive Graph Convolutional Network (MS-AAGCN), designed to overcome these challenges.

Key Contributions

  1. Adaptive Graph Learning: Traditional methods often use a fixed, heuristic graph topology that doesn't adapt to different layers or diverse data inputs. This paper proposes an adaptive graph convolutional layer that dynamically learns and optimizes the graph topology in an end-to-end manner, enhancing the model's flexibility and effectiveness.
  2. Attention Mechanisms: The incorporation of a Spatial-Temporal-Channel (STC) attention module in the adaptive graph convolutional layers allows the model to focus on crucial joints, frames, and features, thereby improving action recognition accuracy.
  3. Multi-Stream Framework: The paper integrates first-order (joint) and second-order (bone) information, together with their motion counterparts, into a multi-stream framework, which markedly improves recognition accuracy across different actions; a data-preparation sketch follows this list.
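
To make the multi-stream idea concrete, the sketch below shows one plausible way to derive the bone and motion streams from raw joint coordinates and to fuse the per-stream class scores. The toy parent-index table and the function names are illustrative assumptions, not the authors' exact preprocessing code.

```python
import numpy as np

# Illustrative parent index per joint for a toy 5-joint chain skeleton;
# a real dataset (e.g. NTU-RGBD) would use its own kinematic tree.
PARENTS = [0, 0, 1, 2, 3]

def joints_to_bones(joints, parents=PARENTS):
    """joints: (C, T, V) coordinates; bone at joint v = joint v minus its parent."""
    bones = np.zeros_like(joints)
    for v, p in enumerate(parents):
        bones[:, :, v] = joints[:, :, v] - joints[:, :, p]
    return bones

def to_motion(x):
    """Temporal difference between consecutive frames (last frame left as zeros)."""
    motion = np.zeros_like(x)
    motion[:, :-1] = x[:, 1:] - x[:, :-1]
    return motion

def fuse_scores(per_stream_scores, weights=None):
    """Late fusion: weighted sum of the class scores produced by each stream."""
    weights = weights or [1.0] * len(per_stream_scores)
    return sum(w * s for w, s in zip(weights, per_stream_scores))

# Four streams fed to separate networks: joint, bone, joint motion, bone motion.
# joints = np.random.randn(3, 300, 5)  # (C, T, V) toy example
# streams = [joints, joints_to_bones(joints),
#            to_motion(joints), to_motion(joints_to_bones(joints))]
```

In this late-fusion scheme each stream is trained separately and the resulting scores are summed (optionally weighted) at test time, which mirrors the multi-stream evaluation described in the paper.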

Methodology Overview

The proposed MS-AAGCN framework automatically learns graph topologies tailored to the action recognition task. Each adaptive graph convolutional layer combines two learned graph types: a global graph learned from the whole dataset and shared across samples, and an individual graph computed from the feature similarity of each input sample. Because both are optimized end-to-end and per layer, the effective topology is dynamic and data-dependent.
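
A minimal PyTorch sketch of such an adaptive graph convolution is given below. It assumes a decomposition into a fixed skeleton graph A_k, a freely learned global graph B_k, and a per-sample graph C_k obtained from embedded feature similarity; the layer names, embedding width, and gating parameter alpha are illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Sketch of an adaptive graph convolution: for each partition k, aggregate
    features with A_k (fixed skeleton graph) + B_k (learned global graph)
    + alpha * C_k (sample-dependent graph), then mix channels with a 1x1 conv."""

    def __init__(self, in_channels, out_channels, A, embed_channels=16):
        super().__init__()
        self.register_buffer("A", A)                   # (K, V, V) fixed partitions
        K, V, _ = A.shape
        self.B = nn.Parameter(torch.zeros(K, V, V))    # learned global graph
        self.alpha = nn.Parameter(torch.zeros(1))      # gate for the individual graph
        self.theta = nn.ModuleList([nn.Conv2d(in_channels, embed_channels, 1) for _ in range(K)])
        self.phi = nn.ModuleList([nn.Conv2d(in_channels, embed_channels, 1) for _ in range(K)])
        self.conv = nn.ModuleList([nn.Conv2d(in_channels, out_channels, 1) for _ in range(K)])

    def forward(self, x):
        # x: (N, C, T, V) skeleton feature maps.
        N, C, T, V = x.shape
        out = 0
        for k in range(self.A.shape[0]):
            # Individual graph C_k: softmax-normalised similarity of embedded joints.
            q = self.theta[k](x).permute(0, 3, 1, 2).reshape(N, V, -1)  # (N, V, C'*T)
            kmat = self.phi[k](x).reshape(N, -1, V)                     # (N, C'*T, V)
            Ck = F.softmax(torch.bmm(q, kmat), dim=-1)                  # (N, V, V)
            Ak = self.A[k] + self.B[k] + self.alpha * Ck                # broadcast to (N, V, V)
            agg = torch.einsum("nctv,nvw->nctw", x, Ak)                 # neighbour aggregation
            out = out + self.conv[k](agg)
        return out
```

Because B_k and alpha are ordinary parameters while C_k is computed from the input, the effective topology varies across layers and across samples, which is exactly the adaptivity the paper argues for. Details such as the normalisation of C_k and the initialisation of B_k are simplified here relative to the paper.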

The attention module further refines graph convolution by adaptively recalibrating activations across joints, frames, and channels. This design strategically focuses computational resources on the most discriminative parts of the input, thereby enhancing performance.
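
Below is a minimal sketch of how such a spatial-temporal-channel gate can be realised, again as an illustration rather than the paper's exact module: each gate pools the feature map, scores joints, frames, or channels with a small convolution or bottleneck MLP, and rescales the input. The kernel size and reduction ratio are assumed values.

```python
import torch
import torch.nn as nn

class STCAttention(nn.Module):
    """Sketch of a spatial-temporal-channel attention block: three lightweight
    gates applied in sequence over joints, frames, and channels."""

    def __init__(self, channels, reduction=4, kernel_size=9):
        super().__init__()
        pad = (kernel_size - 1) // 2
        self.joint_att = nn.Conv1d(channels, 1, kernel_size=1)                         # one score per joint
        self.frame_att = nn.Conv1d(channels, 1, kernel_size=kernel_size, padding=pad)  # one score per frame
        self.channel_att = nn.Sequential(                                              # SE-style channel gate
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, T, V) skeleton feature maps.
        n, c, t, v = x.shape
        # Spatial gate: average over frames, score each joint, rescale.
        sj = self.sigmoid(self.joint_att(x.mean(dim=2)))         # (N, 1, V)
        x = x * sj.unsqueeze(2)
        # Temporal gate: average over joints, score each frame, rescale.
        st = self.sigmoid(self.frame_att(x.mean(dim=3)))         # (N, 1, T)
        x = x * st.unsqueeze(3)
        # Channel gate: global pooling followed by a bottleneck MLP.
        sc = self.sigmoid(self.channel_att(x.mean(dim=(2, 3))))  # (N, C)
        return x * sc.view(n, c, 1, 1)
```

In the full model such a block would sit after each adaptive graph convolution, so that subsequent layers operate on the recalibrated features.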

Performance and Implications

The authors evaluate MS-AAGCN on two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrating superior performance over existing state-of-the-art models. Specifically, the model achieves an accuracy improvement of 7.9% and 8.5% on the cross-view (CV) and cross-subject (CS) benchmarks of the NTU-RGBD dataset, respectively. These gains reflect the model's ability to learn contextual and hierarchical representations from the skeleton data.

Practical and Theoretical Implications

Practically, the proposed method can be highly beneficial in applications such as video surveillance and human-computer interaction where accurate action recognition is crucial. Theoretically, the paper paves the way for further explorations into adaptive and attention-based graph convolutional networks that can be applied in broader contexts beyond action recognition.

Future Developments

Future research could explore the integration of RGB data with skeleton data more effectively, leveraging the strengths of both modalities. Additionally, the combination of action recognition with pose estimation frameworks could yield more holistic and efficient systems.

In summary, this paper presents a significant advancement in skeleton-based action recognition by introducing a dynamic and adaptive graph convolutional framework enhanced with attention mechanisms. The multi-stream approach broadens the scope of information utilized, leading to robust performance improvements, and sets the stage for future innovations in this domain.