An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition (1902.09130v2)

Published 25 Feb 2019 in cs.CV

Abstract: Skeleton-based action recognition is an important task that requires the adequate understanding of movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: NTU RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.

Authors (5)

Chenyang Si
Wentao Chen
Wei Wang
Liang Wang
Tieniu Tan

Summary

The paper presents AGC-LSTM, a novel architecture that integrates graph convolution, LSTM units, and attention mechanisms to capture spatiotemporal dynamics in skeletal actions.
It reports state-of-the-art accuracy on both benchmarks: 95.0% (Cross-View) and 89.2% (Cross-Subject) on NTU RGB+D, and 93.3% on Northwestern-UCLA, outperforming the earlier models it is compared against.
The model’s temporal hierarchical design and targeted attention on key joints enhance efficiency and robustness, offering promising applications in real-time action recognition.

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

The paper addresses skeleton-based action recognition with an Attention Enhanced Graph Convolutional Long Short-Term Memory Network (AGC-LSTM). The method models the spatial and temporal dynamics of skeletal actions jointly by combining graph structures with recurrent neural networks, with the aim of extracting more discriminative features at lower computational cost.

Key Contributions

The AGC-LSTM architecture diverges from traditional methodologies by integrating graph convolutional operations with LSTM units to form a structured spatiotemporal analysis mechanism. The design captures both individual joint dynamics and their correlative movements across time by processing graph-structured data rather than flat sequences. Critically, the model addresses the co-occurrence relationship that exists between spatial configurations and temporal dynamics, a challenge that previous models have struggled to effectively tackle.
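To make the idea of placing graph convolutions inside LSTM units concrete, here is a minimal sketch of one recurrent step in which every gate is computed with a graph convolution over the joints rather than a plain matrix product. It is an illustration under stated assumptions, not the authors' implementation: the row-normalized adjacency with self-loops, the parameter names (`W_xi`, `W_hi`, `b_i`, ...), and the helper `constant_params` are all hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Add self-loops and row-normalize (an assumed, common normalization)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def graph_conv(a_norm, x, w):
    """Average each joint's neighborhood features, then apply a shared linear map."""
    return matmul(matmul(a_norm, x), w)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gc_lstm_step(a_norm, x, h, c, params):
    """One recurrent step over N joints; x is N x F_in, h and c are N x F_hid."""
    def gate(name, activation):
        from_input = graph_conv(a_norm, x, params["W_x" + name])
        from_state = graph_conv(a_norm, h, params["W_h" + name])
        bias = params["b_" + name]
        return [[activation(p + q + b) for p, q, b in zip(row_x, row_h, bias)]
                for row_x, row_h in zip(from_input, from_state)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    u = gate("u", math.tanh)  # candidate cell update
    c_new = [[fv * cv + iv * uv for fv, cv, iv, uv in zip(fr, cr, ir, ur)]
             for fr, cr, ir, ur in zip(f, c, i, u)]
    h_new = [[ov * math.tanh(cv) for ov, cv in zip(o_row, c_row)]
             for o_row, c_row in zip(o, c_new)]
    return h_new, c_new

def constant_params(f_in, f_hid, value):
    """Hypothetical helper: fill every weight with one value, for demonstration only."""
    params = {}
    for name in "ifou":
        params["W_x" + name] = [[value] * f_hid for _ in range(f_in)]
        params["W_h" + name] = [[value] * f_hid for _ in range(f_hid)]
        params["b_" + name] = [0.0] * f_hid
    return params

# Toy skeleton: three joints in a chain (0 - 1 - 2), two input features per joint.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_norm = normalize_adjacency(chain)
x_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [[0.0, 0.0] for _ in range(3)]
c_prev = [[0.0, 0.0] for _ in range(3)]
h_t, c_t = gc_lstm_step(a_norm, x_t, h_prev, c_prev, constant_params(2, 2, 0.1))
```

Because the hidden and cell states keep one row per joint, the recurrence carries graph-structured state through time instead of flattening the skeleton into a single vector, which is the property the paragraph above describes.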

Additionally, an attention mechanism increases the weight of key joints within each AGC-LSTM layer. This lets the model focus on the parts of the skeleton that are most informative for classification, which improves accuracy and makes the model more robust to redundant joint information.
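The joint-level attention can be sketched in a few lines. The version below is a deliberately simplified assumption, not the scoring network from the paper: it builds a query as the mean hidden state over all joints, scores each joint by a dot product with that query, squashes the score with a sigmoid, and then scales each joint's features by `1 + alpha` so that no joint is ever suppressed to zero. The function name `joint_attention` and the scoring rule are hypothetical.

```python
import math

def joint_attention(hidden):
    """Return per-joint attention weights and the enhanced hidden states.

    hidden: list of N joints, each a list of D features.
    Assumed scoring: sigmoid(dot(h_i, mean_j h_j)); purely illustrative.
    """
    n = len(hidden)
    d = len(hidden[0])
    # Global query summarizing the whole skeleton at this time step.
    query = [sum(row[k] for row in hidden) / n for k in range(d)]
    # One scalar weight in (0, 1) per joint.
    alphas = [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, query))))
              for row in hidden]
    # Residual-style enhancement: keep the original features and add the attended part.
    enhanced = [[(1.0 + alpha) * v for v in row] for alpha, row in zip(alphas, hidden)]
    return alphas, enhanced

hidden_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
alphas, enhanced = joint_attention(hidden_states)
```

In this toy input the third joint is most aligned with the query, so it receives the largest weight, while the all-zero joint gets the neutral weight 0.5 and stays at zero after scaling.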

A further contribution is the implementation of a temporal hierarchical architecture that expands the temporal receptive field of the upper AGC-LSTM layers, allowing for a more nuanced semantic understanding of complex action sequences while reducing computational overhead.
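One simple way to obtain such a hierarchy is to average neighboring time steps between recurrent layers, so that each step seen by an upper layer summarizes several input frames and the upper layer runs over a shorter sequence. The sketch below assumes average pooling with a hypothetical `window` and `stride`; the summary does not state the values or the exact pooling used in the paper.

```python
def temporal_average_pool(sequence, window, stride):
    """Average `window` consecutive frames, moving `stride` frames each time.

    sequence: list of T frames, each a list of features.
    Frames that do not fill a complete window at the end are dropped.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    pooled = []
    for start in range(0, len(sequence) - window + 1, stride):
        chunk = sequence[start:start + window]
        pooled.append([sum(frame[k] for frame in chunk) / window
                       for k in range(len(chunk[0]))])
    return pooled

# Six frames with a single feature each.
frames = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
after_first_layer = temporal_average_pool(frames, window=2, stride=2)
after_second_layer = temporal_average_pool(after_first_layer, window=2, stride=2)
```

With these assumed settings the sequence shrinks from six steps to three and then to one, and a step in the top layer covers four original frames. Fewer recurrent steps in the upper layers is where the reduction in computation comes from.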

Experimental Results

The AGC-LSTM was evaluated on two established datasets: NTU RGB+D and Northwestern-UCLA. On the NTU RGB+D dataset, the AGC-LSTM surpassed contemporary state-of-the-art models with an accuracy of 95.0% on the Cross-View (CV) evaluation and 89.2% on Cross-Subject (CS) evaluation. On the Northwestern-UCLA dataset, it achieved an accuracy of 93.3%.

Significant performance improvements over previous models such as HCN, ST-GCN, and PB-GCN were observed, especially in terms of handling complex interactions and similar action classes. This is largely attributed to the attention mechanisms effectively isolating discriminative features and the model's hierarchical temporal pooling approach.

Implications and Future Work

Practically, the AGC-LSTM framework offers enhanced capabilities in accurately recognizing human actions from skeletal data, which has applications in areas like surveillance, human-computer interaction, and sports analytics. The model’s efficiency gains and accuracy improvements suggest its potential utility in real-time scenarios where computational resources are limited.

Theoretically, the paper contributes to the understanding of how graph structures can be effectively combined with recurrent models to handle spatiotemporal data. The successful integration of attention mechanisms within this context paves the way for further innovations in neural network architectures dealing with structured data.

Future research directions might explore integrating additional data modalities, such as object appearance alongside skeleton data, to address misclassification challenges that arise from similar action sequences. Expanding the architecture to incorporate pose-object relations could enhance action recognition performance further. Additionally, exploring other graph neural network variants or optimization strategies could yield additional improvements in temporal dynamic modeling.

In summary, the proposed AGC-LSTM framework represents a significant advancement in the domain of skeleton-based action recognition, offering both practical application improvements and theoretical insights into the integration of graph structures within neural networks.
