Spatial Temporal Transformer Network for Skeleton-based Action Recognition
The paper "Spatial Temporal Transformer Network for Skeleton-based Action Recognition" describes a novel approach to advance skeleton-based human action recognition using Transformer networks. This task has garnered substantial attention due to the robust nature of skeleton data in accommodating dynamic variations such as illumination changes and camera perspectives. Despite previous advances, effectively encoding latent information from 3D skeleton data remains challenging.
This research introduces the Spatial-Temporal Transformer Network (ST-TR), comprising two primary components: a Spatial Self-Attention (SSA) module and a Temporal Self-Attention (TSA) module. Both modules employ the Transformer self-attention mechanism, originally developed for NLP tasks, to capture dependencies between skeletal joints. This adaptation lets the network model intra-frame joint interactions through SSA and inter-frame correlations through TSA.
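To make the adaptation concrete, here is a minimal PyTorch sketch of self-attention applied across the joints of a single frame. It is illustrative only: the class name, tensor shapes, and head count are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class JointSelfAttention(nn.Module):
    """Illustrative sketch: multi-head self-attention over the joint axis.

    Names and shapes are assumptions, not the paper's reference code.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, channels) -- joint features of one frame.
        # Every joint attends to every other joint, so correlations are
        # learned per input instead of being fixed by a skeleton graph.
        out, _ = self.attn(x, x, x)
        return out
```

Note that `channels` must be divisible by `num_heads`; such a layer can then be stacked and interleaved with temporal processing, as the two-stream design below describes.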
Key Contributions and Results
The contributions of this paper can be detailed as follows:
- Novel Two-stream Architecture: The paper proposes an architecture that applies self-attention along both the spatial and the temporal dimension. The design rests on the premise that action recognition benefits more from adaptively learned joint correlations than from fixed structural representations such as a predefined skeleton graph.
- Spatial Self-Attention (SSA): The SSA module dynamically models intra-frame dependencies between joints, allowing the network to capture complex body interactions that graph convolutional networks (GCNs) with fixed topologies cannot easily represent (a runnable sketch of both modules follows this list).
- Temporal Self-Attention (TSA): The TSA module captures inter-frame dynamics, enabling analysis of action sequences across long frame ranges. This helps extract meaningful temporal patterns even when joint-motion sequences are misaligned or irregular.
- Superior Performance: Empirical evaluation demonstrates that ST-TR outperforms existing state-of-the-art models on both the NTU-RGB+D 60 and NTU-RGB+D 120 datasets. These improvements underscore the efficacy of self-attention over purely convolutional approaches.
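The two modules differ mainly in which axis the attention operates over, as the hedged sketch below shows: SSA folds the time axis into the batch so attention acts across joints within each frame, while TSA folds the joint axis into the batch so attention acts across frames for each joint. The shared attention layer and reshaping scheme are simplifying assumptions for illustration; the paper keeps the two streams separate and fuses their predictions.

```python
import torch
import torch.nn as nn

def spatial_attention(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, joints, channels)
    b, t, v, c = x.shape
    frames = x.reshape(b * t, v, c)            # one sequence of joints per frame
    out, _ = attn(frames, frames, frames)      # SSA: joints attend to joints
    return out.reshape(b, t, v, c)

def temporal_attention(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, joints, channels)
    b, t, v, c = x.shape
    tracks = x.permute(0, 2, 1, 3).reshape(b * v, t, c)  # one sequence of frames per joint
    out, _ = attn(tracks, tracks, tracks)                # TSA: frames attend to frames
    return out.reshape(b, v, t, c).permute(0, 2, 1, 3)

# Toy usage: 25 joints as in NTU-RGB+D, 30 frames, 64-channel features.
attn = nn.MultiheadAttention(64, num_heads=8, batch_first=True)
x = torch.randn(2, 30, 25, 64)
ssa_out = spatial_attention(attn, x)   # (2, 30, 25, 64)
tsa_out = temporal_attention(attn, x)  # (2, 30, 25, 64)
```

Factorizing attention this way keeps the quadratic cost of each call proportional to the number of joints or the number of frames, rather than to their product.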
Implications and Future Directions
The ST-TR model improves skeleton-based action recognition while reducing parameter complexity. It paves the way for further exploration of Transformer-based models in other areas of AI, such as video understanding and spatial reasoning in robotics.
Future work might refine the self-attention modules, optimize the architecture for real-time inference, or extend the approach to multimodal data. Fine-tuned variants of Transformer networks might also offer improvements in specialized domains such as healthcare monitoring and athlete fatigue analysis.
Self-attention on skeleton data is a promising direction for understanding complex human motion. The framework improves accuracy while opening avenues for both theoretical exploration and practical deployment across varied domains in artificial intelligence.