- The paper presents ST-TR, a model that uses spatial and temporal self-attention to overcome the fixed-topology limitations of graph-convolutional methods in action recognition.
- It features distinct Spatial Self-Attention and Temporal Self-Attention modules that dynamically learn joint dependencies from 3D skeleton data.
- Experimental results on NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400 demonstrate improved accuracy and efficiency compared to baseline methods.
An Evaluation of Spatial-Temporal Transformer Networks for Skeleton-Based Human Action Recognition
This paper presents a novel approach to skeleton-based human action recognition: Spatial and Temporal Transformer Networks (ST-TR), which apply the Transformer self-attention mechanism to motion sequences represented as graphs of 3D skeleton joints. The design addresses limitations inherent in Spatial-Temporal Graph Convolutional Networks (ST-GCNs), whose fixed, predefined graph structure constrains how spatial and temporal dependencies can be captured.
Model Design and Architecture
The ST-TR model is a two-stream network built around two primary modules: Spatial Self-Attention (SSA) and Temporal Self-Attention (TSA). Each module targets a distinct kind of dependency: SSA captures intra-frame dynamics, modeling the relationships between joints within the same skeleton frame, while TSA captures inter-frame correlations by modeling how each joint evolves over time. Placing the two modules in separate streams allows spatial and temporal features to be extracted independently; a minimal sketch of both modules follows.
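To make the division of labor concrete, here is a minimal PyTorch sketch of the two modules. It assumes skeleton features shaped (N, C, T, V) for batch, channels, frames, and joints; the class names, tensor layout, and use of nn.MultiheadAttention are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Attention across the V joints of each frame (intra-frame)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        # channels must be divisible by heads
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (N, C, T, V)
        n, c, t, v = x.shape
        # Fold time into the batch so each frame attends over its joints.
        seq = x.permute(0, 2, 3, 1).reshape(n * t, v, c)
        out, _ = self.attn(seq, seq, seq)      # (N*T, V, C)
        return out.reshape(n, t, v, c).permute(0, 3, 1, 2)

class TemporalSelfAttention(nn.Module):
    """Attention across the T frames of each joint (inter-frame)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (N, C, T, V)
        n, c, t, v = x.shape
        # Fold joints into the batch so each joint attends over time.
        seq = x.permute(0, 3, 2, 1).reshape(n * v, t, c)
        out, _ = self.attn(seq, seq, seq)      # (N*V, T, C)
        return out.reshape(n, v, t, c).permute(0, 3, 2, 1)
```

Note that the two modules share the same attention machinery; only the axis that attention runs over changes, which is what lets each stream specialize.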
SSA and TSA play the role that graph and temporal convolutions play in standard architectures, but with greater flexibility. Where traditional methods are constrained by a fixed graph topology, the ST-TR model learns joint dependencies dynamically, adapting them to each action sequence; this adaptability makes it more robust to complex action dynamics. The contrast is sketched below.
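The fixed-versus-learned distinction comes down to which mixing matrix aggregates the joints. The sketch below is a simplified, hypothetical illustration (single head, no output projection, features shaped (N, C, V)): the graph-convolution path reuses one adjacency matrix A for every sequence, while the attention path recomputes its mixing weights from the input itself.

```python
import torch
import torch.nn.functional as F

def gcn_aggregate(x, A):
    # ST-GCN style: every sequence mixes joints with the SAME V x V matrix A.
    return torch.einsum('uv,ncv->ncu', A, x)           # x: (N, C, V)

def attn_aggregate(x, Wq, Wk):
    # SSA style: the mixing matrix is recomputed from the data itself,
    # so joint dependencies can differ between action sequences.
    q = torch.einsum('cd,ncv->ndv', Wq, x)             # (N, D, V)
    k = torch.einsum('cd,ncv->ndv', Wk, x)
    scores = torch.einsum('ndu,ndv->nuv', q, k) / q.shape[1] ** 0.5
    return torch.einsum('nuv,ncv->ncu', F.softmax(scores, dim=-1), x)
```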
Experimental Validation and Results
To validate the proposed methodology, experiments were conducted on three large-scale datasets: NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400. ST-TR consistently outperformed existing architectures, achieving state-of-the-art performance when using joint coordinates alone as input and competitive results when bone information was added. In particular, ST-TR improved on baseline ST-GCN and AGCN configurations in both accuracy and efficiency.
Introducing self-attention in SSA and TSA also improved parameter efficiency, reducing the number of parameters while maintaining or improving accuracy. The temporal module in particular cut computational cost substantially compared with traditional temporal convolutional approaches.
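A back-of-the-envelope comparison shows where the savings can come from (the channel width and kernel size below are assumed for illustration, not the paper's reported figures): a temporal convolution mixing C channels over K_t frames needs roughly K_t * C * C weights, whereas self-attention needs only its query/key/value and output projections, about 4 * C * C, regardless of how many frames it relates.

```python
# Illustrative parameter comparison; C and K_t are assumed values,
# not figures taken from the paper.
C, K_t = 128, 9                     # channel width, temporal kernel size

tcn_params = K_t * C * C            # temporal conv:  9 * 128 * 128 = 147,456
attn_params = 4 * C * C             # Q/K/V/out proj: 4 * 128 * 128 =  65,536

print(f"temporal conv ~ {tcn_params:,} weights")
print(f"self-attention ~ {attn_params:,} weights")
```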
Implications and Speculation on Future Directions
The integration of self-attention into spatial-temporal graph-based models has substantial implications for human action recognition. The design not only treats spatial and temporal dependencies more expressively but also promotes a modular network design that could be adapted to other graph-based tasks beyond action recognition.
From a practical standpoint, these results suggest that self-attention is broadly applicable to other modalities where structured graph data exists. Future research might extend the self-attention framework to multi-modal datasets that pair skeletal data with richer contextual signals, such as audio or environmental cues.
The implementation details, including publicly available code, pave the way for further experimentation with and adoption of the ST-TR framework. As Transformer-based models gain momentum across computer vision, the ST-TR methodology stands as a promising avenue for building a comprehensive understanding of complex movement patterns in human action recognition.
In conclusion, this paper introduces an innovative pathway for action recognition in non-Euclidean spaces, establishing a benchmark for future developments in spatial-temporal analysis using self-attention mechanisms.