- The paper introduces a novel ST-LSTM architecture that integrates spatio-temporal dynamics with trust gates to assess input reliability.
- The method leverages a tree-structured traversal and multi-modal fusion to enhance feature representation and mitigate noise.
- Experiments on benchmark datasets demonstrate improved accuracy and robustness over state-of-the-art action recognition techniques.
Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates
The paper "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates" addresses human action recognition from 3D skeletal data, building on recurrent neural networks (RNNs) and in particular Long Short-Term Memory (LSTM) models. Its main objective is to improve the accuracy and robustness of action recognition from skeletal data by modeling both spatial and temporal dependencies through a novel network architecture.
Overview of the Proposed Method
The paper extends conventional recurrent neural network models, particularly LSTMs, to simultaneously capture both spatial and temporal dynamics inherent in skeletal movement data. The proposed Spatio-Temporal LSTM (ST-LSTM) network is designed to analyze the connections and patterns in skeletal joints by modeling:
- Temporal Dependencies: The transitions and movements across different time frames. Traditional LSTM models focus on these dependencies, capturing how joint positions evolve over time.
- Spatial Dependencies: The simultaneous relationships and configurations between different joints within the same frame. The spatial context is crucial for understanding static posture-related information.
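Concretely, an ST-LSTM cell carries two context streams instead of one: a hidden/cell state arriving from the previous joint (spatial) and one arriving from the previous frame (temporal), each with its own forget gate. The following is a minimal scalar sketch of that recurrence; the toy weight dictionary `w` and the scalar arithmetic are illustrative assumptions (the paper's cells use learned vector-valued gates):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def st_lstm_step(x, h_s, c_s, h_t, c_t, w):
    """One toy scalar ST-LSTM step at joint j, frame t.
    h_s, c_s: hidden/cell state from the previous joint (spatial context).
    h_t, c_t: hidden/cell state from the previous frame (temporal context).
    w: dict of per-gate weights (x, h_s, h_t, bias) -- illustrative values.
    """
    # Every gate sees the input plus BOTH context streams.
    pre = lambda k: w[k][0] * x + w[k][1] * h_s + w[k][2] * h_t + w[k][3]
    i   = sigmoid(pre("i"))    # input gate
    f_s = sigmoid(pre("fs"))   # spatial forget gate
    f_t = sigmoid(pre("ft"))   # temporal forget gate
    o   = sigmoid(pre("o"))    # output gate
    u   = math.tanh(pre("u"))  # candidate cell update
    # The cell state mixes the new input with spatial AND temporal memory.
    c = i * u + f_s * c_s + f_t * c_t
    h = o * math.tanh(c)
    return h, c

# Toy demo: one step with uniform weights and zero initial context.
w = {k: (0.5, 0.3, 0.2, 0.0) for k in ("i", "fs", "ft", "o", "u")}
h, c = st_lstm_step(1.0, 0.0, 0.0, 0.0, 0.0, w)
```

The key departure from a standard LSTM is the pair of forget gates: the cell can decide independently how much posture context (from the neighboring joint) and how much motion context (from the previous frame) to retain.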
To handle these dependencies efficiently, the authors introduce a tree-structured traversal that follows the natural kinematic connections between joints, rather than a simple joint chain that ignores them. This yields a semantically more meaningful visiting order for the spatial recurrence.
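The traversal can be sketched as a depth-first walk over the skeleton tree in which each edge is walked down and back up, so spatial context always flows along actual bones. The toy five-joint skeleton below is an illustrative assumption, not the paper's exact joint layout:

```python
def tree_traversal(adjacency, root):
    """Visiting order over a skeleton tree: each edge is traversed
    down and back up, so consecutive joints in the order are always
    kinematically connected."""
    order = []

    def visit(j, parent):
        order.append(j)
        for k in adjacency.get(j, []):
            if k != parent:
                visit(k, j)
                order.append(j)  # return to the branching joint

    visit(root, None)
    return order

# Toy skeleton: torso(0)-neck(1)-head(2), torso-l_arm(3), torso-r_arm(4).
skeleton = {0: [1, 3, 4], 1: [2]}
print(tree_traversal(skeleton, 0))  # [0, 1, 2, 1, 0, 3, 0, 4, 0]
```

Note that in the resulting order every adjacent pair is a real bone, which is exactly the property a flat joint chain lacks.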
Trust Gate Mechanism
A noteworthy innovation in this research is the introduction of a "trust gate" mechanism within the ST-LSTM units. At each spatio-temporal step, the gate predicts the current input from the preceding spatial and temporal context and compares that prediction with the actual input to assess its reliability. The trust gate downweights noisy or inaccurate skeletal measurements, enhancing the model's robustness against the noise and occlusion often encountered with depth sensors such as Kinect.
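In spirit, the trust score is a Gaussian-like function of the prediction error, and it modulates how much the input is allowed to write into the cell state. The scalar sketch below follows the paper's formulation only loosely; the width parameter `lam` and the specific update shape are illustrative assumptions:

```python
import math

def trust(x, x_pred, lam=1.0):
    """G(z) = exp(-lam * z^2): close to 1 when the input matches the
    prediction, close to 0 for a large mismatch."""
    return math.exp(-lam * (x - x_pred) ** 2)

def gated_cell_update(tau, i, u, f_s, c_s, f_t, c_t):
    """A trusted step (tau near 1) writes the new input into memory;
    an untrusted step mostly preserves the existing spatial/temporal
    cell states instead."""
    return tau * i * u + (1.0 - tau) * (f_s * c_s + f_t * c_t)

# A reading close to the prediction is trusted; an outlier (e.g. an
# occluded joint reported at a wrong position) is gated down.
print(trust(0.50, 0.48), trust(5.0, 0.5))
```

The practical effect is that a single corrupted measurement cannot overwrite memory built up from reliable earlier steps.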
Multi-Modal Feature Fusion
Another contribution is the integration of multi-modal feature fusion within the ST-LSTM, allowing the model to leverage both geometric features (i.e., 3D joint positions) and complementary visual descriptors derived from RGB data, like HOG and HOF features. This fusion is not simply at the input level but is internally managed within the network to better handle the disparity between feature modalities.
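One way to picture fusion inside the cell rather than at the input is to give each modality its own input gate and candidate update before they meet in the shared cell state. This scalar sketch is an assumption-laden simplification (the toy weight dictionary and the per-modality gating scheme are illustrative, not the paper's exact equations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fused_update(x_geo, x_vis, h, w):
    """Each modality (geometric joint feature x_geo, visual RGB-derived
    feature x_vis) gets its own input gate and candidate update, so the
    network can weight the two evidence streams independently before
    they are summed into the shared cell update."""
    upd = 0.0
    for x, key in ((x_geo, "geo"), (x_vis, "vis")):
        i = sigmoid(w[key + "_i"][0] * x + w[key + "_i"][1] * h)
        u = math.tanh(w[key + "_u"][0] * x + w[key + "_u"][1] * h)
        upd += i * u  # modality-specific contribution to the cell
    return upd

# Toy demo: agreeing geometric evidence, conflicting visual evidence.
w = {k: (0.5, 0.1) for k in ("geo_i", "geo_u", "vis_i", "vis_u")}
upd = fused_update(1.0, -1.0, 0.0, w)
```

Gating each modality separately is what lets the network compensate when one stream (say, a noisy 3D joint estimate) is less reliable than the other at a given step.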
Experimental Evaluation
The authors conduct extensive experiments on seven challenging benchmark datasets, including NTU RGB+D, UT-Kinect, and SBU Interaction. The results demonstrate that the proposed ST-LSTM with trust gates not only outperforms several state-of-the-art methods but also shows significant improvements on noisy data, as shown by experiments with artificially degraded inputs.
Compared with models that process the joints as a simple chain, the tree-structured traversal significantly improved performance. The trust gate boosted accuracy further by dynamically modulating the influence of inputs according to their estimated reliability.
Implications and Future Directions
The findings from this paper present meaningful implications for the development of robust human action recognition systems, particularly for applications in surveillance, human-computer interaction, and robotics, where real-time and accurate interpretation of human actions is critical.
The use of trust gates presents a valuable technique for handling noisy input data, a common challenge in real-world scenarios. This mechanism could be extended and potentially generalized to other data domains where input noise and occlusion are prevalent.
Looking forward, the integration of more sophisticated deep learning models, such as transformers in multi-modal learning, could further enhance the capabilities of skeleton-based action recognition systems. Additionally, exploring unsupervised or semi-supervised versions of the ST-LSTM could open pathways for leveraging unlabeled data, which is abundant in practical applications.
Overall, the work significantly advances the field by jointly modeling spatial and temporal dynamics with an innovative gating mechanism and by effectively combining multi-modal features for action recognition.