
Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition (1607.07043v1)

Published 24 Jul 2016 in cs.CV, cs.AI, cs.LG, and cs.NE

Abstract: 3D action recognition - analysis of human actions based on 3D skeleton data - becomes popular recently due to its succinctness, robustness, and view-invariant representation. Recent attempts on this problem suggested to develop RNN-based learning methods to model the contextual dependency in the temporal domain. In this paper, we extend this idea to spatio-temporal domains to analyze the hidden sources of action-related information within the input data over both domains concurrently. Inspired by the graphical structure of the human skeleton, we further propose a more powerful tree-structure based traversal method. To handle the noise and occlusion in 3D skeleton data, we introduce new gating mechanism within LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell. Our method achieves state-of-the-art performance on 4 challenging benchmark datasets for 3D human action analysis.

Citations (1,058)

Summary

  • The paper presents an innovative ST-LSTM model that concurrently processes spatial and temporal data using trust gates to mitigate noisy inputs.
  • It employs a tree-structure based joint traversal to preserve kinematic relationships and effectively capture human motion dynamics.
  • Experimental results demonstrate state-of-the-art performance, including 77.7% cross-view accuracy on NTU RGB+D and 93.3% on SBU Interaction.

Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition

The paper "Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition" by Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang presents a novel approach to 3D human action recognition: a Spatio-Temporal Long Short-Term Memory (ST-LSTM) network augmented with trust gates.

In the domain of human action recognition, the utilization of 3D skeleton data has gained traction for its robustness and view-invariant representation. Traditional methods predominantly focus on temporal modeling—leveraging Recurrent Neural Networks (RNNs) to capture temporal dynamics of human motion. However, these methods often neglect the spatial configuration of joints in each frame, which is equally vital for accurate action recognition.

Model Architecture and Innovations

The proposed ST-LSTM model diverges from conventional approaches by concurrently modeling spatial and temporal domains. Each ST-LSTM unit processes individual joints over both domains, thus encoding more comprehensive context information. Specifically, each joint receives hidden representations not just from previous frames but also from neighboring joints within the same frame. This dual-domain modeling is crucial for capturing both motion dynamics and spatial dependencies inherent in action recognition tasks.
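The dual-domain recurrence described above can be sketched as a single cell update that takes hidden and cell states from both a spatial predecessor (the neighboring joint in the same frame) and a temporal predecessor (the same joint in the previous frame). The following is a minimal NumPy sketch under the standard LSTM gate formulation with two forget gates; the function and parameter names are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_lstm_step(x, h_s, c_s, h_t, c_t, W, b):
    """One ST-LSTM step for a single joint (illustrative sketch).

    x        : input feature for joint j at frame t
    h_s, c_s : hidden/cell state of the spatial predecessor (joint j-1, frame t)
    h_t, c_t : hidden/cell state of the temporal predecessor (joint j, frame t-1)
    W, b     : stacked weights/biases for the five gates (i, f_s, f_t, o, u)
    """
    d = h_s.shape[0]
    z = W @ np.concatenate([x, h_s, h_t]) + b  # all gate pre-activations at once
    i   = sigmoid(z[0 * d:1 * d])   # input gate
    f_s = sigmoid(z[1 * d:2 * d])   # spatial forget gate
    f_t = sigmoid(z[2 * d:3 * d])   # temporal forget gate
    o   = sigmoid(z[3 * d:4 * d])   # output gate
    u   = np.tanh(z[4 * d:5 * d])   # candidate update
    c = i * u + f_s * c_s + f_t * c_t  # fuse spatial and temporal context
    h = o * np.tanh(c)
    return h, c
```

The key departure from a standard LSTM is the pair of forget gates: one modulates context flowing along the skeleton within a frame, the other modulates context flowing across frames.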

To better represent the structural information of human joints, a tree-structure based traversal method is proposed. This method leverages the adjacency properties of the skeletal data by organizing joints in a tree-like graph rather than a simple chain, thereby preserving kinematic relationships.
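The idea of the tree traversal can be illustrated with a short sketch: walking the skeleton tree depth-first and revisiting each parent on the way back up guarantees that consecutive joints in the resulting sequence are always physically adjacent, which a simple chain ordering cannot. The five-joint skeleton below is a hypothetical toy example, not the dataset's joint layout.

```python
def tree_traversal(tree, root):
    """Depth-first traversal that returns to the parent after each subtree,
    so every edge is walked in both directions and adjacent positions in the
    output sequence always correspond to adjacent joints in the skeleton."""
    order = [root]
    for child in tree.get(root, []):
        order.extend(tree_traversal(tree, child))
        order.append(root)  # step back to the parent before the next subtree
    return order

# Hypothetical toy skeleton: torso -> {neck -> head, hip}
skeleton = {"torso": ["neck", "hip"], "neck": ["head"]}
# tree_traversal(skeleton, "torso")
# -> ['torso', 'neck', 'head', 'neck', 'torso', 'hip', 'torso']
```

Each edge appears twice (once per direction), so spatial context can propagate both down the limbs and back toward the torso as the ST-LSTM sweeps over the sequence.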

Another key innovation is the introduction of a "trust gate" mechanism within the LSTM units. Given that 3D skeleton data acquired from depth sensors such as Microsoft Kinect can be noisy and occluded, the trust gate evaluates the reliability of input data at each spatio-temporal step. By analyzing discrepancies between actual input and a predicted input derived from contextual information, the trust gate dynamically adjusts the impact of noisy data on the memory cell states.
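A minimal sketch of this idea: predict the current input from the spatial and temporal hidden states, map the actual input into the same space, and let the trust score decay with the squared discrepancy between the two. The projection matrices `M1`, `M2` and the decay constant `lam` here are illustrative assumptions, and the commented cell update is a sketch of how such a score could gate the memory, not the paper's exact formulation.

```python
import numpy as np

def trust_gate(x, h_s, h_t, M1, M2, lam=0.5):
    """Per-unit trust score in (0, 1] for the current input (sketch).

    p  : input predicted from spatial/temporal context
    x2 : the actual input mapped into the same space
    The score approaches 1 when prediction and input agree, and decays
    toward 0 as the discrepancy grows, so noisy or occluded joints get
    little influence on the memory cell.
    """
    p = np.tanh(M1 @ np.concatenate([h_s, h_t]))  # context-based prediction
    x2 = np.tanh(M2 @ x)                          # actual input, same space
    return np.exp(-lam * (x2 - p) ** 2)           # Gaussian-shaped trust score

# Sketch of a trusted memory-cell update:
# c = tau * i * u + (1 - tau) * (f_s * c_s + f_t * c_t)
```

When the input looks unreliable (low `tau`), the cell leans on its stored spatio-temporal context rather than absorbing the new observation.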

Experimental Results

The paper provides extensive experimental results on four benchmark datasets: NTU RGB+D, SBU Interaction, UT-Kinect, and Berkeley MHAD. The proposed model consistently achieves state-of-the-art performance across these datasets, underscoring its effectiveness. For instance:

  • On the NTU RGB+D dataset, the ST-LSTM with trust gate achieved 69.2% accuracy for cross-subject evaluation and 77.7% for cross-view evaluation, substantially outperforming baseline methods.
  • On the SBU Interaction dataset, the model reached an accuracy of 93.3%, marking a substantial improvement over other skeleton-based methods.

These results can be attributed to the robust handling of noisy input via the trust gate and the comprehensive spatio-temporal modeling through the tree-traversal strategy.

Implications and Future Work

The implications of this research are both practical and theoretical. Practically, the enhanced accuracy in 3D action recognition tasks has direct applications in surveillance, human-computer interaction, and sports analytics. Theoretically, the combination of spatio-temporal RNNs with trust mechanisms paves the way for more resilient and adaptable neural architectures in handling corrupted sensory data.

Future developments could explore extending the trust gate mechanism to other forms of input noise and occlusion beyond skeletal data. Additionally, integrating this framework with other sensory inputs like RGB-D data could further enhance performance and applicability in complex real-world scenarios.

In conclusion, the ST-LSTM with trust gates represents a significant step forward in 3D human action recognition. By effectively addressing both spatial dependencies and input noise, this model sets a new benchmark for accuracy and robustness in the field.