
Co-occurrence Feature Learning for Skeleton based Action Recognition using Regularized Deep LSTM Networks (1603.07772v1)

Published 24 Mar 2016 in cs.CV and cs.LG

Abstract: Skeleton based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions. Considering that recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) can learn feature representations and model long-term temporal dependencies automatically, we propose an end-to-end fully connected deep LSTM network for skeleton based action recognition. Inspired by the observation that the co-occurrences of the joints intrinsically characterize human actions, we take the skeleton as the input at each time slot and introduce a novel regularization scheme to learn the co-occurrence features of skeleton joints. To train the deep LSTM network effectively, we propose a new dropout algorithm which simultaneously operates on the gates, cells, and output responses of the LSTM neurons. Experimental results on three human action recognition datasets consistently demonstrate the effectiveness of the proposed model.

Authors (7)
  1. Wentao Zhu (73 papers)
  2. Cuiling Lan (60 papers)
  3. Junliang Xing (80 papers)
  4. Wenjun Zeng (130 papers)
  5. Yanghao Li (43 papers)
  6. Li Shen (363 papers)
  7. Xiaohui Xie (84 papers)
Citations (850)

Summary

Co-occurrence Feature Learning for Skeleton-based Action Recognition using Regularized Deep LSTM Networks

The paper "Co-occurrence Feature Learning for Skeleton-based Action Recognition using Regularized Deep LSTM Networks" by Wentao Zhu et al. presents an advanced methodology for recognizing human actions from skeleton joint trajectories utilizing a fully connected deep Long Short-Term Memory (LSTM) neural network. The approach integrates multiple innovative regularization techniques to learn co-occurrence features and enhance model training through specialized dropout algorithms, demonstrating efficacy across several benchmark datasets.

The fundamental premise of the research lies in leveraging LSTM's capability to model long-term temporal dependencies in sequential data, which is particularly well-suited for dynamic skeleton-based action recognition. The authors construct an end-to-end network composed of alternating LSTM and feedforward layers to robustly capture and analyze motion information from the input skeleton sequences.
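The temporal modeling rests on the standard LSTM recurrence applied at each time step. A minimal numpy sketch of one step is given below; the gate layout, shapes, and function names are generic illustrations, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.
    x: input (D,); h_prev, c_prev: previous hidden/cell state (H,).
    W: (4H, D), U: (4H, H), b: (4H,) hold the stacked parameters for
    the input gate, forget gate, cell candidate, and output gate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell update
    o = sigmoid(z[3 * H:])       # output gate
    c = f * c_prev + i * g       # long-term memory update
    h = o * np.tanh(c)           # exposed hidden state
    return h, c
```

Stacking several such layers, with feedforward layers interleaved as the paper describes, yields the deep architecture whose final hidden states feed the action classifier.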

Core Contributions

  1. Co-occurrence Feature Learning:
    • The novelty in this model is the introduction of a regularization scheme designed specifically to learn co-occurrence features of skeleton joints. The concept is based on the observation that human actions are characterized by the concurrent movements of certain joints.
    • The model hierarchically groups neurons to explore different conjunctions of joints and applies an ℓ2,1-norm penalty to the connection weights, which encourages column-sparse weight matrices whose non-zero columns select discriminative subsets of joints.
  2. In-depth Dropout for LSTM Networks:
    • A distinctive dropout mechanism is proposed where dropout is applied not only to the output responses but also to internal components of the LSTM neurons, including input gates, forget gates, cells, and output gates.
    • This dropout scheme retains the integrity of time-based connections while applying regularization along the layers, enhancing the model's ability to generalize and prevent overfitting.
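To make the regularizer concrete: the ℓ2,1-norm of a weight matrix is the sum of the ℓ2 norms of its columns, and penalizing it drives entire columns to zero, so only a sparse subset of input joints retains non-zero connections. A minimal numpy sketch (the function name and usage are illustrative, not taken from the authors' code):

```python
import numpy as np

def l21_penalty(W):
    """l2,1-norm of W: the sum of the l2 norms of its columns.
    Adding lam * l21_penalty(W) to the training loss pushes entire
    columns of W toward zero (column sparsity), so each neuron group
    attends to a small subset of the input joints."""
    return float(np.sum(np.linalg.norm(W, axis=0)))
```

For example, a matrix with columns (3, 4) and (0, 0) has penalty 5.0: the zeroed column contributes nothing, which is exactly the behavior that lets the regularizer prune irrelevant joints.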

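The in-depth dropout scheme can be viewed as ordinary inverted dropout applied to the internal gate and cell activations of each LSTM unit, not just its output response. The helper below is a generic sketch under that reading, not the authors' implementation:

```python
import numpy as np

def inverted_dropout(x, p, rng, training=True):
    """Zero each unit with probability p and rescale survivors by
    1/(1-p) so the expected activation is unchanged. In the paper's
    scheme such masking is applied to the input-gate, forget-gate,
    cell, and output-gate activations along the layer (feedforward)
    direction, while recurrent (time) connections are left intact."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)
```

At inference time (`training=False`) the function is the identity, so no rescaling of weights is needed, which is the usual advantage of the inverted formulation.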
Experimental Results

The effectiveness of the proposed methodologies is empirically validated on three human action recognition datasets: SBU Kinect Interaction, HDM05, and CMU Motion Capture. The model consistently outperformed other state-of-the-art algorithms, highlighting several key numerical results:

  • SBU Kinect Interaction Dataset: Achieved an average accuracy of 90.41%, representing a significant improvement over previous methods that capped at approximately 80%.
  • HDM05 Dataset: Yielded an accuracy of 88.53% on 65 classes, showcasing superior performance compared to traditional multi-layer perceptron models and hierarchical RNNs.
  • CMU Motion Capture Dataset: On this comprehensive dataset of 45 classes, the model achieved an accuracy of 83.72%, demonstrating robustness across a broader spectrum of action categories.

Implications and Future Prospects

The proposed framework sets a precedent in capturing complex inter-joint dependencies and temporal dynamics for skeleton-based action recognition. The co-occurrence feature learning allows the model to adaptively focus on relevant joints, while the in-depth dropout technique significantly enhances model robustness and generalization.

From a practical standpoint, these advancements can be seamlessly integrated into various applications including intelligent video surveillance, human-computer interaction, and advanced video understanding systems. The accurate and efficient recognition of human actions from skeleton data has substantial implications for real-time activity monitoring and interactive systems.

Theoretically, this paper opens several avenues for further exploration. Future research can focus on extending the co-occurrence learning framework to other modalities beyond skeleton data, such as integrating visual and auditory cues to form a multimodal action recognition system. Additionally, there is potential for refining the co-occurrence regularization techniques to facilitate end-to-end training in even deeper network architectures.

In conclusion, the presented work by Zhu et al. is integral in advancing skeleton-based action recognition through innovative regularization and dropout strategies in deep LSTM networks. The demonstrated improvements across multiple datasets underscore the potential for these techniques to substantially enhance the accuracy and robustness of action recognition models in both research and practical applications.