Skeleton-based Action Recognition with Convolutional Neural Networks (1704.07595v1)

Published 25 Apr 2017 in cs.CV

Abstract: Current state-of-the-art approaches to skeleton-based action recognition are mostly based on recurrent neural networks (RNN). In this paper, we propose a novel convolutional neural networks (CNN) based framework for both action classification and detection. Raw skeleton coordinates as well as skeleton motion are fed directly into CNN for label prediction. A novel skeleton transformer module is designed to rearrange and select important skeleton joints automatically. With a simple 7-layer network, we obtain 89.3% accuracy on validation set of the NTU RGB+D dataset. For action detection in untrimmed videos, we develop a window proposal network to extract temporal segment proposals, which are further classified within the same network. On the recent PKU-MMD dataset, we achieve 93.7% mAP, surpassing the baseline by a large margin.

Authors (4)

Chao Li
Qiaoyong Zhong
Di Xie
Shiliang Pu

Citations (345)


Summary

The paper introduces a CNN-based framework that feeds raw skeleton coordinates and skeleton motion directly into the network for action classification and detection.
The proposed seven-layer architecture with a skeleton transformer module achieves 89.3% accuracy on the validation set of the NTU RGB+D dataset, outperforming RNN-based models.
For action detection, a window proposal network, a temporal adaptation of Faster R-CNN, extracts segment proposals and reaches 93.7% mAP on the PKU-MMD dataset.

Skeleton-based Action Recognition with Convolutional Neural Networks

The paper "Skeleton-based Action Recognition with Convolutional Neural Networks" by Chao Li et al. presents an approach to action recognition and detection built on convolutional neural networks (CNNs) rather than the recurrent neural networks (RNNs) more commonly used in this domain. The authors examine how well CNNs can model temporal patterns in skeletal data and compare their results against existing methods.

Methodology

The authors propose a CNN-based framework that takes raw skeleton coordinates together with skeleton motion as input for action classification and detection. A skeleton transformer module automatically rearranges and selects the most informative joints before the features reach the convolutional layers. The network is small, at only seven layers, yet reaches 89.3% accuracy on the validation set of the NTU RGB+D dataset for action classification.
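The two input-side ideas in this paragraph can be illustrated with a minimal sketch. It assumes that skeleton motion means the frame-to-frame displacement of each joint and that the transformer module acts as a linear map across the joint axis; the function names, shapes, and fixed weights below are hypothetical and are not taken from the authors' code. In a trained network the weight matrix would be learned rather than hand-set.

```python
# Illustrative sketch only: names, shapes, and weights are assumptions.

def skeleton_motion(frames):
    """Frame-to-frame displacement of every joint.

    frames: list of T frames, each a list of N joints, each joint (x, y, z).
    Returns T-1 frames of per-joint displacement vectors.
    """
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def transform_joints(frame, weights):
    """Map N input joints to M output joints with an M x N weight matrix.

    Each output joint is a weighted combination of the input joints, so a
    one-hot row selects a joint and a permutation matrix reorders joints.
    """
    return [
        tuple(sum(w * joint[c] for w, joint in zip(row, frame)) for c in range(3))
        for row in weights
    ]


# Two frames of a toy three-joint skeleton.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 1.0)],
]
motion = skeleton_motion(frames)

# Hand-set weights that keep joint 2, then joint 0, and drop joint 1.
select = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
reordered = transform_joints(frames[1], select)
```

With one-hot rows the map only selects and reorders joints; dense rows would let each output joint blend several input joints.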

For action detection in untrimmed videos, the authors adapt the Faster R-CNN framework to the temporal domain by developing a window proposal network (WPN), which extracts candidate temporal segments that are then classified within the same network. The framework achieves a mean average precision (mAP) of 93.7% on the PKU-MMD dataset, well above the baseline.
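To make the temporal analogue of region proposals concrete, the sketch below generates one-dimensional anchor windows and labels them by overlap with ground-truth segments. It is a generic Faster R-CNN-style illustration, not the paper's implementation: the stride, the anchor lengths, and the 0.7/0.3 thresholds (the defaults from the original Faster R-CNN) are all assumptions.

```python
# Illustrative sketch only: stride, anchor lengths, and thresholds are assumptions.

def temporal_iou(a, b):
    """Intersection over union of two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def make_anchors(num_positions, stride, lengths):
    """Candidate windows centred on each position of a 1D feature map."""
    anchors = []
    for i in range(num_positions):
        center = (i + 0.5) * stride
        for length in lengths:
            anchors.append((center - length / 2.0, center + length / 2.0))
    return anchors


def label_anchors(anchors, ground_truth, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored anchors."""
    labels = []
    for anchor in anchors:
        best = max((temporal_iou(anchor, g) for g in ground_truth), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


# Four feature-map positions, 8 frames apart, two window lengths per position.
anchors = make_anchors(num_positions=4, stride=8, lengths=[8, 16])
# One annotated action spanning frames 8 to 16.
labels = label_anchors(anchors, ground_truth=[(8.0, 16.0)])
```

A proposal network trained on such labels would score each anchor and regress its start and end, exactly as a 2D region proposal network does for boxes.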

Results

The paper reports results on two tasks:

Action Classification: The CNN framework outperforms existing RNN-based models such as STA-LSTM by significant margins in both cross-subject and cross-view scenarios. Both the skeleton motion input and the transformer module contribute to this improvement.
Action Detection: On the PKU-MMD dataset, the CNN model improves substantially over previous methods such as JCRRNN, with a reported gain of roughly 58 percentage points in mAP.

Implications and Future Directions

From a practical standpoint, the ability of CNNs to handle skeletal data for action recognition offers a potentially more computationally efficient alternative to RNNs, particularly for real-time applications. The method also recasts temporal action detection as one-dimensional object detection, which allows techniques developed for image-based detectors to be applied along the time axis.
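One consequence of treating detection as a one-dimensional object detection problem is that standard detector post-processing carries over directly. As an illustration, the sketch below applies greedy non-maximum suppression to scored temporal segments. This is the generic algorithm from image detection, shown here as an assumption about how such a pipeline could be completed; the summary does not state which post-processing the authors use, and the 0.5 threshold is arbitrary.

```python
# Illustrative sketch only: generic 1D NMS, not confirmed as the paper's procedure.

def temporal_iou(a, b):
    """Intersection over union of two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over temporal segments.

    Visits segments from highest to lowest score and keeps one only if it
    does not overlap an already kept segment by more than iou_thresh.
    Returns the indices of the kept segments in order of decreasing score.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Two pairs of heavily overlapping detections (frame indices).
segments = [(10, 50), (12, 52), (60, 90), (58, 95)]
scores = [0.9, 0.8, 0.7, 0.95]
kept = temporal_nms(segments, scores)
```

Here each overlapping pair collapses to its higher-scoring member, leaving one detection per action instance.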

Theoretically, this work opens avenues for further exploration into CNN-based methodologies for other forms of sequential data beyond skeleton sequences. Future research could focus on expanding the capabilities of the proposed CNN framework to facilitate online detection for real-time environments, a noted limitation in the current paper.

In conclusion, the paper demonstrates that CNN architectures can be effective for skeleton-based action recognition and detection, and it makes a case for CNNs as a viable option for sequential data modeling.
