- The paper introduces a novel representation by converting skeleton sequences into three clips, enabling enhanced spatial-temporal feature learning for 3D action recognition.
- It leverages a CNN model (using VGG19 conv5_1) and Multi-Task Learning to fuse temporal and spatial features, resulting in improved generalization across datasets.
- Experimental evaluations show significant accuracy gains on NTU RGB+D (79.57% cross-subject, 84.83% cross-view) and robust performance on SBU Kinect and CMU datasets.
A New Representation of Skeleton Sequences for 3D Action Recognition
The paper "A New Representation of Skeleton Sequences for 3D Action Recognition" introduces a novel method for 3D action recognition using skeleton sequences, where skeleton sequences are defined as 3D trajectories of human skeleton joints. The authors propose a transformation of the skeleton sequences into three distinct clips, which are then utilized for spatial-temporal feature learning through deep neural networks.
Methodology
The key innovation in this work is the transformation of skeleton sequences into clips. Each skeleton sequence is converted into three clips, one for each channel of the sequence's cylindrical coordinates. Every frame of a clip encodes the temporal dynamics of the entire sequence together with one particular spatial relationship between the joints; across its frames, a clip therefore covers several different spatial relationships, conveying important structural information about the human skeleton.
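A minimal NumPy sketch of this clip-generation step is given below. It assumes the input is an array of shape (T, J, 3) of Cartesian joint trajectories and that each spatial relationship is obtained by expressing all joints relative to one of a few reference joints (the specific joint indices used here are placeholders, not the paper's exact configuration); each reference joint then yields one frame per clip, with the three cylindrical channels split across the three clips.

```python
import numpy as np

def to_cylindrical(vec):
    """Convert Cartesian offsets (..., 3) to cylindrical coordinates (rho, theta, z)."""
    x, y, z = vec[..., 0], vec[..., 1], vec[..., 2]
    rho = np.sqrt(x**2 + y**2)
    theta = np.arctan2(y, x)
    return np.stack([rho, theta, z], axis=-1)

def skeleton_to_clips(seq, ref_joints):
    """
    seq: (T, J, 3) Cartesian joint trajectories of one skeleton sequence.
    ref_joints: indices of reference joints (illustrative choice).
    Returns an array of shape (3, len(ref_joints), T, J): three clips,
    one per cylindrical channel, with one frame per reference joint.
    """
    T, J, _ = seq.shape
    clips = np.zeros((3, len(ref_joints), T, J), dtype=np.float32)
    for f, r in enumerate(ref_joints):
        offsets = seq - seq[:, r:r + 1, :]   # joints expressed relative to reference joint r
        cyl = to_cylindrical(offsets)        # (T, J, 3)
        for c in range(3):
            clips[c, f] = cyl[..., c]        # frame f of clip c
    return clips                             # scale/quantize to images before feeding the CNN

# Toy usage: a 100-frame sequence with 25 joints (NTU RGB+D joint count)
seq = np.random.randn(100, 25, 3).astype(np.float32)
clips = skeleton_to_clips(seq, ref_joints=[4, 8, 12, 16])  # placeholder indices
print(clips.shape)  # (3, 4, 100, 25)
```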
The approach leverages a deep convolutional neural network (CNN) to learn long-term temporal information from the frames of these clips. Specifically, each frame is fed into a pre-trained VGG19 model and the feature maps of the convolutional layer conv5_1 are extracted. Temporal mean pooling is then applied along the temporal dimension of these feature maps to obtain a compact, temporally robust representation of each frame.
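This feature-extraction step can be sketched in PyTorch as follows. The slicing index for conv5_1 follows torchvision's VGG19 layout and is an assumption to verify against your installed version, as is the convention that the rows of each frame image correspond to time steps (real inputs would also need ImageNet normalization).

```python
import torch
import torch.nn as nn
from torchvision import models

# Truncate a pre-trained VGG19 at conv5_1 (index 28 of .features in torchvision's
# layout -- an assumption; check the module list of your torchvision version).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
conv5_1 = nn.Sequential(*list(vgg.features.children())[:29]).eval()

@torch.no_grad()
def frame_feature(frame_img):
    """
    frame_img: (B, 3, 224, 224) clip frame, resized and colour-coded so that
    image rows correspond to time steps and columns to joints (assumed layout).
    Returns a (B, 512*14) vector after temporal mean pooling over the rows.
    """
    fmap = conv5_1(frame_img)   # (B, 512, 14, 14) feature maps of conv5_1
    pooled = fmap.mean(dim=2)   # mean over the row (temporal) axis -> (B, 512, 14)
    return pooled.flatten(1)    # (B, 7168) compact per-frame representation

feat = frame_feature(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 7168])
```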
Subsequently, the extracted CNN features of the three clips at the same time-step are concatenated into a single feature vector. To perform recognition, the authors introduce a Multi-Task Learning Network (MTLN): the classification of each time-step's feature vector is treated as a separate task, and all tasks share the same network weights. Processing the feature vectors of the different time-steps jointly in this way exploits their intrinsic relationships and improves generalization; the network thereby incorporates both spatial structure and temporal dynamics for more robust action recognition.
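A compact PyTorch sketch of such a multi-task head is shown below. The hidden size, dropout rate, summed per-task loss, and averaging of task scores at test time are illustrative choices under the description above, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MTLN(nn.Module):
    """
    Sketch of a multi-task learning network: the same fully connected stack
    (shared weights) scores the concatenated feature of every time-step,
    treating each time-step as one classification task.
    """
    def __init__(self, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats):
        # feats: (B, S, feat_dim) -- S time-step features, each the concatenation
        # of the three per-clip CNN features at that time-step.
        return self.shared(feats)   # (B, S, num_classes): one task per time-step

def mtln_loss(logits, labels):
    # Sum of the per-task cross-entropy losses; every task shares the action label.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[:, s], labels) for s in range(logits.size(1)))

# Toy usage: 4 time-steps, three 7168-d clip features concatenated per step, 60 classes.
model = MTLN(feat_dim=3 * 7168, num_classes=60)
logits = model(torch.randn(2, 4, 3 * 7168))
prediction = logits.mean(dim=1).argmax(dim=1)   # average task scores at test time
loss = mtln_loss(logits, torch.tensor([3, 17]))
```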
Experimental Evaluations
The proposed method is validated on three datasets: NTU RGB+D, SBU Kinect Interaction, and CMU. The NTU RGB+D dataset, which contains over 56,000 sequences and 60 action classes, serves as a robust platform for evaluation using cross-subject and cross-view protocols. Experimental results demonstrate significant improvements over existing methods, with the proposed method achieving 79.57% accuracy in cross-subject evaluation and 84.83% in cross-view evaluation.
The SBU Kinect Interaction and CMU datasets, which contain far fewer sequences, also show notable performance gains. On the SBU dataset, 5-fold cross-validation yields an accuracy of 93.57%, while on the CMU dataset the method achieves 93.22% on a subset and 88.30% on the entire dataset. These gains are attributed to the method's comprehensive feature learning and representation capabilities, demonstrating its robustness even on datasets with noisy joint positions and large sequence variations.
Implications and Future Directions
The transformation of skeleton sequences into clips followed by CNN and MTLN-based learning signifies a considerable advancement in 3D action recognition. This representation provides a more robust and comprehensive approach to modeling temporal dynamics and spatial structures simultaneously, opening avenues for enhancing action recognition systems in various applications such as surveillance, healthcare, and human-computer interaction.
Future developments could explore the integration of more advanced temporal pooling techniques and the utilization of other pre-trained CNN architectures to potentially improve performance further. Additionally, investigating methods to handle skeleton sequences with varying numbers of joints and incorporating attention mechanisms to emphasize significant joint movements could provide promising directions for continued research.
Conclusion
The paper presents a method that leverages deep learning for effective 3D action recognition through a novel representation of skeleton sequences. By generating and learning from clips that encapsulate spatial-temporal information, the proposed approach demonstrates superior performance across multiple datasets. This advancement underscores the potential of combining CNNs with MTLN for enhanced feature learning and recognition in complex action sequences.