- The paper introduces a novel mapping of 3D skeleton data to 2D joint trajectory maps, enabling ConvNets to achieve state-of-the-art action recognition.
- It employs advanced encoding strategies, such as directional color mapping and saturation adjustments, to enhance spatio-temporal feature extraction.
- Late fusion of multiple ConvNet outputs from orthogonal projections significantly improves performance on benchmark datasets like MSRC-12, G3D, and UTD-MHAD.
Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks
The paper "Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks" introduces a novel framework for human action recognition utilizing 3D skeleton data in RGB-D videos. This approach capitalizes on the representational power of Convolutional Neural Networks (ConvNets) by effectively mapping the temporal dynamics and spatial configurations of joints into texture-like images called Joint Trajectory Maps (JTMs). The research addresses the persistent challenge of effectively integrating spatio-temporal information for video-based recognition tasks, a challenge that traditional methods often fail to meet comprehensively.
Methodological Contributions
The primary innovation of this paper is the transformation of spatio-temporal joint sequences into JTMs, which lets ConvNets learn discriminative features for distinguishing actions. This mapping involves:
- Projection of Skeleton Sequences: The 3D trajectories of the joints are projected onto three orthogonal 2D planes, yielding three complementary trajectory images per sequence for the ConvNets to classify.
- Encoding Strategies: The JTMs are further enriched with several encoding schemes (the projection and these encodings are illustrated in the sketch after this list):
  - Directional encoding maps the orientation of each trajectory segment to hue, capturing the direction of motion.
  - Body part segmentation assigns distinct color maps to different body parts so that their trajectories remain distinguishable.
  - Motion magnitude (joint speed) modulates saturation and brightness, so faster movements appear more vivid in the map.
- Late Fusion of ConvNets: One ConvNet is trained on the JTMs of each projection plane, and their class scores are combined at test time to produce the final prediction, exploiting the complementary information of the three views (a fusion sketch follows below).
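These steps can be summarized in a minimal sketch. The array layout (a `T × J × 3` sequence of joint coordinates), the hue/saturation mapping, and the stamping of per-frame points rather than drawn trajectory segments are simplifying assumptions for illustration, and the per-body-part color maps are omitted; this is not the authors' implementation.

```python
# Minimal JTM sketch: project a skeleton sequence onto one 2D plane and
# color each trajectory point by motion direction (hue) and speed
# (saturation/brightness). Shapes and mappings are illustrative assumptions.
import colorsys
import numpy as np

def joint_trajectory_map(skeleton, plane=(0, 1), size=256):
    """skeleton: (T, J, 3) array of 3D joint positions over T frames."""
    # 1) Orthogonal projection: keep two of the three coordinates.
    pts = skeleton[:, :, plane]                          # (T, J, 2)

    # 2) Normalize projected coordinates into image space.
    lo, hi = pts.min(axis=(0, 1)), pts.max(axis=(0, 1))
    pix = ((pts - lo) / (hi - lo + 1e-8) * (size - 1)).astype(int)

    # Per-frame joint displacement gives direction and speed.
    vel = np.diff(pts, axis=0)                           # (T-1, J, 2)
    speed = np.linalg.norm(vel, axis=-1)
    speed = speed / (speed.max() + 1e-8)                 # scale to [0, 1]

    jtm = np.zeros((size, size, 3), dtype=np.float32)
    for t in range(1, skeleton.shape[0]):
        for j in range(skeleton.shape[1]):
            # Directional encoding: motion direction -> hue.
            angle = np.arctan2(vel[t - 1, j, 1], vel[t - 1, j, 0])
            hue = (angle + np.pi) / (2 * np.pi)
            # Motion magnitude: faster joints -> higher saturation/brightness.
            s = v = 0.2 + 0.8 * speed[t - 1, j]
            x, y = pix[t, j]
            jtm[y, x] = colorsys.hsv_to_rgb(hue, s, v)   # stamp trajectory point
    return jtm
```

Calling `joint_trajectory_map` with `plane=(0, 1)`, `(0, 2)`, and `(1, 2)` produces the three orthogonal-view maps that feed the three ConvNets.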
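The late-fusion step can likewise be sketched as combining per-plane class scores. Whether the scores are multiplied or averaged here is an illustrative choice, not a statement of the paper's exact fusion rule.

```python
# Late-fusion sketch: combine the class scores of the ConvNets trained on
# the three projection planes. The product rule below is one common choice.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(per_plane_logits):
    """per_plane_logits: list of (num_classes,) logit vectors, one per plane."""
    scores = np.prod([softmax(l) for l in per_plane_logits], axis=0)
    return int(np.argmax(scores))                        # predicted action class
```

At test time, the three logit vectors produced for a sequence's front, side, and top JTMs would be passed to `fuse_predictions` to obtain the final label.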
Experimental Results
Evaluations on established benchmarks, namely the MSRC-12 Kinect Gesture, G3D, and UTD-MHAD datasets, show that the proposed method achieves state-of-the-art accuracy. Notably:
- On the MSRC-12 dataset, the method achieved 93.12% accuracy, surpassing prior techniques such as ELC-KSVD and Cov3DJ.
- On the G3D and UTD-MHAD datasets, it reached 94.24% and 85.81%, respectively.
Implications and Future Directions
The method's conversion of 3D motion data into 2D imagery makes it possible to apply computer vision techniques developed for static images to dynamic skeleton data. Because JTMs are a compact and information-rich representation, the approach is also attractive for real-time applications such as human-computer interaction, video surveillance, and automated sports analytics, where rapid and precise action identification is critical.
Future research could explore more sophisticated data augmentation strategies, and combining the approach with other temporal modeling techniques, such as attention mechanisms, might improve performance on more complex datasets.
Overall, this paper takes a noteworthy step toward bridging skeletal data representation and convolutional neural networks for action recognition, and points to new ways of handling spatio-temporal dynamics in multimedia signal processing.