- The paper introduces P-CNN, a pose-based CNN descriptor that aggregates motion and appearance features from individual human body parts for action recognition.
- It demonstrates that combining static and dynamic (temporal-difference) features under min and max temporal aggregation enhances recognition accuracy on benchmark datasets.
- Experimental results show that P-CNN is markedly more robust to pose estimation errors than high-level pose features and effectively complements Dense Trajectory features.
Pose-based CNN Features for Action Recognition
The paper presents an approach to human action recognition in video based on Pose-based Convolutional Neural Network descriptors (P-CNN). It emphasizes using human pose to enhance the representation of actions, arguing that existing methods built on local motion descriptors may not sufficiently capture the fine-grained variations that distinguish similar actions.
Core Contributions
- P-CNN Development: The authors introduce P-CNN, which combines motion and appearance features extracted from patches around individual human body parts, localized by poses tracked over time. Processing each part with a CNN yields a more structured spatial and temporal representation of actions (a minimal sketch of the per-part extraction follows this list).
- Temporal Aggregation Schemes: The paper explores schemes for aggregating per-frame CNN features over time, applying max and min aggregation to both static features and dynamic (temporal-difference) features; experiments show that combining these strategies improves recognition performance (see the aggregation sketch after this list).
- Pose Estimation and Evaluation: The authors build P-CNN features from both automatically estimated and manually annotated human poses, and evaluate on two challenging datasets, JHMDB and MPII Cooking, demonstrating consistent improvements over prior state-of-the-art methods.
- Complementary Effect: P-CNN features are shown to be complementary to Dense Trajectory features; combining the two significantly advances the state of the art on the evaluated datasets (a minimal late-fusion sketch follows this list).
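
To make the per-part pipeline concrete, here is a minimal sketch of feature extraction on one frame. It is a sketch under assumptions, not the authors' code: `cnn_features` replaces a real CNN forward pass with a fixed random projection so the example runs, the part list and patch size only approximate the paper's setup, and `crop_part`, `frame_descriptors`, and the pose format are hypothetical helpers.

```python
import numpy as np

# Part set follows the paper's spirit (body parts plus the full image).
# PATCH and FEAT_DIM are deliberately small so the sketch runs quickly;
# a real pipeline would use CNN-sized inputs (e.g., 224x224) and fc-layer features.
PARTS = ["full_body", "upper_body", "left_hand", "right_hand", "full_image"]
PATCH, FEAT_DIM = 32, 128

rng = np.random.default_rng(0)
_proj = {}  # lazily created random projections standing in for CNN forward passes

def cnn_features(patch):
    """Placeholder for a CNN forward pass on one patch (appearance or flow).
    A fixed random projection keeps the sketch self-contained and runnable."""
    flat = patch.reshape(-1).astype(np.float32)
    if flat.size not in _proj:
        _proj[flat.size] = (rng.standard_normal((flat.size, FEAT_DIM))
                            .astype(np.float32) / np.sqrt(flat.size))
    return flat @ _proj[flat.size]

def crop_part(frame, center, size=PATCH):
    """Square crop around a part position (y, x), zero-padded at image borders."""
    h, w = frame.shape[:2]
    half = size // 2
    y, x = int(center[0]), int(center[1])
    out = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)
    y0, y1 = max(y - half, 0), min(y + half, h)
    x0, x1 = max(x - half, 0), min(x + half, w)
    out[y0 - (y - half):y1 - (y - half), x0 - (x - half):x1 - (x - half)] = frame[y0:y1, x0:x1]
    return out

def frame_descriptors(rgb, flow, pose):
    """Per-frame, per-part (appearance, motion) feature pairs.
    `pose` maps part name -> (y, x) center; a real system would derive these
    from estimated or annotated joints. `flow` is a 2-channel optical flow
    field aligned with `rgb`."""
    feats = {}
    for part in PARTS:
        if part == "full_image":
            app, mot = rgb[:PATCH, :PATCH], flow[:PATCH, :PATCH]  # crude stand-in for a full-frame resize
        else:
            app, mot = crop_part(rgb, pose[part]), crop_part(flow, pose[part])
        feats[part] = (cnn_features(app), cnn_features(mot))
    return feats

# Example on synthetic data: one 240x320 frame, all parts centered arbitrarily.
rgb = rng.random((240, 320, 3)).astype(np.float32)
flow = rng.random((240, 320, 2)).astype(np.float32)
pose = {p: (120, 160) for p in PARTS}
print(frame_descriptors(rgb, flow, pose)["right_hand"][0].shape)  # (FEAT_DIM,)
```

The same crop-then-describe step applies to both RGB frames and optical flow fields, which is what gives each part paired appearance and motion features.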
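
The min/max aggregation over static and dynamic (temporal-difference) features described above can be written compactly; the offset `dt` and the L2 normalization below are assumptions rather than the paper's exact settings.

```python
import numpy as np

def aggregate(frame_feats, dt=4):
    """Min/max temporal aggregation over static and dynamic features.
    `frame_feats` is a (T, D) array of per-frame CNN features for one body
    part and one modality; requires T > dt."""
    static = frame_feats
    dynamic = frame_feats[dt:] - frame_feats[:-dt]  # temporal differences
    v = np.concatenate([static.max(axis=0), static.min(axis=0),
                        dynamic.max(axis=0), dynamic.min(axis=0)])
    return v / (np.linalg.norm(v) + 1e-12)  # L2-normalize the video descriptor

# 40 frames of 128-D features for one part/modality -> one (4 * 128,) descriptor.
feats = np.random.default_rng(1).standard_normal((40, 128)).astype(np.float32)
print(aggregate(feats).shape)  # (512,)
```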
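
For the complementary effect, one simple way to combine descriptor types is late fusion of per-feature classifiers. The sketch below trains a linear SVM per feature type and averages decision scores; score averaging is an illustrative choice, not claimed to be the paper's exact fusion rule, and the data here is synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC

def late_fusion_predict(pcnn_tr, idt_tr, y_tr, pcnn_te, idt_te):
    """Train one linear SVM per feature type and average their decision
    scores (assumes more than two action classes, so decision_function
    returns one score per class)."""
    scores = []
    for X_tr, X_te in [(pcnn_tr, pcnn_te), (idt_tr, idt_te)]:
        clf = LinearSVC(C=1.0).fit(X_tr, y_tr)
        scores.append(clf.decision_function(X_te))
    return np.mean(scores, axis=0).argmax(axis=1)  # predicted class per video

# Synthetic example: 60 training / 20 test videos, 5 classes, two feature types.
rng = np.random.default_rng(0)
preds = late_fusion_predict(rng.random((60, 256)), rng.random((60, 512)),
                            rng.integers(0, 5, 60),
                            rng.random((20, 256)), rng.random((20, 512)))
print(preds.shape)  # (20,)
```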
Key Findings
- P-CNN matches or outperforms High-Level Pose Features (HLPF), and it is markedly more robust to pose estimation errors: it maintains high recognition accuracy even with automatically estimated poses, where HLPF degrades sharply.
- Computing CNN features for individual body parts, such as the hands and upper body, captures distinct motion and appearance cues that help differentiate subtle action variations.
Implications
This work provides significant insight into leveraging pose information for action recognition and establishes a pathway for incorporating structured spatial and temporal configurations into model architectures. The promise of enhanced action discrimination through pose-based features underscores the potential of these methods in broader computer vision applications, such as surveillance and human-computer interaction.
Future Directions
- Fine-tuning for Parts: Adapting the CNNs to each body part, for example through fine-tuning, could make feature extraction more part-specific.
- Temporal Modeling with RNNs: Modeling the temporal evolution of per-frame features with recurrent neural networks (RNNs) could capture temporal structure beyond min/max aggregation, potentially improving performance further (a hypothetical sketch follows this list).
- Pose Estimation Advances: As pose estimation technologies advance, integrating more accurate pose information can further boost the effectiveness of pose-based action recognition methods.
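
As a sketch of the RNN direction, the model below runs a GRU over per-frame part descriptors and classifies a video from the final hidden state. It is purely hypothetical and not part of the paper; the dimensions (4096-D features, 21 classes as in JHMDB) are illustrative.

```python
import torch
import torch.nn as nn

class PartGRU(nn.Module):
    """Hypothetical extension (not in the paper): a GRU over per-frame part
    descriptors, replacing min/max aggregation with learned temporal modeling."""
    def __init__(self, feat_dim=4096, hidden=512, n_classes=21):  # 21 classes as in JHMDB
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, T, feat_dim) per-frame features
        _, h = self.gru(x)             # h: (1, batch, hidden), final hidden state
        return self.cls(h.squeeze(0))  # class scores per video

# Example: 2 videos of 40 frames with 4096-D per-frame descriptors.
model = PartGRU()
print(model(torch.randn(2, 40, 4096)).shape)  # torch.Size([2, 21])
```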
In conclusion, the introduction of P-CNN represents a substantial contribution to the field, offering new insights into the utilization of human pose for enhancing action recognition performance. The findings and methodologies introduced in this paper can serve as a valuable resource for researchers aiming to improve video-based action recognition systems.