- The paper presents a hierarchical CNN model that learns global co-occurrence features by repositioning joint dimensions as channels.
- It incorporates a two-stream paradigm to explicitly model temporal dynamics using raw skeleton coordinates and their differences.
- Experimental results on benchmark datasets demonstrate significant accuracy improvements over RNN/LSTM-based methods in action recognition and detection.
Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation
The paper presents a novel framework for skeleton-based human action recognition and detection, employing a convolutional neural network (CNN) model to leverage the co-occurrence features inherent in skeletal data. The authors propose a hierarchical aggregation approach that enhances both intra-frame and inter-frame representations, with applications in intelligent surveillance, human-computer interaction, and robotics.
Methodology
The core contribution is a CNN-based framework that learns hierarchical co-occurrence features from skeleton sequences, in contrast to traditional RNN/LSTM-based methods, which struggle to extract high-level features directly. The proposed method rests on four main aspects (a minimal code sketch follows the list):
- Global Co-occurrence Feature Learning: Unlike other CNN-based methods that aggregate only local co-occurrence features, this framework repositions the joint dimension into channels, allowing convolutional layers to capture interactions across all joints globally. This is crucial for recognizing actions that involve complex joint interactions.
- Temporal Motion Encoding: By introducing a two-stream paradigm, the framework explicitly incorporates temporal joint movements using raw skeleton coordinates and their temporal differences. The authors argue that this explicit modeling of dynamic motion cues enhances recognition performance.
- Hierarchical Aggregation: The network architecture, termed the Hierarchical Co-occurrence Network (HCN), is designed to gradually build from point-level features to sophisticated co-occurrence features, facilitating the CNN’s ability to encode spatial-temporal dynamics effectively.
- Multi-Person Scalability: The framework offers strategies to handle multi-person scenarios, employing late fusion techniques such as element-wise maximum operations to integrate features from multiple subjects efficiently. This flexibility extends the framework’s applicability to interactive human actions.
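To make these design points concrete, below is a minimal PyTorch sketch, assuming illustrative layer widths, kernel sizes, and class names (`CoOccurrenceBranch`, `HCNSketch`) that are not taken from the paper. It shows the joint-to-channel transpose that enables global co-occurrence learning, the two-stream use of raw coordinates and their temporal differences, and element-wise max fusion across persons; it is a sketch of the idea, not the authors' exact architecture.

```python
# Minimal HCN-style sketch. Layer sizes and kernel shapes are illustrative
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class CoOccurrenceBranch(nn.Module):
    """Per-stream feature extractor: point-level convs, then a
    joint-to-channel transpose that enables global co-occurrence learning."""
    def __init__(self, in_channels=3, num_joints=25):
        super().__init__()
        # Point-level: 1x1 conv mixes the coordinates of each joint independently.
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        # Temporal conv, still per joint (kernel spans time only).
        self.conv2 = nn.Conv2d(64, 32, kernel_size=(3, 1), padding=(1, 0))
        # After transposing joints into the channel axis, these convs see all
        # joints at once and can learn global co-occurrences.
        self.conv3 = nn.Conv2d(num_joints, 32, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        # x: (N, C=3, T, V) -- batch, coordinates, frames, joints
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))            # (N, 32, T, V)
        x = x.permute(0, 3, 2, 1).contiguous()   # (N, V, T, 32): joints -> channels
        x = self.pool(torch.relu(self.conv3(x)))
        x = self.pool(torch.relu(self.conv4(x)))
        return x

class HCNSketch(nn.Module):
    def __init__(self, num_classes=60, num_joints=25):
        super().__init__()
        self.pos_branch = CoOccurrenceBranch(3, num_joints)     # raw coordinates
        self.motion_branch = CoOccurrenceBranch(3, num_joints)  # temporal differences
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes))

    def forward(self, x):
        # x: (N, M, 3, T, V) -- M persons per sample
        n, m = x.shape[:2]
        x = x.flatten(0, 1)                    # treat persons as batch items
        # Motion stream: frame-wise temporal differences, zero-padded at t=0.
        motion = x[:, :, 1:] - x[:, :, :-1]
        motion = nn.functional.pad(motion, (0, 0, 1, 0))
        feat = torch.cat([self.pos_branch(x), self.motion_branch(motion)], dim=1)
        # Late multi-person fusion: element-wise maximum across persons.
        feat = feat.view(n, m, *feat.shape[1:]).max(dim=1).values
        return self.head(feat)
```

Under these assumptions, `HCNSketch()(torch.randn(4, 2, 3, 32, 25))` (a batch of 4 samples, 2 persons, 3D coordinates, 32 frames, 25 joints) returns a `(4, 60)` logit tensor.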
Experimental Results
The HCN framework was evaluated on several benchmark datasets, including NTU RGB+D, SBU Kinect Interaction, and PKU-MMD, consistently outperforming existing state-of-the-art methods. Notably:
- On NTU RGB+D, the proposed method achieved significant improvements, including a 7.3% accuracy gain in the cross-subject setting over LSTM-based approaches.
- On SBU Kinect Interaction, the HCN delivered a substantial 8.2% increase over RNN-based methods, showing that the approach remains effective on smaller datasets.
- In temporal action detection on PKU-MMD, the framework achieved higher mean average precision (mAP), underscoring its ability to generalize across tasks; a rough illustration of window-based detection follows this list.
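The paper's exact detection pipeline is not reproduced here. As a rough, hypothetical illustration of how a clip-level classifier such as the sketch above could be scanned over an untrimmed skeleton sequence, the following applies it in sliding windows and thresholds the top class probability; the window length, stride, and threshold are arbitrary assumptions, not values from the paper.

```python
# Hypothetical sliding-window wrapper for temporal action detection with a
# clip-level classifier; win, stride, and thresh are illustrative assumptions.
import torch

def detect_actions(model, seq, win=32, stride=8, thresh=0.9):
    # seq: (M, 3, T, V) -- one untrimmed skeleton sequence
    model.eval()
    detections = []
    T = seq.shape[2]
    with torch.no_grad():
        for start in range(0, max(T - win, 0) + 1, stride):
            clip = seq[:, :, start:start + win].unsqueeze(0)  # add batch dim
            probs = torch.softmax(model(clip), dim=1)[0]
            score, label = probs.max(dim=0)
            if score.item() >= thresh:
                detections.append((start, start + win, label.item(), score.item()))
    return detections  # (t_start, t_end, class, confidence) tuples
```

In practice, overlapping windows firing on the same action would then be merged, e.g. with non-maximum suppression over the window scores.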
Implications and Future Work
The proposed HCN framework demonstrates a robust approach to skeleton-based action recognition and detection, offering key insights into the value of global co-occurrence feature learning. It suggests that CNNs designed to capture joint interactions comprehensively can substantially improve recognition performance.
The hierarchical aggregation and explicit temporal modeling open avenues for further exploration in adapting CNNs to inherently sequential tasks like action recognition. Future directions could involve integrating this framework with multi-modal data or extending it to more complex multi-person interactions to further improve adaptability and comprehensiveness in action modeling.
In summary, the presented work underscores the viability of using CNNs for effective skeleton-based action recognition, challenging conventional reliance on sequential models and setting a new precedent for future research in this domain.