- The paper presents a hierarchical CNN model that learns global co-occurrence features by repositioning joint dimensions as channels.
- It incorporates a two-stream paradigm to explicitly model temporal dynamics using raw skeleton coordinates and their differences.
- Experimental results on benchmark datasets demonstrate significant accuracy improvements over RNN/LSTM-based methods in action recognition and detection.
Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation
The paper presents a novel framework for skeleton-based human action recognition and detection, employing a convolutional neural network (CNN) model to leverage the co-occurrence features inherent in skeletal data. The authors propose a hierarchical aggregation approach that enhances both intra-frame and inter-frame representations, with applications in intelligent surveillance, human-computer interaction, and robotics.
Methodology
The core contribution is a CNN-based framework that learns hierarchical co-occurrence features from skeleton sequences, in contrast to traditional RNN/LSTM-based methods, which struggle to extract high-level features directly. The proposed method rests on four main aspects (a minimal code sketch follows the list):
- Global Co-occurrence Feature Learning: Unlike other CNN-based methods that aggregate only local co-occurrence features, this framework repositions the joint dimension into channels, allowing convolutional layers to capture interactions across all joints globally. This is crucial for recognizing actions that involve complex joint interactions.
- Temporal Motion Encoding: By introducing a two-stream paradigm, the framework explicitly incorporates temporal joint movements using raw skeleton coordinates and their temporal differences. The authors argue that this explicit modeling of dynamic motion cues enhances recognition performance.
- Hierarchical Aggregation: The network architecture, termed the Hierarchical Co-occurrence Network (HCN), is designed to gradually build from point-level features to sophisticated co-occurrence features, facilitating the CNN’s ability to encode spatial-temporal dynamics effectively.
- Multi-Person Scalability: The framework offers strategies to handle multi-person scenarios, employing late fusion techniques such as element-wise maximum operations to integrate features from multiple subjects efficiently. This flexibility extends the framework’s applicability to interactive human actions.
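To make these design points concrete, below is a minimal PyTorch sketch, assuming illustrative layer widths, kernel sizes, and class names (`CoOccurrenceBranch`, `HCNSketch`) that are not taken from the paper. It shows the joint-to-channel transpose that enables global co-occurrence learning, the two-stream use of raw coordinates and their temporal differences, and element-wise max fusion across persons; it is a sketch of the idea, not the authors' exact architecture.

```python
# Minimal HCN-style sketch. Layer sizes and kernel shapes are illustrative
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class CoOccurrenceBranch(nn.Module):
    """Per-stream feature extractor: point-level convs, then a
    joint-to-channel transpose that enables global co-occurrence learning."""
    def __init__(self, in_channels=3, num_joints=25):
        super().__init__()
        # Point-level: 1x1 conv mixes the coordinates of each joint independently.
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        # Temporal conv, still per joint (kernel spans time only).
        self.conv2 = nn.Conv2d(64, 32, kernel_size=(3, 1), padding=(1, 0))
        # After transposing joints into the channel axis, these convs see all
        # joints at once and can learn global co-occurrences.
        self.conv3 = nn.Conv2d(num_joints, 32, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        # x: (N, C=3, T, V) -- batch, coordinates, frames, joints
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))            # (N, 32, T, V)
        x = x.permute(0, 3, 2, 1).contiguous()   # (N, V, T, 32): joints -> channels
        x = self.pool(torch.relu(self.conv3(x)))
        x = self.pool(torch.relu(self.conv4(x)))
        return x

class HCNSketch(nn.Module):
    def __init__(self, num_classes=60, num_joints=25):
        super().__init__()
        self.pos_branch = CoOccurrenceBranch(3, num_joints)     # raw coordinates
        self.motion_branch = CoOccurrenceBranch(3, num_joints)  # temporal differences
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes))

    def forward(self, x):
        # x: (N, M, 3, T, V) -- M persons per sample
        n, m = x.shape[:2]
        x = x.flatten(0, 1)                    # treat persons as batch items
        # Motion stream: frame-wise temporal differences, zero-padded at t=0.
        motion = x[:, :, 1:] - x[:, :, :-1]
        motion = nn.functional.pad(motion, (0, 0, 1, 0))
        feat = torch.cat([self.pos_branch(x), self.motion_branch(motion)], dim=1)
        # Late multi-person fusion: element-wise maximum across persons.
        feat = feat.view(n, m, *feat.shape[1:]).max(dim=1).values
        return self.head(feat)
```

Under these assumptions, `HCNSketch()(torch.randn(4, 2, 3, 32, 25))` (a batch of 4 samples, 2 persons, 3D coordinates, 32 frames, 25 joints) returns a `(4, 60)` logit tensor.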
Experimental Results
The HCN framework was evaluated on several benchmark datasets, including NTU RGB+D, SBU Kinect Interaction, and PKU-MMD, consistently outperforming existing state-of-the-art methods. Notably:
- On NTU RGB+D, the proposed method achieved significant improvements, including a 7.3% accuracy gain in the cross-subject setting over LSTM-based approaches.
- On SBU Kinect Interaction, the HCN delivered a substantial 8.2% increase over RNN-based methods, showing that the approach remains effective on smaller datasets.
- In temporal action detection on PKU-MMD, the framework achieved higher mean average precision (mAP), underscoring its ability to generalize across tasks; a rough illustration of window-based detection follows this list.
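The paper's exact detection pipeline is not reproduced here. As a rough, hypothetical illustration of how a clip-level classifier such as the sketch above could be scanned over an untrimmed skeleton sequence, the following applies it in sliding windows and thresholds the top class probability; the window length, stride, and threshold are arbitrary assumptions, not values from the paper.

```python
# Hypothetical sliding-window wrapper for temporal action detection with a
# clip-level classifier; win, stride, and thresh are illustrative assumptions.
import torch

def detect_actions(model, seq, win=32, stride=8, thresh=0.9):
    # seq: (M, 3, T, V) -- one untrimmed skeleton sequence
    model.eval()
    detections = []
    T = seq.shape[2]
    with torch.no_grad():
        for start in range(0, max(T - win, 0) + 1, stride):
            clip = seq[:, :, start:start + win].unsqueeze(0)  # add batch dim
            probs = torch.softmax(model(clip), dim=1)[0]
            score, label = probs.max(dim=0)
            if score.item() >= thresh:
                detections.append((start, start + win, label.item(), score.item()))
    return detections  # (t_start, t_end, class, confidence) tuples
```

In practice, overlapping windows firing on the same action would then be merged, e.g. with non-maximum suppression over the window scores.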
Implications and Future Work
The proposed HCN framework demonstrates a robust approach to skeleton-based action recognition and detection, offering key insights into the value of global co-occurrence feature learning. It suggests that CNNs designed to capture joint interactions comprehensively can substantially improve recognition performance.
The hierarchical aggregation and explicit temporal modeling open avenues for further exploration in adapting CNNs to inherently sequential tasks like action recognition. Future directions could involve integrating this framework with multi-modal data or extending it to more complex multi-person interactions to further improve adaptability and comprehensiveness in action modeling.
In summary, the presented work underscores the viability of using CNNs for effective skeleton-based action recognition, challenging conventional reliance on sequential models and setting a new precedent for future research in this domain.