- The paper introduces a novel bi-directional GRU encoder-decoder that uses fixed-weight and fixed-state decoder strategies to strengthen the encoder's feature extraction for action recognition.
- It demonstrates superior performance over existing unsupervised and even some supervised methods on the NW-UCLA, UWA3D, and NTU RGB+D datasets, especially under cross-view evaluation.
- The approach offers a scalable solution for recognizing human actions using only skeleton keypoints, reducing the complexity of RGB+D data collection.
Overview of Unsupervised Skeleton-Based Action Recognition System
The paper introduces PREDICT & CLUSTER (P&C), a novel system for unsupervised skeleton-based human action recognition that leverages an encoder-decoder recurrent neural network (RNN) architecture. The system is notable for eschewing labeled data entirely, using only body keypoints as input, which significantly reduces the complexity of action recognition setups that require RGB+D data.
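To make the input format concrete, the sketch below arranges a skeleton sequence as the per-frame vectors an RNN encoder would consume; the frame and joint counts are illustrative assumptions (NTU RGB+D, for instance, provides 25 joints per skeleton), not values prescribed by the paper.

```python
import numpy as np

# Assumed shapes: T frames, J joints, 3 coordinates per joint.
T, J = 64, 25
sequence = np.random.randn(T, J, 3).astype(np.float32)

# Flatten each frame into one joint-coordinate vector, giving the
# (T, J*3) sequence an RNN encoder consumes step by step.
inputs = sequence.reshape(T, J * 3)
print(inputs.shape)  # (64, 75)
```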
System Architecture and Training Methodology
The core innovation in this work is a bi-directional GRU encoder-decoder. The encoder processes sequences of body-joint coordinates and aggregates the learned representation in its final hidden state; the decoder, tasked with regenerating the input sequence, is initialized from that state. The key advance lies in the training strategy: rather than refining the decoder's predictive capabilities, the authors intentionally weaken it (the Fixed Weights and Fixed States strategies), so that the regeneration loss can only be reduced by improving the encoder's feature representations.
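A minimal PyTorch sketch of this idea follows; the layer sizes, zero-input decoding, and exact wiring are plausible assumptions rather than the paper's reported configuration. The point is the Fixed Weights strategy: freezing the decoder forces the regeneration loss to be minimized through the encoder alone.

```python
import torch
import torch.nn as nn

class SeqAutoEncoder(nn.Module):
    """Sketch: bi-directional GRU encoder, GRU decoder regenerating the input."""

    def __init__(self, input_dim=75, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(input_dim, 2 * hidden_dim, batch_first=True)
        self.readout = nn.Linear(2 * hidden_dim, input_dim)

    def forward(self, x):                        # x: (B, T, input_dim)
        _, h = self.encoder(x)                   # h: (2, B, hidden_dim)
        state = torch.cat([h[0], h[1]], dim=-1)  # concat directions -> (B, 2*hidden_dim)
        # Decode from zero inputs so the final state must carry the whole sequence.
        out, _ = self.decoder(torch.zeros_like(x), state.unsqueeze(0))
        return self.readout(out), state

model = SeqAutoEncoder()

# Fixed Weights strategy: freeze the decoder (e.g. at its random initialization)
# so gradients from the regeneration loss only update the encoder.
for p in model.decoder.parameters():
    p.requires_grad = False
```

The Fixed States variant instead holds the decoder's hidden state at the encoder's final state throughout decoding rather than letting it evolve; either way, the decoder cannot compensate for a weak encoding.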
These strategies steer training toward representations that cluster actions without supervision: because the decoder is constrained, optimization concentrates on the encoder's final-state representation. This is complemented by a feature-level auto-encoder that compresses the high-dimensional feature vectors, improving class separability before a K-nearest-neighbors classifier is applied.
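The sketch below illustrates this evaluation pipeline under assumed dimensions (2048-d encoder states, 256-d codes; the helper `to_code` is hypothetical): an auto-encoder compresses the states, and a K-nearest-neighbors classifier, which sees labels only at evaluation time, scores the compact codes.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

# Assumed sizes: 2048-d encoder states compressed to 256-d codes.
feature_ae = nn.Sequential(
    nn.Linear(2048, 256), nn.ReLU(),  # compression half -> compact code
    nn.Linear(256, 2048),             # reconstruction half (trained with an L2 loss)
)

def to_code(states: torch.Tensor) -> np.ndarray:
    """Hypothetical helper: map encoder states through the compression half."""
    with torch.no_grad():
        return feature_ae[0](states).numpy()

# Dummy stand-ins for states produced by a trained encoder.
train_states, test_states = torch.randn(100, 2048), torch.randn(20, 2048)
train_labels = np.random.randint(0, 10, size=100)

# Labels enter only here, to evaluate the unsupervised features.
knn = KNeighborsClassifier(n_neighbors=1).fit(to_code(train_states), train_labels)
predictions = knn.predict(to_code(test_states))
```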
Experimental Results
The P&C system is evaluated on three extensive skeleton-based datasets (NW-UCLA, UWA3D, and NTU RGB+D) that span a wide range of subject counts and action classes. It outperforms existing unsupervised skeleton-based methods and unsupervised RGB+D approaches, especially under cross-view evaluation, and remains competitive with several supervised skeleton-based models on the large-scale NTU RGB+D benchmark.
Implications and Future Directions
The implications of P&C are multifaceted. Practically, it offers a scalable solution for action recognition when labeled datasets are impractical to curate. Theoretically, its architecture and training strategies suggest new pathways for feature representation learning in unsupervised settings.
Looking forward, there is potential to further unpack the self-organization mechanisms that emerge in RNNs under these constraints, and to extend the approach beyond skeleton-based action recognition to other sequence classification tasks. Continued exploration in this direction could yield more general unsupervised learning frameworks that operate with minimal preconditions, requiring no labels or auxiliary data modalities.