- The paper introduces a novel bi-directional GRU encoder-decoder that uses fixed-weight and fixed-state decoder strategies to strengthen the encoder's feature extraction for action recognition.
- It demonstrates superior performance over existing unsupervised and even some supervised methods on the NW-UCLA, UWA3D, and NTU RGB+D datasets, especially under cross-view evaluation.
- The approach offers a scalable solution for recognizing human actions using only skeleton keypoints, reducing the complexity of RGB+D data collection.
Overview of Unsupervised Skeleton-Based Action Recognition System
The paper introduces PREDICT & CLUSTER (P&C), a novel system for unsupervised skeleton-based human action recognition that leverages an encoder-decoder recurrent neural network (RNN) architecture. The system is notable for eschewing labeled data entirely, using only body keypoints as input, which significantly reduces the complexity of action recognition setups that require RGB+D data.
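To make the input format concrete, the sketch below arranges a skeleton sequence as the per-frame vectors an RNN encoder would consume; the frame and joint counts are illustrative assumptions (NTU RGB+D, for instance, provides 25 joints per skeleton), not values prescribed by the paper.

```python
import numpy as np

# Assumed shapes: T frames, J joints, 3 coordinates per joint.
T, J = 64, 25
sequence = np.random.randn(T, J, 3).astype(np.float32)

# Flatten each frame into one joint-coordinate vector, giving the
# (T, J*3) sequence an RNN encoder consumes step by step.
inputs = sequence.reshape(T, J * 3)
print(inputs.shape)  # (64, 75)
```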
System Architecture and Training Methodology
The core innovation in this work is a bi-directional GRU encoder-decoder. The encoder processes sequences of body-joint coordinates and aggregates the learned representation in its final hidden state; the decoder, tasked with regenerating the input sequence, is initialized from that state. The key advance lies in the training strategy: rather than refining the decoder's predictive capabilities, the authors intentionally weaken it (the Fixed Weights and Fixed States strategies), so that the regeneration loss can only be reduced by improving the encoder's feature representations.
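A minimal PyTorch sketch of this idea follows; the layer sizes, zero-input decoding, and exact wiring are plausible assumptions rather than the paper's reported configuration. The point is the Fixed Weights strategy: freezing the decoder forces the regeneration loss to be minimized through the encoder alone.

```python
import torch
import torch.nn as nn

class SeqAutoEncoder(nn.Module):
    """Sketch: bi-directional GRU encoder, GRU decoder regenerating the input."""

    def __init__(self, input_dim=75, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(input_dim, 2 * hidden_dim, batch_first=True)
        self.readout = nn.Linear(2 * hidden_dim, input_dim)

    def forward(self, x):                        # x: (B, T, input_dim)
        _, h = self.encoder(x)                   # h: (2, B, hidden_dim)
        state = torch.cat([h[0], h[1]], dim=-1)  # concat directions -> (B, 2*hidden_dim)
        # Decode from zero inputs so the final state must carry the whole sequence.
        out, _ = self.decoder(torch.zeros_like(x), state.unsqueeze(0))
        return self.readout(out), state

model = SeqAutoEncoder()

# Fixed Weights strategy: freeze the decoder (e.g. at its random initialization)
# so gradients from the regeneration loss only update the encoder.
for p in model.decoder.parameters():
    p.requires_grad = False
```

The Fixed States variant instead holds the decoder's hidden state at the encoder's final state throughout decoding rather than letting it evolve; either way, the decoder cannot compensate for a weak encoding.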
These strategies steer training toward representations that cluster actions without supervision: because the decoder is constrained, optimization concentrates on the encoder's final-state representation. This is complemented by a feature-level auto-encoder that compresses the high-dimensional feature vectors, improving class separability before a K-nearest-neighbors classifier is applied.
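The sketch below illustrates this evaluation pipeline under assumed dimensions (2048-d encoder states, 256-d codes; the helper `to_code` is hypothetical): an auto-encoder compresses the states, and a K-nearest-neighbors classifier, which sees labels only at evaluation time, scores the compact codes.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

# Assumed sizes: 2048-d encoder states compressed to 256-d codes.
feature_ae = nn.Sequential(
    nn.Linear(2048, 256), nn.ReLU(),  # compression half -> compact code
    nn.Linear(256, 2048),             # reconstruction half (trained with an L2 loss)
)

def to_code(states: torch.Tensor) -> np.ndarray:
    """Hypothetical helper: map encoder states through the compression half."""
    with torch.no_grad():
        return feature_ae[0](states).numpy()

# Dummy stand-ins for states produced by a trained encoder.
train_states, test_states = torch.randn(100, 2048), torch.randn(20, 2048)
train_labels = np.random.randint(0, 10, size=100)

# Labels enter only here, to evaluate the unsupervised features.
knn = KNeighborsClassifier(n_neighbors=1).fit(to_code(train_states), train_labels)
predictions = knn.predict(to_code(test_states))
```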
Experimental Results
The P&C system is evaluated on three extensive skeleton-based datasets (NW-UCLA, UWA3D, and NTU RGB+D) that span a wide range of subject counts and action classes. It outperforms existing unsupervised skeleton-based methods and unsupervised RGB+D approaches, especially under cross-view evaluation, and remains competitive with several supervised skeleton-based models on the large-scale NTU RGB+D benchmark.
Implications and Future Directions
The implications of P&C are multifaceted. Practically, it offers a scalable solution for action recognition when labeled datasets are impractical to curate. Theoretically, its architecture and training strategies suggest new pathways for feature representation learning in unsupervised settings.
Looking forward, there is potential to further unpack the self-organization mechanisms that emerge in RNNs under these constraints, and to extend the approach beyond skeleton-based action recognition to other sequence classification tasks. Continued exploration in this direction could yield more general unsupervised learning frameworks that operate with minimal preconditions, requiring no labels or auxiliary data modalities.