Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition (2008.00188v4)

Published 1 Aug 2020 in cs.CV

Abstract: Action recognition via 3D skeleton data has become an important topic in recent years. Most existing methods either extract hand-crafted descriptors or learn action representations by supervised learning paradigms that require massive labeled data. In this paper, we for the first time propose a contrastive action learning paradigm named AS-CAL that can leverage different augmentations of unlabeled skeleton data to learn action representations in an unsupervised manner. Specifically, we first propose to contrast similarity between augmented instances (query and key) of the input skeleton sequence, which are transformed by multiple novel augmentation strategies, to learn inherent action patterns ("pattern-invariance") of different skeleton transformations. Second, to encourage learning the pattern-invariance with more consistent action representations, we propose a momentum LSTM, which is implemented as the momentum-based moving average of the LSTM-based query encoder, to encode long-term action dynamics of the key sequence. Third, we introduce a queue to store the encoded keys, which allows our model to flexibly reuse preceding keys and build a more consistent dictionary to improve contrastive learning. Last, by temporally averaging the hidden states of action learned by the query encoder, a novel representation named Contrastive Action Encoding (CAE) is proposed to represent human actions effectively. Extensive experiments show that our approach typically improves existing hand-crafted methods by 10-50% top-1 accuracy, and it can achieve comparable or even superior performance to numerous supervised learning methods.

Authors (5)

Haocong Rao (14 papers)
Shihao Xu (7 papers)
Xiping Hu (46 papers)
Jun Cheng (108 papers)
Bin Hu (217 papers)

Citations (166)

Summary

The paper "Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition" introduces an unsupervised approach to action recognition from 3D skeleton data. It addresses two limitations of existing skeleton-based methods: heavy reliance on labeled datasets and the difficulty of hand-crafting effective action features.

Overview of the Approach

The paper proposes an unsupervised learning framework named Augmented Skeleton Based Contrastive Action Learning (AS-CAL), which learns action representations from unlabeled skeleton data. The method is built on a contrastive learning paradigm: two augmented instances of the same skeleton sequence (a query and a key) are encoded, and the model is trained to maximize their similarity. The query is encoded by an LSTM and the key by a momentum LSTM (mLSTM). The core contributions are multiple skeleton augmentation strategies, the momentum-based key encoder, and a queue-based dictionary of encoded keys used to compute the contrastive loss.

Technical Components

Data Augmentation Strategies: The paper introduces a set of augmentations tailored to skeleton sequences: random rotation, shear, temporal inversion, Gaussian blur, Gaussian noise, joint masking, and channel masking. Each one transforms the input sequence while preserving the underlying action pattern, so the encoder can learn representations that stay stable across transformations ("pattern-invariance").
Momentum LSTM (mLSTM): The key sequence is encoded by a momentum LSTM, which is implemented as a momentum-based moving average of the LSTM-based query encoder. Because the key encoder changes slowly, it encodes long-term action dynamics consistently across training steps, which keeps the keys used for contrastive learning comparable with one another.
Queue-Based Dictionary: Encoded keys are stored in a queue, so keys from preceding batches can be reused. This yields a larger and more consistent dictionary and increases the number of negative samples available for contrasting similar and dissimilar skeleton sequences.
Contrastive Loss Function: Training is driven by a Noise Contrastive Estimation (NCE)-based loss that separates positive pairs (a query and the key augmented from the same sequence) from negative pairs (the query and other keys in the dictionary). Similarity between representations is measured with the dot product.
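
The four components above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the rotation range, the number of masked joints, the momentum value `m=0.999`, and the temperature `0.07` are all assumptions, and the temperature-scaled softmax form of the loss is the common InfoNCE-style formulation rather than something this summary confirms. Random feature vectors stand in for the outputs of the LSTM encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(seq, max_angle=np.pi / 6):
    """Rotate all joints about the z-axis by one random angle.
    seq has shape (frames, joints, 3). The angle range is an assumption."""
    a = rng.uniform(-max_angle, max_angle)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    return seq @ rot.T

def joint_mask(seq, n_masked=3):
    """Zero out a random subset of joints in every frame."""
    out = seq.copy()
    idx = rng.choice(seq.shape[1], size=n_masked, replace=False)
    out[:, idx, :] = 0.0
    return out

def momentum_update(key_params, query_params, m=0.999):
    """Key encoder parameters track a moving average of the query encoder's."""
    return {name: m * key_params[name] + (1.0 - m) * query_params[name]
            for name in key_params}

def contrastive_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE-style loss with dot-product similarity (assumed form).
    q, k_pos: (batch, dim); queue: (queue_len, dim) of stored negative keys."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (batch, 1)
    l_neg = q @ queue.T                                # (batch, queue_len)
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

def enqueue(queue, keys):
    """Append the newest keys and drop the oldest so the length stays fixed."""
    return np.concatenate([queue, keys], axis=0)[len(keys):]

# Toy run: two augmented views of one sequence, then one loss evaluation
# on random stand-in features (a real model would encode the views with LSTMs).
seq = rng.normal(size=(20, 25, 3))                     # 20 frames, 25 joints
query_view, key_view = random_rotation(seq), joint_mask(seq)

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q = unit(rng.normal(size=(4, 8)))
k = unit(q + 0.1 * rng.normal(size=(4, 8)))            # keys close to queries
queue = unit(rng.normal(size=(16, 8)))
loss = contrastive_loss(q, k, queue)
queue = enqueue(queue, k)                              # reuse keys next step
```

In a training loop under these assumptions, only the query encoder would receive gradients from the loss; the key encoder would be refreshed with `momentum_update` after each step, and the current keys would be pushed into the queue to serve as negatives for later batches.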

Empirical Results

Evaluations on four datasets (NTU RGB+D 60, NTU RGB+D 120, SBU Kinect Interaction, and UWA3D) indicate that the proposed method significantly outperforms conventional hand-crafted methods and is competitive with supervised approaches. Specifically, AS-CAL typically improves on existing hand-crafted methods by 10-50% in top-1 accuracy and achieves performance comparable or even superior to numerous supervised methods, which supports its practical value in scenarios where labeled data is sparse or costly.

Implications and Future Work

The paper opens a direction for unsupervised skeleton-based action recognition, where learning effectively without labels is the central requirement. The proposed augmentations could inspire further transformations applicable to other pattern recognition areas. Integrating pretext tasks into the framework may further improve the learned action semantics. Alternative encoder designs, such as graph convolutional networks (GCNs), are another possible direction for capturing finer spatial-temporal features.

In conclusion, "Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition" shows how contrastive learning over augmented skeleton sequences can be structured to learn discriminative action representations from unlabeled data.
