Multi-Task Self-Supervised Learning for Skeleton-Based Action Recognition
The paper "MS2L: Multi-Task Self-Supervised Learning for Skeleton-Based Action Recognition" presents a novel approach designed to advance the proficiency of self-supervised learning methods to extract meaningful features from skeletal data for action recognition tasks. Conventional methods in this domain generally rely on learning feature representations through a singular reconstruction task, which may lead to overfitting and limited generalizability. This paper addresses these challenges through the integration of multiple self-supervised tasks to enhance the robustness and diversity of learned representations.
Methodological Overview
This research introduces a multi-task self-supervised learning framework, named MS2L, that combines motion prediction, jigsaw puzzle recognition, and contrastive learning. Each task contributes a distinct learning signal to a shared feature extractor (a combined sketch follows this list):
- Motion Prediction: The model forecasts future skeleton frames from past motion data, which forces it to capture the temporal dynamics of an action and provides rich context for understanding movement sequences.
- Jigsaw Puzzle Recognition: The temporal segments of a skeleton sequence are shuffled, and the model must identify the permutation that was applied, i.e., recover the correct order. This task teaches temporal relationships that are crucial for accurate action recognition.
- Contrastive Learning: Positive pairs are generated by applying transformations to the same skeleton sequence, while other sequences serve as negatives; pulling positives together and pushing negatives apart structures the feature space and encourages transformation-invariant representations.
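The sketch below illustrates how these three objectives could share a single encoder. It is a minimal PyTorch sketch under assumed design choices (a GRU encoder, 75 input features per frame for 25 joints × 3 coordinates, equal loss weights, and an NT-Xent contrastive loss); it is not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSSL(nn.Module):
    """Shared skeleton encoder with three self-supervised heads: motion
    prediction, jigsaw (permutation) classification, and a contrastive
    projection."""

    def __init__(self, joint_dim=75, hidden=256, num_permutations=6, proj_dim=128):
        super().__init__()
        # Shared sequence encoder over skeleton frames (joint_dim values per frame).
        self.encoder = nn.GRU(joint_dim, hidden, batch_first=True)
        # Head 1: predict the next frame's joint coordinates (motion prediction).
        self.motion_head = nn.Linear(hidden, joint_dim)
        # Head 2: classify which temporal permutation was applied (jigsaw).
        self.jigsaw_head = nn.Linear(hidden, num_permutations)
        # Head 3: project features into an embedding space for contrastive learning.
        self.contrastive_head = nn.Linear(hidden, proj_dim)

    def forward(self, seq):
        # seq: (batch, time, joint_dim)
        out, _ = self.encoder(seq)
        feat = out[:, -1]                                   # last hidden state as clip feature
        return (self.motion_head(out),                      # per-step next-frame prediction
                self.jigsaw_head(feat),                     # permutation logits
                F.normalize(self.contrastive_head(feat), dim=-1))  # unit-norm embedding


def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss between two augmented views (both L2-normalized)."""
    z = torch.cat([z1, z2], dim=0)                # (2B, D)
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b),  # positive of row i is row i + B ...
                         torch.arange(0, b)])     # ... and vice versa
    return F.cross_entropy(sim, targets)


def total_loss(model, seq, seq_shuffled, perm_label, seq_aug):
    """Equal-weight combination of the three objectives (the real weighting may differ).
    seq_shuffled is seq with its temporal segments permuted according to perm_label;
    seq_aug is a transformed (augmented) view of seq."""
    pred, _, z1 = model(seq)
    _, perm_logits, _ = model(seq_shuffled)
    _, _, z2 = model(seq_aug)
    l_motion = F.mse_loss(pred[:, :-1], seq[:, 1:])        # predict frame t+1 from frames up to t
    l_jigsaw = F.cross_entropy(perm_logits, perm_label)    # recognize the applied permutation
    l_contrast = nt_xent(z1, z2)                           # pull augmented views together
    return l_motion + l_jigsaw + l_contrast
```

Because all three heads backpropagate through the same encoder, the gradients from each pretext task jointly shape one representation, which can then be reused for downstream action recognition.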
Results and Performance Evaluation
The paper's experiments were conducted on three datasets: NW-UCLA, NTU RGB+D, and PKU-MMD, under unsupervised, semi-supervised, fully-supervised, and transfer learning settings. Key results indicate:
- Unsupervised Learning: MS2L outperformed existing methods such as LongT GAN, highlighting the benefits of a multi-task self-supervised approach for capturing detailed skeleton dynamics without labeled data.
- Semi-Supervised Learning: With only small subsets of labeled data, MS2L continued to outperform the baselines with notable accuracy gains, validating the framework's ability to use limited annotations effectively.
- Fully-Supervised Learning: Training strategies built on MS2L, including pretraining followed by fine-tuning and joint training with the self-supervised tasks, improved action recognition over traditional methods, with ablation studies highlighting the synergy among the tasks.
- Transfer Learning: The framework also proved beneficial in transfer settings, attaining higher accuracy when the pretrained representation was transferred across datasets, which demonstrates generalizability under domain shift.
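For the unsupervised setting, a common evaluation protocol is the linear probe: freeze the pretrained encoder and train only a linear classifier on its features. The snippet below is a hypothetical illustration of that protocol, assuming the GRU-style encoder from the earlier sketch; the data loaders, optimizer, and number of epochs are illustrative choices, not the paper's reported setup.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a DataLoader of (sequence, label) batches."""
    encoder.eval()
    feats, labels = [], []
    for seq, y in loader:
        out, _ = encoder(seq.to(device))   # assumes a GRU-style encoder as in the earlier sketch
        feats.append(out[:, -1].cpu())     # last hidden state as the clip feature
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)


def linear_evaluation(encoder, train_loader, test_loader, num_classes,
                      epochs=50, lr=1e-2, device="cpu"):
    """Train only a linear classifier on frozen features and report test accuracy."""
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    clf = nn.Linear(x_tr.size(1), num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                                   # full-batch training for brevity
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(x_tr), y_tr)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (clf(x_te).argmax(dim=1) == y_te).float().mean().item()
    return acc
```

Higher linear-probe accuracy indicates that the self-supervised pretraining alone produces a more linearly separable feature space.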
Practical and Theoretical Implications
The introduction of MS2L offers tangible performance improvements for skeleton-based action recognition, contributing practically to real-world applications such as surveillance and theoretically to the advancement of self-supervised learning. Decomposing representation learning into multiple complementary tasks yields a model structure that can be adapted to various configurations and datasets. Looking forward, the framework could be extended to other data types and modalities, covering a broader range of actions.
In conclusion, this research demonstrates the value of integrating multiple self-supervised tasks to mitigate overfitting while enhancing feature diversity in skeleton-based action recognition. The framework delineated in this work offers a promising direction for future advances in self-supervised learning methodologies.