- The paper provides a comprehensive review of deep learning methods for human action recognition using diverse visual and non-visual data modalities.
- It examines the evolution from classical 2D CNNs and RNNs to advanced GNNs and Transformer-based models for effective spatio-temporal feature extraction.
- It emphasizes multi-modal fusion techniques that enhance recognition accuracy and outlines future research directions including unsupervised and few-shot learning.
Human Action Recognition from Various Data Modalities: An Overview
The paper "Human Action Recognition from Various Data Modalities: A Review" presents a detailed survey of the field of Human Action Recognition (HAR) leveraging different data modalities through deep learning approaches. HAR is a crucial component of computer vision, with applications ranging from video surveillance and autonomous vehicles to interactive systems and entertainment.
Data Modalities in HAR
The paper categorizes the data modalities used in HAR into visual and non-visual types. Visual modalities include RGB videos, skeleton sequences, depth maps, infrared videos, point cloud data, and event streams, each providing distinct advantages depending on the information it encapsulates and the application scenario. Non-visual modalities covered in the paper, such as audio signals, acceleration data, radar signatures, and WiFi signals, offer alternative or complementary perspectives, particularly when privacy concerns or environmental conditions limit the efficacy of visual sensing.
Deep Learning Approaches
The review highlights major advances in deep learning methods for each modality:
- RGB Modality: Deep learning frameworks have evolved from classical two-stream 2D CNNs, which combine spatial (RGB frame) and temporal (optical flow) features, to more sophisticated RNN- and 3D CNN-based models that better capture spatio-temporal dynamics (see the two-stream sketch after this list). Recent work also adopts Transformer-based architectures that improve long-range temporal modeling in videos.
- Skeleton Modality: Graph neural networks (GNNs), and graph convolutional networks (GCNs) in particular, are noted for their ability to model the graph-like structure of human joints and their connections (a minimal GCN layer is sketched below). RNN- and CNN-based models have also been adapted to encode the spatio-temporal information in joint sequences.
- Depth Modality: Methods in this area demonstrate the utility of 3D convolutional networks and dynamic image representations for extracting meaningful features from depth sequences, often further improved by fusing depth with skeleton data (a 3D CNN sketch follows the list).
- Point Cloud and Event Stream Modalities: These have seen the adoption of architectures suited to sparse yet information-rich 3D data, with networks such as PointNet for point clouds and spiking neural networks (SNNs) for event streams becoming prominent (see the PointNet-style sketch below).
- Infrared, Audio, Radar, and WiFi Modalities: These modalities are less dominant but serve essential niche applications. The surveyed approaches reflect innovative adaptations for thermal data, signal processing, and through-wall sensing in HAR.
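To make the two-stream idea concrete, here is a minimal sketch of decision-level fusion between a spatial stream over an RGB frame and a temporal stream over stacked optical-flow fields. PyTorch, the tiny backbone, the layer sizes, and the class count are all illustrative assumptions, not the architecture of any specific paper in the survey.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Two-stream network sketch: one 2D CNN over an RGB frame (spatial
    stream) and one over stacked optical-flow fields (temporal stream);
    class scores are averaged at the end (late fusion)."""

    def __init__(self, num_classes: int, flow_channels: int = 10):
        super().__init__()
        self.spatial = self._make_stream(3, num_classes)
        self.temporal = self._make_stream(flow_channels, num_classes)

    @staticmethod
    def _make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
        # A deliberately small stand-in for the deep 2D CNN backbones
        # (e.g. VGG/ResNet) used in real two-stream models.
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Decision-level fusion: average the per-stream class scores.
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))

model = TwoStreamFusion(num_classes=60)
rgb = torch.randn(2, 3, 224, 224)      # one sampled RGB frame per clip
flow = torch.randn(2, 10, 224, 224)    # 5 stacked (dx, dy) flow fields
print(model(rgb, flow).shape)          # torch.Size([2, 60])
```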
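For the skeleton modality, the following sketch shows the core of a graph-convolutional layer in the spirit of ST-GCN-style models: joint features are mixed with their neighbours via a normalized skeleton adjacency matrix and then linearly transformed. The five-joint graph and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One spatial graph-convolution step over a skeleton, applied
    independently at every frame of a joint sequence."""

    def __init__(self, in_features: int, out_features: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalize A + I so stacked layers stay stable.
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        self.register_buffer("a_norm", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, features); mix joints, then transform.
        return torch.relu(self.linear(torch.einsum("vw,btwf->btvf", self.a_norm, x)))

# Hypothetical 5-joint chain (e.g. hip-spine-neck-head plus one arm).
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
adj = torch.zeros(5, 5)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

layer = SkeletonGCNLayer(in_features=3, out_features=16, adjacency=adj)
x = torch.randn(2, 30, 5, 3)           # 30-frame sequences of 5 joints in 3D
print(layer(x).shape)                  # torch.Size([2, 30, 5, 16])
```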
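For the depth modality (and RGB clips alike), a 3D CNN convolves jointly over time and space, so appearance and short-range motion are learned together. The sketch below, with assumed layer sizes and a single-channel depth clip, shows the basic pattern only.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Minimal 3D CNN: kernels slide over (frames, height, width), treating
    a depth sequence as a single-channel video clip."""

    def __init__(self, num_classes: int, in_channels: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip))

model = Tiny3DCNN(num_classes=60)
depth_clip = torch.randn(2, 1, 16, 112, 112)   # 16-frame depth clips
print(model(depth_clip).shape)                 # torch.Size([2, 60])
```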
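For point clouds, the key PointNet idea is a shared per-point MLP followed by a symmetric pooling, which makes the output invariant to point ordering. This minimal sketch classifies a single static cloud and omits the temporal modeling an action-recognition pipeline would add; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """PointNet-style classifier: a shared per-point MLP, then max-pooling
    over points (order-invariant), then a classification head."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.point_mlp = nn.Sequential(       # applied to each point independently
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3); max over points gives one
        # global feature per cloud regardless of point order.
        global_feature = self.point_mlp(points).max(dim=1).values
        return self.head(global_feature)

model = MiniPointNet(num_classes=10)
cloud = torch.randn(4, 1024, 3)        # 4 clouds of 1024 (x, y, z) points
print(model(cloud).shape)              # torch.Size([4, 10])
```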
Multi-Modality and Fusion
A significant focus of the paper is on multi-modality approaches, in which fusing different data types is shown to enhance the robustness and accuracy of HAR. Fusion methods, applied at the feature level or at the decision level, exploit complementary information among modalities (both schemes are sketched below). Moreover, co-learning and cross-modal training approaches facilitate knowledge transfer between modalities, which can mitigate data-scarcity issues.
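The difference between the two fusion levels fits in a few lines. In this sketch, the embedding sizes, class count, and fusion weight are arbitrary placeholders; real systems tune or learn them.

```python
import torch
import torch.nn as nn

def feature_level_fusion(rgb_feat: torch.Tensor, skel_feat: torch.Tensor,
                         classifier: nn.Module) -> torch.Tensor:
    """Early/feature-level fusion: concatenate per-modality embeddings and
    let a single classifier learn cross-modal interactions."""
    return classifier(torch.cat([rgb_feat, skel_feat], dim=-1))

def decision_level_fusion(rgb_logits: torch.Tensor, skel_logits: torch.Tensor,
                          rgb_weight: float = 0.5) -> torch.Tensor:
    """Late/decision-level fusion: each modality is classified on its own
    and the class scores are combined, here by a weighted average."""
    return rgb_weight * rgb_logits + (1.0 - rgb_weight) * skel_logits

# Hypothetical sizes: 512-d RGB and 256-d skeleton embeddings, 60 classes.
fused_classifier = nn.Linear(512 + 256, 60)
rgb_feat, skel_feat = torch.randn(8, 512), torch.randn(8, 256)
print(feature_level_fusion(rgb_feat, skel_feat, fused_classifier).shape)  # (8, 60)

rgb_logits, skel_logits = torch.randn(8, 60), torch.randn(8, 60)
print(decision_level_fusion(rgb_logits, skel_logits).shape)               # (8, 60)
```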
Implications and Future Directions
This comprehensive review outlines several implications for HAR research. Developing large, diverse datasets remains a priority, particularly for training networks that generalize across environments and conditions. The challenge of efficient HAR is addressed with recommendations for models that maintain performance without prohibitive computational cost. Additionally, the paper suggests directions such as few-shot learning, unsupervised methods, and self-supervised learning, emphasizing the importance of reducing data-labeling effort while still advancing HAR capabilities.
In summary, the field of HAR continues to advance rapidly, with significant contributions from exploiting a wide range of data modalities and the integration of cutting-edge deep learning methodologies. The highlighted challenges and proposed future research directions provide a roadmap to address current limitations and fortify HAR applications in real-world scenarios.