Overview of "RGB-D-based Human Motion Recognition with Deep Learning: A Survey"
The paper "RGB-D-based Human Motion Recognition with Deep Learning: A Survey" provides an extensive overview of the developments in human motion recognition using RGB-D data, with a particular focus on deep learning methods. Authored by Wang et al., the paper categorizes the existing techniques into four primary groups based on data modalities: RGB-based, depth-based, skeleton-based, and combined RGB+D-based recognition. It discusses the respective strengths and challenges of these methodologies while considering recent breakthroughs and ongoing limitations.
Summary of the Paper
Core Focus
The survey centers on human motion recognition as enabled by increasingly capable RGB-D sensors such as the Microsoft Kinect and Asus Xtion. These sensors capture depth maps alongside RGB images; the depth channel is largely insensitive to illumination changes and provides 3D structural cues, making RGB-D data advantageous for motion recognition tasks.
Methodological Advances
- Categorization: The approaches are categorized based on the data modalities they utilize:
  - RGB-based methodologies leverage traditional image features and have been enhanced by CNNs, which excel at feature extraction.
  - Depth-based approaches make use of depth maps to derive structural information from scenes, often using CNNs to handle view variances.
  - Skeleton-based techniques focus on human joint positions as high-level features. These approaches have harnessed RNNs to model temporal dependencies effectively.
  - Combined RGB+D-based methods integrate multi-modal data, attempting to utilize the advantages of all modalities for improved action recognition.
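To make the combined RGB+D idea concrete, here is a minimal sketch of score-level (late) fusion, one common way to integrate modalities: each modality-specific classifier produces class scores, and the per-class probabilities are averaged. The modality names, scores, and weighting scheme below are illustrative assumptions, not a specific method from the survey.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(logits_per_modality, weights=None):
    """Average per-modality class probabilities (score-level fusion).

    logits_per_modality: dict mapping modality name -> raw class scores.
    weights: optional dict of per-modality weights (defaults to uniform).
    """
    names = list(logits_per_modality)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    fused = sum(weights[n] * softmax(np.asarray(logits_per_modality[n]))
                for n in names)
    return fused / sum(weights.values())

# Hypothetical raw scores for 4 action classes from three modality-specific models.
scores = {
    "rgb":      [2.0, 0.5, 0.1, -1.0],
    "depth":    [1.5, 1.0, 0.2, -0.5],
    "skeleton": [2.5, 0.0, 0.3, -1.2],
}
probs = late_fusion(scores)
pred = int(np.argmax(probs))  # all three modalities agree here, so class 0 wins
```

Feature-level (early) fusion, which concatenates modality features before classification, is the usual alternative; late fusion has the practical advantage that each modality's network can be trained independently.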
- Deep Learning Architectures: The paper reviews various deep learning architectures including CNNs, RNNs, and hybrid networks, which are pivotal in capturing spatio-temporal dynamics and structural details essential for human motion recognition.
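To picture the division of labor these architectures exploit, here is a toy NumPy sketch (not any specific network from the survey): per-frame feature vectors, which a CNN backbone would produce in practice, are fed to a simple Elman-style recurrent cell that accumulates temporal context into a clip-level descriptor. All shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One recurrent step: mix the current frame's features with the running state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def encode_sequence(frames, W_xh, W_hh, b_h):
    """Run the recurrent cell over a (T, feat_dim) sequence of frame features
    and return the final hidden state as a clip-level descriptor."""
    h = np.zeros(W_hh.shape[0])
    for x_t in frames:
        h = elman_step(x_t, h, W_xh, W_hh, b_h)
    return h

feat_dim, hidden_dim, T = 16, 8, 10
# In a real pipeline these per-frame vectors would come from a CNN backbone.
frames = rng.standard_normal((T, feat_dim))
W_xh = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

clip_descriptor = encode_sequence(frames, W_xh, W_hh, b_h)  # shape (hidden_dim,)
```

In practice the recurrent cell would be an LSTM or GRU to cope with longer sequences, and the final descriptor would feed a classification layer; the sketch only shows how spatial (per-frame) and temporal (across-frame) modeling compose.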
- Challenges: Despite progress, the paper acknowledges challenges such as insufficient labeled data, the need for view-invariance, handling occlusions, and temporal dynamics.
Key Numerical Results & Claims
The paper emphasizes that recent deep learning methods have achieved marked accuracy improvements on several benchmarks, demonstrating the capacity of such models to generalize in human motion recognition tasks. For instance, CNN-based and hybrid models have reported strong accuracy on challenging benchmarks such as NTU RGB+D. However, room for improvement remains, particularly for large-scale, continuous, and online recognition settings.
Implications and Future Directions
Practically, advances in motion recognition are driving critical applications such as surveillance systems, human-robot interaction, and smart environments. Theoretically, integrating different modalities, coping with small labeled datasets, and improving occlusion handling are primary areas for future research.
The survey highlights several promising research avenues:
- Hybrid Networks: Combining CNNs and RNNs more effectively can unlock richer, multi-faceted analysis of spatio-temporal dynamics.
- Multimodal Fusion: Effective fusion techniques could better leverage the complementary nature of RGB, depth, and skeleton data.
- Robust Learning: Developing algorithms that require minimal data for training or can learn incrementally from new data streams could revolutionize practical applications.
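One way to picture the incremental-learning direction is a nearest-class-mean classifier whose per-class prototypes are updated as new labeled clips stream in, so the model improves without retraining from scratch. This is my own illustrative sketch under that assumption, not a method described in the survey; the class names and 2-D descriptors are made up.

```python
import numpy as np

class NearestMeanClassifier:
    """Incremental nearest-class-mean classifier over clip descriptors."""

    def __init__(self):
        self.sums = {}    # label -> running feature sum
        self.counts = {}  # label -> number of examples seen

    def update(self, feature, label):
        # Fold one new labeled example into that class's prototype.
        feature = np.asarray(feature, dtype=float)
        if label not in self.sums:
            self.sums[label] = np.zeros_like(feature)
            self.counts[label] = 0
        self.sums[label] += feature
        self.counts[label] += 1

    def predict(self, feature):
        # Assign the label whose prototype (class mean) is closest.
        feature = np.asarray(feature, dtype=float)
        means = {l: self.sums[l] / self.counts[l] for l in self.sums}
        return min(means, key=lambda l: np.linalg.norm(feature - means[l]))

clf = NearestMeanClassifier()
# Hypothetical 2-D descriptors for two action classes, arriving one at a time.
clf.update([0.0, 0.0], "wave")
clf.update([0.2, 0.1], "wave")
clf.update([5.0, 5.0], "jump")
print(clf.predict([0.1, 0.0]))  # -> "wave"
```

The appeal of prototype-style updates in an online setting is that each new example costs a single vector addition, and new action classes can be added at any time without touching existing prototypes.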
Conclusion
The paper by Wang et al. is an insightful resource for computer vision researchers working on human motion recognition using deep learning with RGB-D data. Its exhaustive treatment of methodologies, insights into associated challenges, and discussion of future work lay a groundwork that can guide research toward innovative solutions in this thriving area of research.