NTU RGB+D: A Large-Scale Dataset for 3D Human Activity Analysis
The paper "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis" by Shahroudy et al. presents a significant contribution to the domain of 3D human activity recognition. The authors introduce the NTU RGB+D dataset, which is considerably larger and more varied than existing datasets, designed to drive advancements in depth-based human activity analysis.
Dataset Overview
The NTU RGB+D dataset comprises 56,880 video samples and over 4 million frames collected from 40 distinct subjects. The data were acquired with Microsoft Kinect v2 sensors across 80 camera viewpoints, providing RGB videos, depth map sequences, 3D skeletal data (25 joints per tracked body), and infrared frames. The 60 action classes span daily activities, mutual interactions, and health-related actions. This diverse action set, together with the wide age range of participants and varied recording setups, yields a high degree of intra-class and inter-class variation.
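To make the data layout concrete, below is a minimal sketch of reading the 3D joints from one of the released .skeleton files. It assumes the plain-text layout of the public release (a frame count, then per-frame body records with a 10-field metadata line and 25 joint lines of 12 values each); treat the exact field counts as an assumption to verify against the release notes.

```python
import numpy as np

def read_skeleton(path):
    """Parse an NTU RGB+D .skeleton file into per-frame lists of (25, 3) joints.

    Assumed layout (verify against the release): the first token is the frame
    count; each frame starts with a body count; each body has 10 metadata
    tokens, a joint count (25 for Kinect v2), and 12 values per joint, the
    first three being the camera-space x, y, z coordinates in meters.
    """
    with open(path) as f:
        tokens = iter(f.read().split())
    frames = []
    for _ in range(int(next(tokens))):          # number of frames
        bodies = []
        for _ in range(int(next(tokens))):      # bodies in this frame
            for _ in range(10):                 # body ID + tracking/lean fields
                next(tokens)
            num_joints = int(next(tokens))      # 25 for Kinect v2
            joints = np.empty((num_joints, 3))
            for j in range(num_joints):
                vals = [float(next(tokens)) for _ in range(12)]
                joints[j] = vals[:3]            # keep only the 3D coordinates
            bodies.append(joints)
        frames.append(bodies)
    return frames
```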
Addressing Existing Limitations
Current RGB+D datasets have limitations such as small sample sizes, restricted camera views, and limited subject diversity. These constraints hinder the development and evaluation of advanced, data-hungry algorithms. NTU RGB+D addresses these by:
- Scale: Significantly increasing the number of samples and action classes.
- Subject Diversity: Capturing performances from 40 subjects aged between 10 and 35 years.
- Camera Views: Recording with three cameras simultaneously at horizontal angles of -45°, 0°, and +45°, with heights and distances varied across 17 collection setups, for 80 distinct viewpoints in total.
- Data Modalities: Providing synchronized and aligned RGB, depth, infrared, and skeletal data.
Experimental Methodology
The authors benchmark several state-of-the-art approaches on the NTU RGB+D dataset:
- Depth-Map Features: HOG², Super Normal Vector (SNV), and HON4D extract features directly from the depth maps.
- Skeleton-Based Features: Lie Group, Skeletal Quads, and FTP Dynamic Skeletons operate on the provided 3D skeletal data.
- Recurrent Neural Networks (RNNs): Plain RNN and LSTM baselines are evaluated, given their strength in sequence learning; a minimal LSTM baseline is sketched after this list.
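To make the recurrent baselines concrete, here is a minimal skeleton-sequence LSTM classifier in PyTorch. It is a sketch rather than the paper's exact configuration: the hidden size, layer count, and single-body input (25 joints × 3 coordinates = 75 features per frame) are illustrative choices.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """Plain LSTM baseline: per-frame joint coordinates -> action logits."""

    def __init__(self, num_joints=25, num_classes=60, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * 3, hidden, layers, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, time, 25, 3) 3D joint positions of a single body
        b, t = x.shape[:2]
        out, _ = self.lstm(x.reshape(b, t, -1))   # flatten joints per frame
        return self.fc(out[:, -1])                # classify from the last step

model = SkeletonLSTM()
logits = model(torch.randn(8, 100, 25, 3))        # 8 clips of 100 frames each
print(logits.shape)                               # torch.Size([8, 60])
```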
Part-Aware LSTM Network
A notable methodological contribution is the Part-Aware LSTM (P-LSTM). The model exploits the physical structure of the human body by splitting the long-term memory cell into sub-cells, one per body-part group (torso, the two arms, and the two legs), each with its own input and forget gates, while the output is computed over the concatenated part cells. This lets each part's dynamics be modeled more directly while shrinking the parameter space, which counteracts overfitting, a common issue in large-scale data-driven models.
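The following sketch captures the part-aware idea, not the authors' exact implementation: each part group keeps its own input/forget gates and cell state driven by that part's joints plus the shared hidden state, and a shared output gate reads the concatenated part cells. The five-part grouping and all sizes below are illustrative.

```python
import torch
import torch.nn as nn

class PartAwareLSTMCell(nn.Module):
    """Sketch of a part-aware LSTM cell: one memory sub-cell per body part."""

    def __init__(self, part_sizes, cell_size):
        super().__init__()
        h_size = len(part_sizes) * cell_size      # hidden = concat of part cells
        # per-part gates: input, forget, candidate (3 * cell_size outputs each)
        self.part_gates = nn.ModuleList(
            nn.Linear(p + h_size, 3 * cell_size) for p in part_sizes
        )
        self.out_gate = nn.Linear(sum(part_sizes) + h_size, h_size)

    def forward(self, parts, state):
        # parts: list of (batch, part_size) tensors; state: (h, per-part cells)
        h, cells = state
        new_cells = []
        for x_p, c_p, gate in zip(parts, cells, self.part_gates):
            i, f, g = gate(torch.cat([x_p, h], dim=1)).chunk(3, dim=1)
            c_p = torch.sigmoid(f) * c_p + torch.sigmoid(i) * torch.tanh(g)
            new_cells.append(c_p)
        o = torch.sigmoid(self.out_gate(torch.cat(parts + [h], dim=1)))
        return o * torch.tanh(torch.cat(new_cells, dim=1)), new_cells

# Illustrative grouping of 25 Kinect v2 joints (x 3 coords): torso (9 joints),
# two arms (5 joints each), two legs (3 joints each) -> 27 + 15 + 15 + 9 + 9 = 75.
sizes, cell_size, batch = [27, 15, 15, 9, 9], 64, 4
cell = PartAwareLSTMCell(sizes, cell_size)
h = torch.zeros(batch, len(sizes) * cell_size)
cells = [torch.zeros(batch, cell_size) for _ in sizes]
h, cells = cell([torch.randn(batch, s) for s in sizes], (h, cells))
```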
Results and Implications
The experimental results clearly demonstrate the efficacy of data-driven learning on NTU RGB+D. Under the two standard protocols, cross-subject and cross-view (a helper for both splits is sketched below), the proposed P-LSTM outperforms the hand-crafted baselines as well as plain RNNs and LSTMs, reaching 62.93% and 70.27% accuracy, respectively. This performance underscores the potential of NTU RGB+D to drive the development of robust human activity recognition systems.
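For reproducibility, here is a small helper implementing both protocols. It assumes the release's file naming scheme SsssCcccPpppRrrrAaaa (setup, camera, performer, replication, action); the cross-subject training IDs are the twenty listed in the paper, and cross-view trains on cameras 2 and 3 and tests on camera 1.

```python
import re

# Training subject IDs for the cross-subject protocol, as listed in the paper.
CS_TRAIN = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
            25, 27, 28, 31, 34, 35, 38}

# Assumed naming scheme of the release, e.g. S001C002P003R002A013.skeleton
NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def split_of(filename, protocol="cross_subject"):
    """Return 'train' or 'test' for one sample under the chosen protocol."""
    setup, camera, performer, rep, action = map(int, NAME.search(filename).groups())
    if protocol == "cross_subject":
        return "train" if performer in CS_TRAIN else "test"
    return "train" if camera in (2, 3) else "test"   # cross-view protocol

print(split_of("S001C002P003R002A013.skeleton"))                 # test (subject 3)
print(split_of("S001C002P003R002A013.skeleton", "cross_view"))   # train (camera 2)
```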
Future Directions
The NTU RGB+D dataset opens several avenues for future research:
- Enhanced Models: Exploring advanced deep learning architectures and attention mechanisms to further improve action recognition accuracy.
- Multimodal Fusion: Investigating the fusion of RGB, depth, and infrared data to exploit complementary information across modalities.
- Real-World Applications: Applying developed methods to real-world scenarios, including healthcare, surveillance, and human-computer interaction.
In conclusion, the NTU RGB+D dataset is a valuable resource for pushing the boundaries of human activity recognition research. By offering an extensive and varied benchmark, it addresses the limitations of earlier datasets and provides a solid foundation for developing and evaluating advanced 3D action analysis methods. The P-LSTM model further exemplifies the kind of architectural innovation this scale of data makes practical.