Overview of "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding"
The paper "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding" introduces an extensive dataset designed to advance research on 3D human activity recognition. Collected and organized by Jun Liu and colleagues, the dataset addresses several limitations of earlier benchmarks by providing a large number of diverse video samples that capture a wide range of human activities across multiple camera views and environmental conditions.
Key Contributions
- Dataset Scale and Diversity:
  - The NTU RGB+D 120 dataset contains 114,480 video samples captured from 106 different subjects, making it one of the largest datasets for 3D human activity recognition.
  - It includes data from 120 distinct action classes, which are categorized into daily, mutual, and health-related activities. This diverse categorization aims to provide a comprehensive benchmark for evaluating various action recognition methods.
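In the public release, each sample's metadata is encoded directly in its name using the pattern SsssCcccPpppRrrrAaaa (setup, camera, performer, replication, action). A small parser for this convention, sketched here assuming the standard naming scheme of the dataset release:

```python
import re

# Standard NTU RGB+D sample name pattern: SsssCcccPpppRrrrAaaa
# sss = setup, ccc = camera, ppp = performer (subject),
# rrr = replication, aaa = action class
_NTU_NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_ntu_name(name: str) -> dict:
    """Extract setup/camera/subject/replication/action IDs from a sample name."""
    m = _NTU_NAME.search(name)
    if m is None:
        raise ValueError(f"not an NTU-style sample name: {name!r}")
    setup, camera, subject, rep, action = map(int, m.groups())
    return {"setup": setup, "camera": camera, "subject": subject,
            "replication": rep, "action": action}
```

For example, `parse_ntu_name("S018C003P008R002A120")` yields setup 18, subject 8, and action class 120.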
- Multiple Data Modalities:
  - The dataset is composed of four primary data modalities: depth maps, 3D skeletal information, RGB frames, and infrared sequences. These multi-modal data sources allow researchers to develop and evaluate diverse approaches for action recognition that may exploit different types of information.
- Evaluation Protocols:
  - Two standard evaluation criteria are defined: cross-subject and cross-setup evaluations. The cross-subject evaluation ensures that the training and testing sets come from different subjects, emphasizing generalization across individuals. The cross-setup evaluation uses different environmental setups for training and testing, promoting robustness to varying conditions.
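In code, both protocols reduce to simple filters over per-sample metadata. A minimal sketch, where the training-subject set is a placeholder (the paper specifies the exact 53 training subject IDs) and the cross-setup rule follows the paper's convention of training on even-numbered setups and testing on odd-numbered ones:

```python
def cross_subject_split(samples, train_subjects):
    """Split by performer: training and test sets share no subjects.

    samples: dicts with "subject" and "setup" IDs;
    train_subjects: the protocol's set of training subject IDs.
    """
    train = [s for s in samples if s["subject"] in train_subjects]
    test = [s for s in samples if s["subject"] not in train_subjects]
    return train, test

def cross_setup_split(samples):
    """Split by collection setup: even setup IDs train, odd setup IDs test."""
    train = [s for s in samples if s["setup"] % 2 == 0]
    test = [s for s in samples if s["setup"] % 2 == 1]
    return train, test
```

Because the splits are disjoint on subject (or setup) IDs, a model cannot score well by memorizing individual performers or recording environments.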
- Performance of Existing Methods:
  - The paper evaluates several state-of-the-art 3D action recognition methods on the proposed dataset. Techniques such as Spatio-Temporal LSTM (ST-LSTM) and GCA-LSTM show varied performance, indicating the challenges posed by the dataset. The best-performing method achieves accuracies of around 64.6% (cross-subject) and 66.9% (cross-setup).
- Fusion of Data Modalities:
  - The research shows that combining different data modalities (e.g., RGB video, depth video, and 3D skeleton data) improves action recognition performance over the corresponding single-modality baselines, reaching up to 64.0% accuracy for cross-subject evaluation and 66.1% for cross-setup.
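A common way to realize such a combination is late score fusion: train one classifier per modality and average their class-score vectors at test time. The sketch below illustrates this generic strategy; it is not necessarily the exact fusion scheme used in the paper:

```python
def fuse_scores(per_modality_scores):
    """Average class-score vectors from several modality-specific classifiers.

    per_modality_scores: list of equal-length score lists, one per modality
    (e.g. RGB, depth, skeleton). Returns (fused_scores, predicted_class).
    """
    n_mod = len(per_modality_scores)
    n_cls = len(per_modality_scores[0])
    fused = [sum(scores[c] for scores in per_modality_scores) / n_mod
             for c in range(n_cls)]
    pred = max(range(n_cls), key=fused.__getitem__)
    return fused, pred

# Hypothetical per-class scores from three modality classifiers:
rgb = [0.1, 0.7, 0.2]
depth = [0.2, 0.5, 0.3]
skeleton = [0.3, 0.4, 0.3]
fused, pred = fuse_scores([rgb, depth, skeleton])  # pred == 1
```

Averaging lets a modality that is confident and correct (here, RGB) outvote modalities whose evidence is weaker, which is one intuition behind the fusion gains reported above.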
- One-Shot Learning Framework:
  - A novel Action-Part Semantic Relevance-aware (APSR) framework is introduced, which utilizes the semantic relevance between body parts and action classes for one-shot learning. This framework outperforms existing methods for one-shot 3D action recognition tasks, demonstrating its potential for recognizing new action classes with minimal training data.
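The APSR specifics (weighting body parts by their semantic relevance to the novel action's name) are developed in the paper; as background, the one-shot evaluation itself reduces to matching a query feature against a single exemplar per novel class. A minimal nearest-exemplar sketch using cosine similarity (illustrative only, not the APSR method):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_shot_classify(query, exemplars):
    """exemplars: dict mapping class name -> its single exemplar feature vector.

    Returns the novel class whose exemplar is most similar to the query.
    """
    return max(exemplars, key=lambda cls: cosine(query, exemplars[cls]))

# Hypothetical 2-D features for two novel classes, one exemplar each:
exemplars = {"wave": [1.0, 0.0], "kick": [0.0, 1.0]}
one_shot_classify([0.9, 0.1], exemplars)  # -> "wave"
```

Because only one labeled example exists per novel class, the quality of the feature extractor (which APSR improves via part-level semantic weighting) determines recognition accuracy almost entirely.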
Implications and Future Directions
The introduction of the NTU RGB+D 120 dataset has several important implications for the field of human activity recognition:
- Benchmark for Deep Learning Models:
  - The dataset's scale and diversity make it a critical benchmark for evaluating deep learning models. Researchers can utilize it for pre-training models, which may generalize better when applied to other, smaller datasets or specific applications.
- Enabling Advanced Techniques:
  - The comprehensive dataset facilitates the exploration and development of advanced techniques, such as multi-modal data fusion, cross-modal transfer learning, and early action recognition. These areas represent significant opportunities for future research.
- Encouraging Robustness:
  - The diverse conditions under which the data is collected, including various backgrounds and lighting conditions, push developers to create more robust and generalizable models. This robustness is crucial for real-world applications where environments are rarely controlled.
- Improving One-Shot Learning:
  - The APSR framework showcases an innovative approach to one-shot learning, an essential capability for systems that must adapt to new actions with limited data. This line of research has the potential to greatly enhance the versatility of human activity recognition systems.
Conclusion
The NTU RGB+D 120 dataset represents a significant advancement in the resources available for 3D human activity recognition research. By addressing the limitations of previous datasets through its scale, diversity, and the inclusion of multiple data modalities, it provides a robust benchmark for developing and evaluating sophisticated action recognition methods. The proposed APSR framework further highlights the potential for innovation in one-shot learning, paving the way for more adaptive and efficient machine learning models. As researchers continue to leverage this dataset, there is a strong potential for notable advancements in the theoretical and practical aspects of artificial intelligence and human activity recognition.