Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos
The paper, "Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos," presents a novel method for enhancing the accuracy of action recognition tasks by leveraging the complementary strengths of RGB and depth (D) modalities. Recent advances in action recognition have predominantly focused on either RGB or depth sequences; however, each modality showcases distinct advantages and limitations. For instance, RGB sequences effectively capture texture and appearance details, whereas depth sequences provide insight into 3D structural information. This paper proposes a comprehensive approach that integrates both modalities to capitalize on their respective strengths.
The core of the proposed methodology is a deep autoencoder-based shared-specific feature factorization network, which transforms RGB+D inputs into a hierarchy of shared and modality-specific components. This network systematically disentangles the mixed-modal signals, exposing the cross-modal and modality-specific components that matter for action classification. On top of this factorization, the authors introduce structured sparsity learning, which uses mixed norms to regularize the classifier so that it exploits both the shared and the modality-specific components.
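To make the factorization concrete, the following is a minimal sketch of one shared-specific autoencoder layer in PyTorch. It is not the authors' implementation: the class name SharedSpecificLayer, the choice of linear encoders with sigmoid activations, and the averaging of the two shared projections are illustrative assumptions.

```python
# A minimal sketch of one shared-specific factorization layer (assumed design,
# not the paper's exact architecture). Each modality is encoded into a shared
# code and a modality-specific code; each modality is then reconstructed from
# its shared + specific codes, which is what the autoencoder objective penalizes.
import torch
import torch.nn as nn

class SharedSpecificLayer(nn.Module):
    def __init__(self, rgb_dim, depth_dim, shared_dim, specific_dim):
        super().__init__()
        # Per-modality encoders for the shared and the specific components
        self.enc_shared_rgb = nn.Linear(rgb_dim, shared_dim)
        self.enc_shared_depth = nn.Linear(depth_dim, shared_dim)
        self.enc_spec_rgb = nn.Linear(rgb_dim, specific_dim)
        self.enc_spec_depth = nn.Linear(depth_dim, specific_dim)
        # Decoders reconstruct each modality from its shared + specific codes
        self.dec_rgb = nn.Linear(shared_dim + specific_dim, rgb_dim)
        self.dec_depth = nn.Linear(shared_dim + specific_dim, depth_dim)
        self.act = nn.Sigmoid()

    def forward(self, x_rgb, x_depth):
        # The shared code should agree across modalities; here we simply average
        # the two projections (a cross-reconstruction loss is another option).
        shared = 0.5 * (self.act(self.enc_shared_rgb(x_rgb)) +
                        self.act(self.enc_shared_depth(x_depth)))
        spec_rgb = self.act(self.enc_spec_rgb(x_rgb))
        spec_depth = self.act(self.enc_spec_depth(x_depth))
        # Reconstructions used by the autoencoder loss (e.g., mean squared error)
        rec_rgb = self.dec_rgb(torch.cat([shared, spec_rgb], dim=1))
        rec_depth = self.dec_depth(torch.cat([shared, spec_depth], dim=1))
        return shared, spec_rgb, spec_depth, rec_rgb, rec_depth
```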
In building this framework, the authors move beyond traditional linear techniques such as Canonical Correlation Analysis (CCA) and its variants, whose linear modeling limits the cross-modal structure they can capture. By incorporating non-linearities through a deep network, the proposed system captures complex multimodal feature correlations that linear approaches struggle with. Moreover, the paper demonstrates that stacking layers of non-linear shared-specific analysis progressively extracts higher-level, more abstract representations, which improve action recognition accuracy on five benchmark datasets.
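Building on the layer sketched above, one plausible way to obtain such a hierarchy is to feed each level's modality-specific codes into the next factorization layer and collect the shared and specific components from every level. The stacking scheme and the dimensions below are assumptions for illustration, not the paper's exact configuration.

```python
# Assumed stacking scheme: each level refactorizes the previous level's
# modality-specific codes, and the shared/specific components of all levels
# together form the hierarchical representation used for classification.
class StackedSharedSpecific(nn.Module):
    def __init__(self, rgb_dim, depth_dim, dims):
        super().__init__()
        layers = []
        r_dim, d_dim = rgb_dim, depth_dim
        for shared_dim, specific_dim in dims:
            layers.append(SharedSpecificLayer(r_dim, d_dim, shared_dim, specific_dim))
            r_dim = d_dim = specific_dim  # next level factorizes the specific codes
        self.layers = nn.ModuleList(layers)

    def forward(self, x_rgb, x_depth):
        components = []
        for layer in self.layers:
            shared, spec_rgb, spec_depth, _, _ = layer(x_rgb, x_depth)
            components.append((shared, spec_rgb, spec_depth))
            x_rgb, x_depth = spec_rgb, spec_depth
        return components  # one (shared, specific_rgb, specific_depth) tuple per level

# Illustrative usage with random features; all dimensions are arbitrary.
model = StackedSharedSpecific(rgb_dim=512, depth_dim=512, dims=[(128, 128), (64, 64)])
hierarchy = model(torch.randn(8, 512), torch.randn(8, 512))
```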
Experimental results show the framework consistently achieving state-of-the-art performance across a range of challenging test scenarios. Its strong accuracy, particularly on datasets such as MSR-DailyActivity3D and NTU RGB+D, underlines the effectiveness of the framework. The deep network's hierarchical feature extraction is complemented by a structured sparsity-based learning machine, which yields robust classification by selecting and weighting components and layers.
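The mixed-norm idea can be illustrated with a group-lasso-style penalty on the classifier weights: an L2 norm within each component's block of weights, summed (L1) across blocks, drives entire components toward zero and thus acts as a soft selector over components and layers. The exact norm and grouping in the paper may differ; mixed_norm_penalty below is only an assumed sketch.

```python
# Assumed group-lasso-style mixed norm over blocks of classifier weights,
# where each block corresponds to one component/layer of the hierarchy.
import torch

def mixed_norm_penalty(weight, group_sizes):
    """weight: (n_classes, total_dim) classifier matrix; group_sizes: the
    feature dimension contributed by each component/layer group."""
    penalty = weight.new_zeros(())
    start = 0
    for size in group_sizes:
        block = weight[:, start:start + size]
        penalty = penalty + block.norm(p=2)  # L2 within each block...
        start += size
    return penalty                           # ...summed (L1) across blocks

# Training would then minimize: classification_loss + lam * mixed_norm_penalty(W, sizes),
# e.g. W = torch.randn(10, 384, requires_grad=True); sizes = [128, 128, 64, 64].
```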
The implications of this paper are manifold. Practically, enhanced action recognition can be deployed in diverse applications, such as intelligent surveillance systems, human-computer interaction, and healthcare monitoring. Theoretically, it underscores the significance of joint modal analysis and the potential of deep networks in handling non-linear, high-dimensional multimodal signal complexities. Future developments might explore extending this framework by incorporating additional modalities, enhancing its applicability in real-world scenarios with more intricate human action dynamics. Moreover, further performance gains might be achieved by integrating advances in real-time processing and unsupervised learning approaches, thus expanding the horizons of AI-driven action recognition.