Multimodal Deep Learning for Robust RGB-D Object Recognition
The paper presents an architecture for object recognition from RGB-D data, focusing on maintaining performance despite imperfect sensor inputs, an inherent challenge in real-world robotic applications. The researchers propose a dual-stream Convolutional Neural Network (CNN) framework in which each stream processes one modality, RGB or depth. The streams are then combined using a late fusion strategy, enabling the network to learn to effectively integrate the distinct features each modality offers.
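To make the late-fusion design concrete, here is a minimal PyTorch sketch of a two-stream network whose per-modality features are concatenated and passed through a learned fusion layer. The backbone choice (AlexNet), feature dimensions, and fusion-layer width are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamRGBDNet(nn.Module):
    """Two-stream CNN with late fusion: one stream per modality.

    Backbone and fusion width are illustrative assumptions; the paper uses an
    ImageNet-pretrained CNN of similar structure, not necessarily this one.
    """
    def __init__(self, num_classes, fusion_dim=4096):
        super().__init__()
        # One ImageNet-pretrained backbone per modality (depth is encoded as 3 channels).
        self.rgb_stream = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.depth_stream = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        # Drop each backbone's final classification layer, keeping 4096-d features.
        self.rgb_stream.classifier = nn.Sequential(*list(self.rgb_stream.classifier.children())[:-1])
        self.depth_stream.classifier = nn.Sequential(*list(self.depth_stream.classifier.children())[:-1])
        # Late fusion: concatenate per-stream features, then learn a joint representation.
        self.fusion = nn.Sequential(
            nn.Linear(4096 * 2, fusion_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(fusion_dim, num_classes),
        )

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb)        # (B, 4096)
        f_depth = self.depth_stream(depth)  # (B, 4096)
        return self.fusion(torch.cat([f_rgb, f_depth], dim=1))
```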
Architecture and Methodology
The architecture extends CNNs from traditional RGB image recognition to include depth information. The authors employ a multi-stage training strategy: each modality stream is first trained individually, followed by a joint fine-tuning phase. This approach builds on networks pre-trained on ImageNet, making deep CNNs usable despite the limited availability of labeled depth data.
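A simplified sketch of this staged training procedure, assuming the TwoStreamRGBDNet sketch above and standard PyTorch data loaders; the helper names, optimizer settings, and epoch counts are hypothetical, not taken from the paper.

```python
import torch.nn as nn
import torch.optim as optim

# Stage 1: train one modality stream in isolation with a temporary classifier head.
def train_single_stream(stream, loader, num_classes, epochs=10, lr=1e-3):
    head = nn.Linear(4096, num_classes)
    opt = optim.SGD(list(stream.parameters()) + list(head.parameters()), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:           # x: images of one modality, y: class labels
            opt.zero_grad()
            loss = loss_fn(head(stream(x)), y)
            loss.backward()
            opt.step()
    return stream

# Stage 2: jointly fine-tune the fused two-stream network at a lower learning rate.
def finetune_fusion(model, loader, epochs=5, lr=1e-4):
    opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for rgb, depth, y in loader:  # paired RGB and encoded-depth images
            opt.zero_grad()
            loss = loss_fn(model(rgb, depth), y)
            loss.backward()
            opt.step()
    return model
```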
Key contributions include:
- Depth Data Encoding: A method that renders single-channel depth images as pseudo-RGB images, allowing the reuse of CNNs pre-trained on RGB data. It offers a computationally efficient alternative to more complex representations such as HHA encoding (see the sketch after this list).
- Data Augmentation: A scheme that improves robustness by simulating real-world depth sensor noise patterns during training, which helps the model handle occlusions and sensor inaccuracies (also illustrated in the sketch below).
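Both ideas can be approximated in a few lines of NumPy/OpenCV. The jet colormap, normalization scheme, and noise parameters below are assumptions for illustration; in particular, the paper's augmentation aims to mimic realistic sensor noise patterns, whereas this sketch uses purely synthetic dropout and occlusion patches.

```python
import numpy as np
import cv2  # OpenCV, used here for the colormap

def depth_to_pseudo_rgb(depth):
    """Encode a single-channel depth map as a 3-channel pseudo-RGB image.

    Normalizes valid depth values to [0, 255] and applies a colormap so the
    result can be fed to a CNN pretrained on RGB images. The jet colormap and
    min-max normalization are illustrative choices.
    """
    valid = depth > 0                      # 0 typically marks missing readings
    if not valid.any():
        return np.zeros((*depth.shape, 3), dtype=np.uint8)
    d = depth.astype(np.float32)
    d_min, d_max = d[valid].min(), d[valid].max()
    norm = np.zeros_like(d)
    norm[valid] = (d[valid] - d_min) / max(d_max - d_min, 1e-6) * 255.0
    return cv2.applyColorMap(norm.astype(np.uint8), cv2.COLORMAP_JET)

def add_depth_noise(depth, dropout_prob=0.1, patch_size=20, num_patches=3, rng=None):
    """Augment a 2D depth map with synthetic noise: scattered missing pixels
    plus rectangular holes that roughly mimic occlusion patterns.
    Parameter values are illustrative, not taken from the paper."""
    rng = rng or np.random.default_rng()
    noisy = depth.copy()
    noisy[rng.random(depth.shape) < dropout_prob] = 0    # scattered dropout
    h, w = depth.shape
    for _ in range(num_patches):                          # block-shaped missing regions
        y = rng.integers(0, h - patch_size)
        x = rng.integers(0, w - patch_size)
        noisy[y:y + patch_size, x:x + patch_size] = 0
    return noisy
```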
Experimental Evaluation
The method was evaluated on the Washington RGB-D Object Dataset, where it achieved 91.3% accuracy, outperforming existing methods. This result underscores the efficacy of fusing RGB and depth data, combined with the depth-specific encoding and noise augmentation.
In addition, the authors evaluated their method on the RGB-D Scenes dataset to assess robustness in noisy, real-world environments. Recognition accuracy improved when the proposed data augmentation was used, particularly for partially occluded or noisy objects.
Implications and Future Directions
Practically, this research holds significant potential for robotics, especially in environments where reliable object recognition is needed under challenging sensory conditions. Theoretically, it advances multimodal learning by demonstrating that intelligently combining sensory inputs can yield better results than either modality alone.
Future developments could further refine depth-data processing, explore larger and more diverse datasets, and investigate the integration of additional sensory modalities. These efforts would enhance the robustness and reliability of autonomous systems in dynamic and unpredictable environments.
Overall, this paper contributes a significant step forward in multimodal deep learning, addressing key challenges in RGB-D object recognition through innovative architectural and training methodologies.