- The paper presents a geocentric encoding that converts each depth image into three geometric channels (horizontal disparity, height above ground, and angle with gravity) to improve detection and segmentation.
- It integrates these features into an R-CNN framework adapted for RGB-D input, achieving a 56% relative improvement in object detection average precision on the NYUD2 dataset.
- By augmenting the limited training data with synthetic samples, the approach also improves pixel-level instance and semantic segmentation in cluttered indoor scenes.
Learning Rich Features from RGB-D Images for Object Detection and Segmentation
The paper "Learning Rich Features from RGB-D Images for Object Detection and Segmentation" by Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik presents an in-depth examination of object detection and instance segmentation using RGB-D images. This research aims to enhance the understanding and utilization of 3D depth data combined with 2D RGB images for more effective object detection and segmentation in cluttered environments.
Methodology
The proposed approach marks a clear departure from existing methods that treat depth as a fourth channel akin to the color channels. Instead, the authors introduce a geocentric embedding for depth images, referred to as HHA, that encodes three geometric properties per pixel: horizontal disparity, height above the ground, and angle with the gravity vector.
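To make the encoding concrete, here is a minimal numpy sketch of this kind of geocentric encoding. It is not the paper's implementation: the paper estimates the gravity direction from the scene itself, whereas this sketch assumes a fixed gravity vector, a known focal length `f_px`, and hypothetical `baseline_m` and `camera_height_m` constants.

```python
import numpy as np

def geocentric_encoding(depth_m, f_px, gravity=np.array([0.0, 1.0, 0.0]),
                        baseline_m=0.075, camera_height_m=1.0):
    """Encode a metric depth map (h, w) into three HHA-style channels:
    horizontal disparity, height above ground, angle with gravity."""
    h, w = depth_m.shape
    safe_depth = np.maximum(depth_m, 1e-6)

    # Channel 1: horizontal disparity, inversely proportional to depth,
    # as for a stereo rig with the assumed baseline and focal length.
    disparity = (baseline_m * f_px) / safe_depth

    # Back-project pixels into 3D camera coordinates (pinhole model).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - w / 2.0) * depth_m / f_px
    y = (v - h / 2.0) * depth_m / f_px
    points = np.stack([x, y, depth_m], axis=-1)              # (h, w, 3)

    # Channel 2: height above ground, i.e. distance along the gravity
    # direction offset by the assumed camera height.
    height = camera_height_m - points @ gravity

    # Channel 3: angle with gravity. Estimate surface normals from
    # local point-cloud gradients, then take the angle between each
    # normal and the gravity direction.
    dpdx = np.gradient(points, axis=1)
    dpdy = np.gradient(points, axis=0)
    normals = np.cross(dpdx, dpdy)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-9
    angle = np.degrees(np.arccos(np.clip(normals @ gravity, -1.0, 1.0)))

    # Rescale each channel to 8-bit range so the triple can be fed to
    # a CNN exactly like an RGB image.
    def to_uint8(c):
        lo, hi = np.percentile(c, [1, 99])
        return np.clip(255.0 * (c - lo) / (hi - lo + 1e-9), 0, 255).astype(np.uint8)

    return np.dstack([to_uint8(disparity), to_uint8(height), to_uint8(angle)])
```

The design point carried over from the paper is that each channel varies smoothly with quantities a network can exploit (support surfaces, vertical structures, nearby objects), whereas raw depth values conflate them.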
The authors generalize the R-CNN architecture for RGB-D data, leveraging the added depth information to create more robust feature representations for object detection. The paper includes several critical components:
- RGB-D Contour Detection and 2.5D Region Proposals: This module computes depth and normal gradients and feeds them into a structured learning framework to produce improved contours. These contours in turn drive the generation of 2.5D region candidates, substantially strengthening the proposal step.
- Geocentric Encoding: Depth images are encoded into three channels: horizontal disparity, height above ground, and angle with gravity (see the sketch above). These channels expose each pixel's geocentric pose, which, unlike raw depth values, gives the convolutional neural network (CNN) geometric structure it can learn from more efficiently.
- Training and Data Augmentation: A CNN pre-trained on RGB images is fine-tuned on the NYUD2 dataset. The researchers augment the limited NYUD2 training data with synthetic data rendered from 3D scene annotations, showing that such augmentation helps overcome the dataset's small size (a fine-tuning sketch follows this list).
- Instance Segmentation: Building on the object detections, the system generates pixel-level segmentation masks by framing the problem as foreground-background labeling within a decision forest framework (sketched after this list). The raw predictions are then refined with superpixels to produce the final masks.
- Semantic Segmentation: The object detection outputs are fed into an existing superpixel classification framework, improving the labeling of every pixel in the scene.
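For the training component, the paper fine-tuned a Caffe-era CNN; as a rough modern analogue (not the authors' code), the skeleton below fine-tunes an ImageNet-pretrained AlexNet on HHA-encoded crops of region proposals. The class count and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative analogue of the paper's fine-tuning setup: an
# ImageNet-pretrained AlexNet whose final layer is replaced for the
# NYUD2 categories, fed HHA-encoded crops as if they were RGB images.
NUM_CLASSES = 20  # e.g., 19 object categories + background (illustrative)
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(hha_crops, labels):
    """One SGD step on a batch of (N, 3, 224, 224) HHA crops with
    (N,) integer class labels for the corresponding region proposals."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(hha_crops), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Synthetic renderings from the 3D scene annotations would enter this loop simply as additional batches of crops and labels, which is what makes the augmentation strategy cheap to adopt.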
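The instance segmentation step can likewise be sketched with an off-the-shelf random forest standing in for the paper's custom decision forest (which uses specially designed unary and binary tests on shape and geocentric pose features). The per-pixel features here, normalized window position plus the three HHA channels, are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the paper's decision forest: label each pixel in a
# detection window as foreground (1) or background (0).
clf = RandomForestClassifier(n_estimators=50, max_depth=12, random_state=0)

def pixel_features(window_hha):
    """Per-pixel features for one detection window: normalized
    (row, col) position concatenated with the three HHA channels."""
    h, w, _ = window_hha.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys / h, xs / w], axis=-1)
    feats = np.concatenate([pos, window_hha / 255.0], axis=-1)
    return feats.reshape(-1, feats.shape[-1])

def predict_mask(window_hha, threshold=0.5):
    """Predict a binary foreground mask for a detection window."""
    probs = clf.predict_proba(pixel_features(window_hha))[:, 1]
    h, w, _ = window_hha.shape
    return probs.reshape(h, w) > threshold

# Illustrative training on one synthetic window; real training pools
# pixels from many ground-truth detection windows and instance masks.
rng = np.random.default_rng(0)
window = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
mask = np.zeros((32, 32), dtype=int)
mask[8:24, 8:24] = 1
clf.fit(pixel_features(window), mask.ravel())
print(predict_mask(window).shape)  # (32, 32)
```

The paper then snaps the per-pixel predictions to superpixels to produce the final mask; the same refinement could be layered on top of this sketch.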
Experimental Results
The evaluation on the NYUD2 dataset showed marked improvements over baseline methods:
- Object Detection: The proposed system achieved an average precision (AP) of 37.3%, a 56% relative improvement over the prior state of the art (implying a prior best of roughly 37.3 / 1.56 ≈ 23.9% AP). This gain was attributed to the robust feature representation learned from the geocentric embedding.
- Instance Segmentation: The new instance segmentation capability achieved a region detection average precision (region AP) of 32.1%, outperforming baselines that used either rectangular bounding boxes or simplistic foreground masks. The method's strength lay in correcting the spatial mis-localizations common in bounding-box approaches.
- Semantic Segmentation: A frequency-weighted average accuracy of 47% was obtained, along with a 24% relative improvement in mean accuracy for the object categories that benefited from the enhanced detection features.
Implications and Future Directions
The results from this paper have significant implications both practically and theoretically for fields relying on computer vision, such as robotics and autonomous systems. Practically, the refined feature representations from RGB-D data can lead to better robotic perception in environments where understanding precise object dimensions and orientations is crucial, such as in grasping and manipulation tasks.
Theoretically, the notion of geocentric encoding opens new avenues for research into more sophisticated methods of integrating geometric properties into visual learning frameworks. Future developments could explore incorporating more complex scene geometries into learning algorithms, refining synthetic data generation techniques to produce more realistic training datasets, and further enhancing multi-modal data fusion in CNN architectures.
Overall, this paper presents a comprehensive and well-structured approach to enhancing RGB-D object detection and segmentation, setting a benchmark for future research in the field of 3D vision tasks.