
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes (1711.00199v3)

Published 1 Nov 2017 in cs.CV and cs.RO

Abstract: Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.

Authors (4)
  1. Yu Xiang (128 papers)
  2. Tanner Schmidt (9 papers)
  3. Venkatraman Narayanan (8 papers)
  4. Dieter Fox (201 papers)
Citations (1,810)

Summary

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Pose estimation in six degrees of freedom (6D) is a crucial capability for robots interacting with real-world environments. The paper "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes" presents a new approach named PoseCNN, which is designed to estimate the 6D pose of objects even in cluttered and occluded scenes.

Problem Context

Traditional methods for 6D object pose estimation often struggle with texture-less objects and occlusions. These approaches typically match feature points between 3D models and 2D images, which breaks down when objects lack rich textures or are partially occluded. Recent work has shifted towards deep learning on RGB-D data to address some of these challenges, but handling symmetries and occlusions remains difficult.

PoseCNN Approach

PoseCNN introduces a Convolutional Neural Network (CNN) architecture that decouples the estimation of 3D translation from 3D rotation, reflecting that translation depends on an object's image location and depth while rotation depends on its appearance. PoseCNN consists of a shared feature-extraction stage of convolutional layers followed by task-specific branches for semantic labeling, 3D translation estimation, and 3D rotation regression.

  1. Semantic Labeling:
    • Uses pixel-wise labeling to classify each pixel into an object class.
    • More informative than bounding box-based detection, especially in handling occlusions.
  2. 3D Translation Estimation:
    • Localizes the 2D center of the object in the image and predicts its distance from the camera.
    • Utilizes a Hough voting layer that aggregates votes from pixels towards the estimated center, providing robustness against occlusions.
  3. 3D Rotation Estimation:
    • Predicts the rotation by regressing to a quaternion representation.
    • Introduces two loss functions for training: PoseLoss (PLoss) and ShapeMatch-Loss (SLoss).
      • PLoss measures the average squared distance between corresponding points on the 3D model transformed by the estimated and ground-truth poses.
      • SLoss instead matches each transformed point to the closest point under the ground-truth pose, so all equivalent rotations of a symmetric object incur the same loss, effectively incorporating symmetry handling into training.
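The translation estimate in step 2 can be made concrete with a pinhole camera model: given the voted 2D object center and the predicted depth, the full 3D translation follows by back-projection through the camera intrinsics. The sketch below is illustrative; the intrinsics values are placeholders, not taken from the paper.

```python
import numpy as np

def backproject_center(cx, cy, Tz, K):
    """Recover the 3D translation from a 2D object center (cx, cy)
    and a predicted depth Tz, using the camera intrinsics matrix K."""
    Tx = (cx - K[0, 2]) * Tz / K[0, 0]  # shift by principal point, scale by fx
    Ty = (cy - K[1, 2]) * Tz / K[1, 1]  # shift by principal point, scale by fy
    return np.array([Tx, Ty, Tz])

# Illustrative intrinsics (fx, fy and principal point are assumptions)
K = np.array([[1066.8,    0.0, 312.99],
              [   0.0, 1067.5, 241.31],
              [   0.0,    0.0,   1.0]])

T = backproject_center(320.0, 240.0, 1.0, K)  # 3D translation in meters
```

Because the center is found by Hough voting over pixel-wise direction predictions, it can be localized even when the center itself is occluded, which is what makes this decomposition robust in cluttered scenes.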

YCB-Video Dataset

The paper also contributes a new dataset for 6D object pose estimation: the YCB-Video dataset. It covers 21 objects from the YCB object set, observed in 92 videos totaling 133,827 frames, each annotated with accurate 6D poses. This large-scale dataset addresses the limitations of previous datasets, and training is further augmented with synthetic images alongside the real frames.

Experimental Results

Evaluation Metrics

The evaluation uses the average distance metric (ADD) and its symmetric variant (ADD-S), which matches each model point to its closest counterpart rather than its exact one. This distinction is crucial for symmetric objects, where multiple rotations produce identical visual appearances.
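A minimal sketch of the two metrics, assuming an (N, 3) array of model points and rigid poses (R, t); the toy object and poses below are illustrative, not from the paper's benchmark:

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R, t) to an (N, 3) point set."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points
    under the ground-truth and estimated poses."""
    p_gt = transform(points, R_gt, t_gt)
    p_est = transform(points, R_est, t_est)
    return np.linalg.norm(p_gt - p_est, axis=1).mean()

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each estimated point to its
    closest ground-truth point, so symmetric poses score equally."""
    p_gt = transform(points, R_gt, t_gt)
    p_est = transform(points, R_est, t_est)
    dists = np.linalg.norm(p_est[:, None, :] - p_gt[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Toy object with 180-degree symmetry about the z-axis
points = np.array([[1.0, 0, 0], [-1.0, 0, 0], [0, 1.0, 0], [0, -1.0, 0]])
R_gt, t = np.eye(3), np.zeros(3)
R_flip = np.diag([-1.0, -1.0, 1.0])  # 180-degree rotation about z

add = add_metric(points, R_gt, t, R_flip, t)      # penalizes the flip
adds = add_s_metric(points, R_gt, t, R_flip, t)   # treats the flip as exact
```

For the symmetric toy object, the 180-degree flip is visually indistinguishable from the ground truth: ADD reports a large error because corresponding points are compared, while ADD-S reports zero because closest-point matching accepts any equivalent rotation. The same closest-point idea underlies the ShapeMatch-Loss used during training.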

Key Findings

  1. Performance on YCB-Video Dataset:
    • PoseCNN demonstrates robustness in occluded and cluttered scenes.
    • It significantly outperforms a 3D object coordinate regression network, especially on symmetric objects, thanks to the SLoss function.
    • The integration of Iterative Closest Point (ICP) for depth-based refinement further enhances the accuracy, showcasing the strength of PoseCNN in providing reliable initial estimates for refinement.
  2. Performance on OccludedLINEMOD Dataset:
    • Achieves state-of-the-art results on this challenging benchmark.
    • Demonstrates effectiveness even when using only color images, with substantial improvements when integrating depth data and ICP refinement.

Implications and Future Directions

PoseCNN sets a new benchmark for 6D object pose estimation in complex scenes involving clutter and occlusions. The architecture's ability to decouple and efficiently handle translation and rotation while incorporating symmetry through SLoss provides significant advancements over previous methods. The availability of the YCB-Video dataset further facilitates robust training and evaluation, promoting future research in this area.

Future research directions could explore more sophisticated methods to handle symmetries, potentially integrating advanced geometric reasoning directly into the learning framework. Additionally, extending PoseCNN to operate in real-time scenarios and across diverse robotic platforms could greatly enhance practical applications in robotics and human-robot interaction.

In summary, PoseCNN represents a significant advancement towards accurate and reliable 6D object pose estimation, providing both theoretical contributions and practical tools to drive forward the capabilities of robotic systems in complex environments.