Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge (1609.09475v3)

Published 29 Sep 2016 in cs.CV, cs.LG, and cs.RO

Abstract: Robot warehouse automation has attracted significant interest in recent years, perhaps most visibly in the Amazon Picking Challenge (APC). A fully autonomous warehouse pick-and-place system requires robust vision that reliably recognizes and locates objects amid cluttered environments, self-occlusions, sensor noise, and a large variety of objects. In this paper we present an approach that leverages multi-view RGB-D data and self-supervised, data-driven learning to overcome those difficulties. The approach was part of the MIT-Princeton Team system that took 3rd and 4th place in the stowing and picking tasks, respectively, at APC 2016. In the proposed approach, we segment and label multiple views of a scene with a fully convolutional neural network, and then fit pre-scanned 3D object models to the resulting segmentation to get the 6D object pose. Training a deep neural network for segmentation typically requires a large amount of training data. We propose a self-supervised method to generate a large labeled dataset without tedious manual segmentation. We demonstrate that our system can reliably estimate the 6D pose of objects under a variety of scenarios. All code, data, and benchmarks are available at http://apc.cs.princeton.edu/

Authors (7)
  1. Andy Zeng (54 papers)
  2. Kuan-Ting Yu (8 papers)
  3. Shuran Song (110 papers)
  4. Daniel Suo (11 papers)
  5. Ed Walker Jr. (1 paper)
  6. Alberto Rodriguez (79 papers)
  7. Jianxiong Xiao (14 papers)
Citations (446)

Summary

  • The paper introduces a multi-view self-supervised approach that enhances 6D pose estimation for robotic picking in cluttered warehouse environments.
  • The method combines FCN-based segmentation with pre-scanned 3D model fitting and iterative closest point algorithms for robust performance.
  • Evaluations on over 7,000 benchmark images and 130,000 self-labeled samples demonstrate significant accuracy improvements in the Amazon Picking Challenge.

Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge

The paper "Multi-view Self-supervised Deep Learning for 6D Pose Estimation" presents a sophisticated approach to the challenge of robotic vision in warehouse automation. The work is positioned within the context of the Amazon Picking Challenge (APC), addressing the need for an autonomous system capable of accurately identifying and locating objects amidst complex visual environments characterized by clutter, occlusions, and sensor noise.

Core Contributions

This research leverages multi-view RGB-D data and self-supervised deep learning to improve the accuracy and reliability of 6D pose estimation. The proposed system, part of the MIT-Princeton team's entry, placed 3rd in the stowing task and 4th in the picking task at APC 2016, underscoring the efficacy of the approach.

Methodology

The core methodology segments and labels multi-view images with a fully convolutional neural network (FCN), then fits pre-scanned 3D object models to the resulting segmentation to recover the 6D object pose.
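
To make the two-stage structure concrete, here is a minimal sketch in Python/NumPy. `fcn_predict` and `fit_model` are hypothetical stand-ins for the segmentation network and the model-fitting step, not the authors' actual API; only the back-projection and multi-view fusion are spelled out.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) into camera-frame 3D points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def estimate_pose(views, fcn_predict, fit_model):
    """Two-stage sketch: per-view FCN segmentation, then rigid-model
    fitting on the fused world-frame object cloud. `fcn_predict` and
    `fit_model` are hypothetical stand-ins, not the paper's API."""
    fused = []
    for rgb, depth, K, world_T_cam in views:
        mask = fcn_predict(rgb)                    # boolean object mask per pixel
        pts_cam = backproject(depth, K)[mask]      # (N, 3) segmented points
        pts_h = np.c_[pts_cam, np.ones(len(pts_cam))]
        fused.append((world_T_cam @ pts_h.T).T[:, :3])
    cloud = np.concatenate(fused, axis=0)          # merged multi-view cloud
    return fit_model(cloud)                        # 4x4 object pose in world frame
```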

Key aspects of the method include:

  • Multi-view Approach: The system captures images from multiple viewpoints, mitigating issues of occlusion and clutter prevalent in single-view systems. This multi-view strategy is crucial for handling scenarios where objects are partially visible.
  • Self-supervised Training: A novel aspect of this research is a self-supervised training mechanism that automatically generates a large labeled dataset, circumventing the labor-intensive process of manual labeling and yielding over 130,000 images with pixel-level labels (a minimal sketch of such automatic labeling follows this list).
  • Robust Pose Estimation: Iterative closest point (ICP) algorithms, enhanced by a coarse-to-fine approach and careful initialization, yield robust pose estimation even in challenging conditions (see the ICP sketch after this list).
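
The self-supervised labeling idea can be sketched as follows: if each training scene is known to contain a single object in front of a pre-captured empty background, a pixel-level label map can be generated automatically by background subtraction over color and depth. The helper and thresholds below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def auto_label(rgb, depth, bg_rgb, bg_depth, obj_id,
               color_thresh=30.0, depth_thresh=0.02):
    """Produce a pixel-level label map for a scene known to contain
    exactly one object of class `obj_id` in front of a pre-captured
    empty background. Thresholds are illustrative, not from the paper."""
    color_diff = np.linalg.norm(rgb.astype(float) - bg_rgb.astype(float), axis=-1)
    depth_diff = np.abs(depth - bg_depth)
    foreground = (color_diff > color_thresh) | (depth_diff > depth_thresh)
    labels = np.zeros(depth.shape, dtype=np.int32)  # 0 = background
    labels[foreground] = obj_id                     # single known class
    return labels
```

Because the object identity and rough placement are controlled by the capture setup, no human annotation is needed, which is what makes collecting 130,000+ labeled images tractable.

For the pose-refinement step, the sketch below shows a generic point-to-point ICP in which a shrinking correspondence cutoff stands in for a coarse-to-fine schedule. It omits the paper's specialized initialization and outlier handling; `cutoffs` and `iters_per_level` are illustrative parameters.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_coarse_to_fine(model_pts, scene_pts, T_init=np.eye(4),
                       cutoffs=(0.05, 0.02, 0.01), iters_per_level=20):
    """Generic point-to-point ICP; the shrinking cutoff approximates
    coarse-to-fine refinement. Not the paper's exact variant."""
    T = T_init.copy()
    tree = cKDTree(scene_pts)
    for cutoff in cutoffs:                        # coarse -> fine
        for _ in range(iters_per_level):
            src = (T[:3, :3] @ model_pts.T).T + T[:3, 3]
            dist, idx = tree.query(src)
            keep = dist < cutoff                  # reject far correspondences
            if keep.sum() < 3:
                break
            p, q = src[keep], scene_pts[idx[keep]]
            p0, q0 = p - p.mean(0), q - q.mean(0)
            U, _, Vt = np.linalg.svd(p0.T @ q0)   # Kabsch alignment
            R = Vt.T @ U.T
            if np.linalg.det(R) < 0:              # avoid reflections
                Vt[-1] *= -1
                R = Vt.T @ U.T
            t = q.mean(0) - R @ p.mean(0)
            dT = np.eye(4)
            dT[:3, :3], dT[:3, 3] = R, t
            T = dT @ T                            # compose incremental update
    return T
```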

Numerical Evaluation

The system is evaluated on a benchmark dataset of over 7,000 images from 477 scenes, covering challenges such as varied lighting conditions, partial views, and reflective surfaces. It predicts translations and rotations within acceptable error bounds across these conditions, reinforcing the robustness of the multi-view approach.
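
Checking whether a predicted pose falls within such bounds typically reduces to a translation distance and a geodesic rotation angle between two 4x4 rigid transforms, as in the short sketch below; the thresholds in the comment are assumptions for illustration, not the benchmark's published bounds.

```python
import numpy as np

def pose_errors(T_pred, T_gt):
    """Translation error (Euclidean, meters) and rotation error
    (geodesic angle, degrees) between two 4x4 rigid transforms."""
    t_err = np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3])
    R_rel = T_pred[:3, :3].T @ T_gt[:3, :3]
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.degrees(np.arccos(cos))

# Illustrative accuracy check (thresholds are assumptions):
# t_err, r_err = pose_errors(T_pred, T_gt)
# correct = (t_err < 0.05) and (r_err < 15.0)
```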

Implications and Future Directions

The implications of this research are significant, both practically and theoretically. Practically, the enhanced vision system offers a pathway to more reliable automation in warehouses, potentially improving efficiency in inventory management and order fulfillment. Theoretically, the work highlights the advantages of integrating deep learning with classical vision techniques and points to future research directions involving larger, more diverse datasets and advanced learning paradigms.

Future developments could explore:

  • Enhanced Data Augmentation Techniques: Further expansion of the training dataset with more complex object categories and real-world conditions could refine the system's capabilities.
  • Integration with Advanced Manipulation Strategies: Extending the system to support different manipulation strategies, such as suction and grasping, could make it more versatile in real-world applications.

Overall, this paper provides valuable insights into the design of vision systems for robotic picking, demonstrating how leveraging constraints inherent in the task environment can lead to significant enhancements in operation robustness and accuracy.