- The paper introduces a novel Augmented Autoencoder that implicitly learns 3D orientations from synthetic data for efficient, annotation-free 6D object detection.
- It demonstrates state-of-the-art performance on the T-LESS dataset and competitive results on LineMOD, enhancing robustness against occlusion and sensor variation.
- The method leverages domain randomization and denoising reconstruction to enable rapid training and strong generalization in diverse real-world scenarios.
Insights into Augmented Autoencoders for 6D Object Detection
This paper presents a novel approach to real-time, RGB-based 6D object detection and pose estimation built around Augmented Autoencoders (AAE): a denoising-autoencoder variant trained with Domain Randomization on simulated views of a 3D model.
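To make the mechanism concrete, here is a minimal sketch of that training setup in PyTorch. The layer sizes, latent dimension, and plain MSE loss are illustrative assumptions (the paper itself uses a bootstrapped per-pixel L2 loss); only the idea of reconstructing the clean rendering from an augmented input comes from the method.

```python
import torch
import torch.nn as nn

class AugmentedAutoencoder(nn.Module):
    """Sketch of an AAE: a convolutional encoder/decoder pair whose latent
    code must discard augmentation effects to reconstruct the clean view."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(  # 3x128x128 input -> latent_dim vector
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256 * 8 * 8),
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 5, 2, 2, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AugmentedAutoencoder()
# Denoising-style target: reconstruct the *unaugmented* rendering from the
# *augmented* one, so pose is the only information the latent code must keep.
augmented = torch.rand(8, 3, 128, 128)  # randomized background/lighting/occlusion
clean = torch.rand(8, 3, 128, 128)      # corresponding clean renderings
recon, z = model(augmented)
loss = nn.functional.mse_loss(recon, clean)
loss.backward()
```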
Key Contributions
The AAE demonstrates several advantages over traditional methods:
- Avoidance of Pose-Annotated Data: The model removes the need for real, pose-annotated training data by learning entirely from synthetic renderings. Rather than explicitly regressing from input images to object poses, it learns an implicit representation of 3D orientations in its latent space.
- Generalization Across Sensors: The method transfers across test sensors and naturally handles object and view symmetries, remaining robust to occlusion, clutter, and changing environments.
- Efficient Training and Representation: Through a domain randomization strategy, the AAE becomes invariant to many discrepancies between real and simulated images, which is pivotal to its generalization (a sketch of such an augmentation pipeline follows this list).
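As noted in the last bullet, domain randomization is what closes the simulation-to-real gap. The sketch below shows a hypothetical augmentation pipeline in that spirit: compositing the rendering onto random real backgrounds, then applying color jitter, blur, and random occlusion. The specific torchvision transforms and parameters are assumptions, not the paper's exact configuration.

```python
import random
import torch
from torchvision import transforms

# Illustrative domain-randomization transforms (parameters are assumptions).
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                      saturation=0.4, hue=0.1)
blur = transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))
occlude = transforms.RandomErasing(p=0.5, scale=(0.02, 0.2))  # stand-in occluder

def augment(rendering: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """rendering: 3xHxW object crop rendered on black; background: 3xHxW crop
    from an arbitrary real image (e.g. sampled from Pascal VOC)."""
    mask = (rendering.sum(dim=0, keepdim=True) > 0).float()
    out = mask * rendering + (1 - mask) * background  # paste onto random scene
    out = color_jitter(out)
    if random.random() < 0.5:
        out = blur(out)
    return occlude(out)
```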
The paper reports the AAE's state-of-the-art performance on the T-LESS dataset and competitive results on the LineMOD dataset:
- T-LESS Dataset: Achieves a recall of 68.57% (at err_vsd < 0.3) from RGB alone and 84.05% with additional depth input, outperforming numerous prior methods.
- LineMOD Dataset: Attains a mean recall of 32.63% in RGB-only mode, improving markedly to 71.58% after ICP refinement.
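For context, err_vsd above refers to the Visible Surface Discrepancy, the standard pose-error metric for T-LESS evaluation (Hodaň et al.). Up to notation, it averages a per-pixel indicator over the visible object surface under the estimated and ground-truth poses:

$$
e_{\mathrm{VSD}} = \operatorname*{avg}_{p \,\in\, \hat{V} \cup \bar{V}}
\begin{cases}
0 & \text{if } p \in \hat{V} \cap \bar{V} \,\wedge\, |\hat{D}(p) - \bar{D}(p)| < \tau \\
1 & \text{otherwise}
\end{cases}
$$

where $\hat{D}, \bar{D}$ are distance maps rendered under the estimated and ground-truth poses, $\hat{V}, \bar{V}$ the corresponding visibility masks, and $\tau$ a misalignment tolerance; a pose counts as correct when $e_{\mathrm{VSD}} < 0.3$.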
Architectural and Implementation Details
The encoder is a convolutional neural network (CNN) trained with geometric and color augmentations that enforce domain invariance. At test time, the model compares the latent code of a detected object crop against a codebook of codes precomputed from synthetic views, so object orientation can be estimated with a fast k-Nearest-Neighbor search.
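A minimal sketch of that codebook lookup follows, assuming a trained encoder that maps image crops to latent vectors; the function names and the toy data are hypothetical. Matching L2-normalized latent codes by cosine similarity is the core of the lookup.

```python
import numpy as np

def build_codebook(encoder, renderings, rotations):
    """renderings: N synthetic views covering SO(3); rotations: Nx3x3 matrices.
    Returns L2-normalized latent codes paired with their known orientations."""
    codes = encoder(renderings)                      # N x latent_dim
    codes = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    return codes, rotations

def estimate_orientation(encoder, crop, codes, rotations, k=1):
    """Encode a detected object crop and return the k codebook rotations
    whose latent codes have the highest cosine similarity."""
    z = encoder(crop[None])[0]
    z = z / np.linalg.norm(z)
    sims = codes @ z                                 # cosine similarities
    nearest = np.argsort(-sims)[:k]
    return rotations[nearest], sims[nearest]

# Toy usage with a stand-in encoder (real use: the trained AAE encoder).
rng = np.random.default_rng(0)
encoder = lambda imgs: rng.standard_normal((len(imgs), 128))
views = np.zeros((1000, 3, 128, 128), dtype=np.float32)
rots = np.tile(np.eye(3), (1000, 1, 1))
codes, rotations = build_codebook(encoder, views, rots)
R_est, scores = estimate_orientation(encoder, views[0], codes, rotations, k=3)
```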
Implications and Future Prospects
The research has direct implications for applications that require fast, annotation-free 3D pose estimation, such as autonomous navigation and augmented reality. The integration of domain randomization and autoencoding into object detection also opens several directions for further exploration:
- Domains Beyond Real-time Pose Estimation: The synthetic-to-real generalization strategy could be extended to other computer vision tasks.
- More Sophisticated Variants: Future work might extend the AAE framework with richer augmentations or deploy it in more complex environments.
In conclusion, the paper makes a significant contribution to 6D object detection by addressing key limitations of existing systems and proposing a scalable, efficient alternative through Augmented Autoencoders. This approach lays the foundation for future enhancements and broader applications in artificial intelligence and real-time vision systems.