- The paper presents a deep learning approach that refines 6D object poses from RGB data alone, employing a novel visual loss function for robust, correspondence-free tracking.
- Evaluations show the method approaches the accuracy of depth-based 3D ICP while remaining robust to occlusion and object symmetry, competing with state-of-the-art RGB and RGB-D approaches.
- An RGB-only approach simplifies setup and lowers hardware costs, with significant implications for AR, robotics, and medical applications that require real-time pose estimation without depth sensors.
Deep Model-Based 6D Pose Refinement in RGB
The paper presents a deep learning approach to refining 6D object poses from RGB data alone. The authors improve pose estimates without relying on depth data, making the method applicable in environments where only color imagery is available. The novelty lies in a correspondence-free, segmentation-free design that remains robust under occlusion, geometric symmetry, and visual ambiguity.
In typical pose refinement scenarios, a rough object pose produced by a detector must be corrected to reach the accuracy needed for tracking. Traditional methods often rely on depth data or intricate correspondence techniques, which introduce dependencies on hand-crafted appearance models. The approach in this paper is different: it employs a deep neural network to predict pose updates entirely from RGB data, eschewing both depth information and explicit appearance modeling. In essence, the current hypothesis is rendered, compared against the observed image, and updated, as sketched below.
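To make the pipeline concrete, here is a minimal sketch of such a render-and-refine loop. The function names (`render`, `refiner`) and the fixed iteration count are illustrative assumptions, not the paper's actual interfaces:

```python
import numpy as np

def apply_update(pose, delta_rot, delta_t):
    """Compose a predicted update with the current 4x4 pose hypothesis."""
    update = np.eye(4)
    update[:3, :3] = delta_rot
    update[:3, 3] = delta_t
    return pose @ update

def refine_pose(image, mesh, pose, K, render, refiner, iters=3):
    """Iterative RGB-only refinement: render the current hypothesis,
    let the CNN compare it with the observed image, and apply the
    predicted rotational/translational update."""
    for _ in range(iters):
        hypothesis = render(mesh, pose, K)               # synthetic view of the guess
        delta_rot, delta_t = refiner(image, hypothesis)  # correspondence-free update
        pose = apply_update(pose, delta_rot, delta_t)
    return pose
```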
Methodology
The core innovation is a novel visual loss function that drives refinement by aligning object contours in the image. A convolutional neural network (CNN) learns to predict translational and rotational updates, and the contour-based loss keeps optimization well-behaved even under visual ambiguity: poses that are visually indistinguishable, as with symmetric objects, incur little or no penalty, so the network is never forced to resolve ambiguities the image cannot disambiguate.
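The exact formulation of the visual loss is specific to the paper, but a simplified stand-in conveys the idea: project contour points of the model under the predicted and ground-truth poses, then penalize the symmetric chamfer distance between the two projected point sets. The `contour_pts` input and the plain chamfer formulation are assumptions for illustration:

```python
import numpy as np

def project(points_3d, pose, K):
    """Project 3D model points (N,3) into the image given a 4x4 pose
    and 3x3 intrinsics K; returns (N,2) pixel coordinates."""
    cam = pose[:3, :3] @ points_3d.T + pose[:3, 3:4]  # to camera frame
    uv = K @ cam
    return (uv[:2] / uv[2]).T

def contour_loss(contour_pts, pred_pose, gt_pose, K):
    """Symmetric chamfer distance between projected contours: a simplified
    stand-in for the paper's visual loss (the exact formulation differs)."""
    p = project(contour_pts, pred_pose, K)
    g = project(contour_pts, gt_pose, K)
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)  # pairwise dists
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```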
Significantly, the network is trained purely on synthetic data, which sidesteps labor-intensive real-world data collection and labeling. The method is tested extensively, showing strong resilience to initialization errors, accuracy close to depth-based 3D ICP, and real-time performance.
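Training pairs for such a refiner can be generated by perturbing ground-truth poses with random noise, so the network learns to undo detector-like errors. Below is a sketch of one such perturbation step; the noise ranges (±15° rotation, ±5 cm translation) are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def random_rotation(max_deg):
    """Small random rotation via axis-angle (Rodrigues' formula)."""
    axis = np.random.randn(3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    skew = np.array([[0, -axis[2], axis[1]],
                     [axis[2], 0, -axis[0]],
                     [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * skew + (1 - np.cos(angle)) * skew @ skew

def perturb_pose(gt_pose, max_deg=15.0, max_t=0.05):
    """Create a noisy initial hypothesis from a ground-truth 4x4 pose,
    mimicking detector error; (noisy, gt) pairs supervise the refiner."""
    noisy = gt_pose.copy()
    noisy[:3, :3] = random_rotation(max_deg) @ gt_pose[:3, :3]
    noisy[:3, 3] += np.random.uniform(-max_t, max_t, size=3)
    return noisy
```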
Evaluations and Results
The paper conducts a series of evaluations across multiple datasets, including Hinterstoisser (LineMOD), Tejani, and Choi. The results indicate robust pose refinement across varied environments. In particular, the method achieves high Visual Surface Similarity (VSS) scores, indicative of strong 2D visual alignment, and performs competitively on the ADD metric, which measures 3D alignment error (computed as shown below). The method matches or outperforms several state-of-the-art RGB and RGB-D approaches, demonstrating its viability in applications lacking depth information.
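For reference, the standard ADD metric (Hinterstoisser et al.) averages the 3D distance between model points transformed by the predicted and ground-truth poses; a pose is commonly accepted when this average falls below 10% of the object diameter. This is the conventional formulation, not code from the paper:

```python
import numpy as np

def add_metric(model_pts, pred_pose, gt_pose, diameter, tau=0.1):
    """ADD: mean distance between model points (N,3) under the predicted
    and ground-truth 4x4 poses; accept if below tau * object diameter."""
    def transform(pose):
        return model_pts @ pose[:3, :3].T + pose[:3, 3]
    dist = np.linalg.norm(transform(pred_pose) - transform(gt_pose), axis=1)
    return dist.mean(), dist.mean() < tau * diameter
```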
The robustness to occlusion and symmetry is showcased in tracking experiments on synthetic sequences. Notably, the paper highlights the potential for category-level generalization, with experiments demonstrating the network's capability to track unseen models within known object classes.
Implications and Future Directions
The implications of this research are substantial, particularly for fields requiring accurate real-time pose estimation without depth data, such as augmented reality, robotics, and certain medical applications. The shift from correspondence- or depth-based methods to a purely RGB-based approach simplifies the setup and removes the cost of depth sensors.
Theoretically, redefining loss functions around visual metrics rather than direct pose-parameter error offers a promising way to handle the ambiguities inherent in 6D pose estimation. Future work could extend the methodology to a wider range of objects, improve generalization across object categories, and integrate the system into broader visual scene-understanding frameworks such as visual odometry.
By demonstrating the efficacy of RGB-only 6D pose refinement, this research stands as a compelling direction for future investigations in the domain of visual object tracking and pose estimation. The open-source release of training data and refinement code further enhances the work's impact, facilitating reproducibility and fostering continued innovation in the field.