Self6D: Self-Supervised Monocular 6D Object Pose Estimation
The paper "Self6D: Self-Supervised Monocular 6D Object Pose Estimation" presents a novel approach to 6D object pose estimation by utilizing self-supervised learning techniques. 6D pose estimation involves determining the three-dimensional position and orientation of objects, which is critical for a myriad of applications in computer vision and robotics. However, the acquisition of annotated training data, especially for real-world applications, introduces significant logistical and technical challenges. To address this, the authors propose a method that leverages neural rendering to enhance a pose estimation model using unannotated RGB-D data, reducing the reliance on annotated datasets.
Methodology
The methodology unfolds in two stages. First, the model is trained in a fully supervised manner on synthetic RGB images, for which ground-truth poses come for free. Networks trained purely on synthetic imagery, however, suffer from the synthetic-to-real domain gap and typically degrade on real data. To bridge this gap without real annotations, the authors introduce a self-supervised second stage built on neural rendering.
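To make the supervised stage concrete, the sketch below shows a generic point-matching pose loss of the kind commonly used for this step. The function and variable names are illustrative assumptions, not code from the paper.

```python
import torch

def point_matching_loss(R_pred, t_pred, R_gt, t_gt, model_points):
    """Average distance between model points under the predicted vs. the
    ground-truth pose.

    R_*: (B, 3, 3) rotations, t_*: (B, 3) translations,
    model_points: (N, 3) points sampled from the object's CAD model.
    """
    # Transform the model points with both poses: x' = R @ x + t.
    pred = model_points @ R_pred.transpose(1, 2) + t_pred[:, None, :]
    gt = model_points @ R_gt.transpose(1, 2) + t_gt[:, None, :]
    # Mean Euclidean distance over points and batch.
    return (pred - gt).norm(dim=-1).mean()
```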
In the second stage, the model is refined on unannotated real RGB-D data. The core innovation lies in using neural rendering for self-supervision: the object is rendered under the predicted pose and compared against the real sensor observation, so that visual and geometric alignment serve as the training signal. Concretely, the rendered appearance, silhouette, and depth are matched against the captured RGB image and depth map, and because the renderer is differentiable, alignment errors can be backpropagated to continually update the pose estimates.
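The following minimal sketch illustrates the render-and-compare idea. The differentiable renderer is abstracted behind a hypothetical `render` function (the paper employs the DIB-R renderer), and the specific loss terms and their weighting are simplified assumptions rather than the authors' exact formulation.

```python
import torch

def self_supervised_loss(render, mesh, R, t, K, sensor_depth, sensor_mask):
    """Compare a rendering under the predicted pose with the RGB-D observation.

    render: hypothetical differentiable renderer -> (depth, mask), each (H, W)
    R, t:   predicted pose (with requires_grad=True so errors refine the pose)
    K:      camera intrinsics; sensor_depth/sensor_mask: real observations
    """
    rendered_depth, rendered_mask = render(mesh, R, t, K)

    # Geometric alignment: depth error where both object masks agree.
    overlap = rendered_mask * sensor_mask
    depth_loss = (overlap * (rendered_depth - sensor_depth).abs()).sum() \
        / overlap.sum().clamp(min=1.0)

    # Visual alignment: silhouette agreement (1 - IoU of the two masks).
    union = (rendered_mask + sensor_mask - overlap).sum().clamp(min=1.0)
    mask_loss = 1.0 - overlap.sum() / union

    # Gradients flow back through the renderer into the pose.
    return depth_loss + mask_loss
```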
Evaluation and Results
The authors conduct comprehensive evaluations on several datasets, including LineMOD, Occluded LineMOD, and HomebrewedDB. The proposed self-supervised methodology demonstrates significant improvements over existing methods that rely solely on synthetic data. In particular, the paper reports clear gains in the Average Recall of the ADD(-S) metric across multiple scenarios and datasets, demonstrating the method's robustness under challenging conditions such as occlusion and clutter.
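For reference, ADD measures the average distance between model points transformed by the predicted and ground-truth poses, while ADD-S uses the closest-point distance for symmetric objects; a pose counts as correct when the value falls below 10% of the object diameter, and Average Recall is the fraction of correct poses. The snippet below implements these standard definitions (it is not code from the paper).

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, pts):
    """ADD: mean distance between corresponding transformed model points."""
    pred = pts @ R_pred.T + t_pred
    gt = pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def adds_metric(R_pred, t_pred, R_gt, t_gt, pts):
    """ADD-S: for symmetric objects, distance to the closest (rather than
    the corresponding) transformed model point."""
    pred = pts @ R_pred.T + t_pred
    gt = pts @ R_gt.T + t_gt
    # Pairwise distances (N, N); nearest predicted point per GT point.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return dists.min(axis=0).mean()

def is_correct(add_value, diameter, threshold=0.1):
    """Standard acceptance test: ADD(-S) below 10% of the object diameter."""
    return add_value < threshold * diameter
```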
For instance, on the LineMOD dataset the proposed method achieves a mean Average Recall of 58.9% without any real pose labels, a significant improvement over earlier synthetically trained approaches such as DPOD and AAE. When compared against state-of-the-art methods that use real annotated data, Self6D narrows the performance gap considerably, suggesting that reliance on labeled datasets can be reduced even further.
Future Work
The paper concludes with a discussion of promising future directions. One primary goal is to eliminate the need for depth information during the self-supervised training phase, which would further reduce the dependence on specific sensors. The authors also anticipate a tighter integration of 2D detections into the self-supervision framework, enabling end-to-end training of a fully differentiable model.
Implications and Contributions
This work significantly advances the field of 6D object pose estimation by demonstrating that self-supervised learning can reduce the reliance on expensive annotated datasets. By employing novel rendering-based self-supervision, the research contributes to both theoretical frameworks and practical applications, enhancing the understanding and development of robust, scalable object pose estimation techniques.
Overall, "Self6D" establishes an important precedent for leveraging unsupervised data in traditionally supervised tasks, highlighting methods that could catalyze further innovations in computer vision and beyond.