Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild
The paper "Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild" presents a novel approach to 6D object pose estimation, leveraging self-supervised learning over large-scale real-world video datasets. This work addresses a crucial challenge in computer vision and robotics by focusing on category-level pose estimation, which requires generalizing beyond seen instances, and operates without reliance on costly human annotations or simulation datasets.
Contributions and Methodology
The authors propose a framework that reconstructs a canonical 3D shape for an object category and learns dense correspondences between 2D image pixels and that canonical shape. To facilitate this dense geometric correspondence learning, they introduce a Categorical Surface Embedding (CSE) representation, which maps pixel features in image space to vertex features on the canonical object mesh.
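To make the mapping concrete, the following is a minimal sketch of how soft pixel-to-vertex correspondences can be computed from such embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions rather than the paper's actual implementation.

```python
import torch.nn.functional as F

def pixel_to_vertex_correspondence(pixel_feats, vertex_feats, temperature=0.07):
    """Soft dense correspondence from image pixels to canonical mesh vertices.

    pixel_feats:  (N, D) per-pixel embeddings from an image encoder (assumed).
    vertex_feats: (V, D) learned embeddings of the canonical mesh vertices.
    Returns an (N, V) row-stochastic matrix: each pixel's distribution
    over the mesh vertices.
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    vertex_feats = F.normalize(vertex_feats, dim=-1)
    sim = pixel_feats @ vertex_feats.T / temperature  # scaled cosine similarities
    return F.softmax(sim, dim=-1)

# A pixel's canonical 3D location is then the expectation over vertex
# coordinates, e.g.:
#   corr = pixel_to_vertex_correspondence(pix_feats, vert_feats)  # (N, V)
#   canonical_xyz = corr @ mesh_vertices                          # (N, 3)
```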
Two cycle-consistency losses are central to training within this framework. Instance cycle-consistency operates on a single object: a pixel is mapped through the 2D-3D correspondence onto the canonical mesh and then reprojected into the image with the estimated pose, and the loss penalizes any deviation from the original pixel location (a sketch follows below). Cross-instance cycle-consistency applies across different objects within the same category, further strengthening correspondence learning by aligning semantic features across instances and, via video data, across time.
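The instance cycle could look roughly like the following, reusing the hypothetical pixel_to_vertex_correspondence helper from the previous sketch; the pose and camera interfaces are likewise assumptions made for illustration, not the paper's exact formulation.

```python
def instance_cycle_loss(pixel_feats, vertex_feats, mesh_vertices,
                        pixel_coords, rotation, translation, K):
    """Instance cycle-consistency (illustrative): a pixel mapped onto the
    canonical mesh and reprojected with the estimated pose should land
    back on itself.

    pixel_coords: (N, 2) image locations of the sampled pixels.
    rotation (3, 3), translation (3,), K (3, 3): pose and intrinsics,
    assumed predicted or given elsewhere in the pipeline.
    """
    corr = pixel_to_vertex_correspondence(pixel_feats, vertex_feats)  # (N, V)
    canonical_xyz = corr @ mesh_vertices                              # (N, 3)
    cam_xyz = canonical_xyz @ rotation.T + translation                # canonical -> camera frame
    proj = cam_xyz @ K.T                                              # pinhole projection
    reproj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)               # perspective divide
    return (reproj - pixel_coords).norm(dim=-1).mean()                # mean reprojection error
```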
Experimental Results
The methodology was validated on the Wild6D and REAL275 datasets. Notably, the self-supervised method outperformed several state-of-the-art supervised and semi-supervised approaches on the standard metrics: mAP at different thresholds of the 3D Intersection over Union (IoU), and degree-centimeter pose accuracy. On Wild6D, the proposed method achieved results such as a 92.3% mAP under the 3D IoU metric, and it surpassed existing methods on the combined 5-degree and 10-degree rotation-plus-distance thresholds (the criterion is sketched below).
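For reference, the degree-centimeter criterion counts a prediction as correct only when both the rotation and translation errors fall below their thresholds. A small sketch of that check (the helper names are my own, and symmetry handling is omitted):

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error in degrees and translation error in centimeters."""
    # Geodesic rotation distance: angle of the relative rotation R_pred @ R_gt^T.
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err_cm = 100.0 * np.linalg.norm(t_pred - t_gt)  # meters -> centimeters
    return rot_err_deg, trans_err_cm

def pose_is_correct(R_pred, t_pred, R_gt, t_gt, deg_thresh=5.0, cm_thresh=5.0):
    """The n-degree / m-cm criterion, here with 5 deg / 5 cm defaults.

    Symmetric categories are normally evaluated by minimizing the error
    over the object's symmetry transforms, which is omitted here.
    """
    rot_err, trans_err = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return rot_err <= deg_thresh and trans_err <= cm_thresh
```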
The paper extends beyond pose estimation, applying its CSE representation to the keypoint transfer task on the CUB-200-2011 dataset. This task exemplifies the versatility of the learned correspondences, showing significant improvements over existing methods with a PCK (Percentage of Correct Keypoints) of 64.5%; a sketch of how such transfer follows from the shared canonical surface is given below.
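Keypoint transfer falls out of the shared canonical surface naturally: a source keypoint is matched to its closest canonical vertex, which is then located in the target image. The sketch below assumes the same embedding interface as the earlier snippets and handles a single keypoint for clarity.

```python
import torch.nn.functional as F

def transfer_keypoint(src_kp_feat, tgt_pixel_feats, tgt_pixel_coords, vertex_feats):
    """Transfer one keypoint from a source to a target image via the
    canonical mesh (illustrative nearest-neighbor version).

    src_kp_feat:      (D,)   embedding at the source keypoint.
    tgt_pixel_feats:  (N, D) embeddings of the target image pixels.
    tgt_pixel_coords: (N, 2) their image locations.
    vertex_feats:     (V, D) canonical vertex embeddings.
    """
    src_kp_feat = F.normalize(src_kp_feat, dim=-1)
    vertex_feats = F.normalize(vertex_feats, dim=-1)
    tgt_pixel_feats = F.normalize(tgt_pixel_feats, dim=-1)
    v = (vertex_feats @ src_kp_feat).argmax()         # source keypoint -> canonical vertex
    p = (tgt_pixel_feats @ vertex_feats[v]).argmax()  # canonical vertex -> target pixel
    return tgt_pixel_coords[p]
```

PCK then reports the fraction of transferred keypoints that land within a normalized distance threshold of the ground-truth locations.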
Implications and Future Directions
This research advances the prospects for scalable, robust category-level 6D pose estimation in unstructured environments: by eliminating the dependency on labeled datasets for model training, it greatly reduces setup costs in practice.
Looking ahead, this work could inspire new unsupervised and self-supervised approaches to object pose estimation and broader visual recognition tasks. Further advancements could include expanding the framework's capability to handle more complex scenes with multiple interacting objects, or integrating additional modalities to enhance robustness and accuracy in real-world applications. Extending the approach to dynamic scenes involving articulated objects could likewise open new avenues in robotics and augmented reality.
The results of the paper demonstrate that exploiting video data and geometric consistency can yield performance equal to or exceeding that of more traditional, annotation-heavy approaches, pointing to a promising future for self-supervised learning methodologies in visual perception.