
FoundPose: Unseen Object Pose Estimation with Foundation Features (2311.18809v2)

Published 30 Nov 2023 in cs.CV and cs.RO

Abstract: We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into the bag-of-words representation and can promptly propose a handful of similarly looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets and can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Project page: evinpinar.github.io/foundpose.

Authors (7)
  1. Evin Pınar Örnek
  2. Yann Labbé
  3. Bugra Tekin
  4. Lingni Ma
  5. Cem Keskin
  6. Christian Forster
  7. Tomas Hodan

Summary

Introduction to FoundPose

Spatial AI hinges on perceiving the environment, and a central part of that perception is 6D pose estimation: recovering the 3D position and orientation of objects. This capability is critical for applications such as robotic manipulation and mixed reality. Traditional methods require ample object-specific training data, which limits their ability to recognize and interact with unseen objects. FoundPose addresses this need for versatility: it estimates the pose of an unseen object from a single RGB image, without any object- or task-specific training.

Core Methodology of FoundPose

FoundPose builds on DINOv2, a vision foundation model trained in a self-supervised manner and known for strong generalization. Given only the 3D model of an unseen object, FoundPose renders templates of the object from sampled viewpoints; these templates form the basis for pose estimation. At test time, given an image crop containing the object, a bag-of-words retrieval scheme built from DINOv2 patch descriptors quickly proposes a small subset of templates that closely resemble the observed object.
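
The retrieval step can be sketched as follows. This is a minimal illustration of the bag-of-words mechanics only: random vectors stand in for DINOv2 patch descriptors, and the codebook would in practice come from clustering (e.g. k-means) over template descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)

def bow_histogram(descriptors, codebook):
    # Assign each patch descriptor to its nearest visual word (hard assignment)
    # and return an L2-normalized word histogram.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Stand-ins: 128-D patch descriptors (real DINOv2 descriptors are higher-dimensional).
codebook = rng.normal(size=(64, 128))                           # visual words
templates = [rng.normal(size=(100, 128)) for _ in range(200)]   # 200 rendered templates
template_bows = np.stack([bow_histogram(t, codebook) for t in templates])

# A query whose descriptors are a noisy version of template 42.
query = templates[42] + 0.02 * rng.normal(size=(100, 128))
scores = template_bows @ bow_histogram(query, codebook)          # cosine similarity
top5 = np.argsort(scores)[::-1][:5]
print(top5[0])
```

Because the normalized histograms are compact, scoring a query against all stored template histograms is a single matrix-vector product, which is what makes the retrieval fast enough to avoid matching against every template.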

Next, for each retrieved template, FoundPose establishes correspondences between 2D patch features observed in the image and their 3D counterparts on the object model, which are known for each template patch from rendering. These 2D-3D correspondences yield pose hypotheses that are further optimized through two refinement stages:

  1. Featuremetric refinement, which iteratively adjusts the pose to align image and template features, compensating for the coarse patch sampling.
  2. Optional MegaPose refinement, a render-and-compare stage that further improves the precision of the initially coarse pose estimates.
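
The correspondence step above can be sketched as follows, again with random stand-ins for the DINOv2 patch descriptors; in the actual method, the resulting 2D-3D pairs are fed to a PnP-RANSAC solver to produce pose hypotheses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for patch descriptors from an intermediate DINOv2 layer.
# The first 30 template patches are exact copies of query patches, so they
# should be recovered as matches; each template patch also carries the 3D
# model point it was rendered from.
query_desc = rng.normal(size=(50, 128))
tmpl_desc = np.concatenate([query_desc[:30], rng.normal(size=(70, 128))])
tmpl_points3d = rng.normal(size=(100, 3))

def knn_match(q, t, ratio=0.9):
    # Match each query patch to its nearest template patch, keeping only
    # matches that pass a ratio test (best distance clearly below second best).
    d2 = ((q[:, None, :] - t[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(q))
    keep = d2[rows, best] < ratio * d2[rows, second]
    return np.stack([rows[keep], best[keep]], axis=1)

matches = knn_match(query_desc, tmpl_desc)
# Each match pairs a 2D image patch with a 3D model point; a PnP-RANSAC
# solver (e.g. OpenCV's solvePnPRansac) would turn these into a pose hypothesis.
points3d_for_pnp = tmpl_points3d[matches[:, 1]]
print(len(matches))
```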

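The featuremetric refinement idea can be illustrated with a toy one-dimensional analogue: recover a continuous shift between two smooth "feature maps" by Gauss-Newton minimization of the feature residual. FoundPose applies the same principle to dense DINOv2 feature maps, optimizing the 6D pose rather than a single scalar shift.

```python
import numpy as np

# Toy featuremetric alignment: find the shift t minimizing ||f(x + t) - g(x)||^2.
x = np.linspace(-5, 5, 201)
f = lambda t: np.exp(-(x + t) ** 2)   # shiftable "feature map" (e.g. rendered template)
g = np.exp(-(x - 0.7) ** 2)           # observed "feature map"; true solution is t = -0.7

t = 0.0
for _ in range(20):
    r = f(t) - g                                   # feature residual
    J = -2.0 * (x + t) * np.exp(-(x + t) ** 2)     # Jacobian d f / d t
    t -= (J @ r) / (J @ J)                         # Gauss-Newton update
print(round(t, 3))                                 # converges to -0.7
```

Because the features vary smoothly, the residual is differentiable in the pose parameters, which is what allows sub-patch alignment even though the descriptors were sampled on a coarse grid.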
Empirical Validation and Insights

FoundPose demonstrates strong performance on the BOP benchmark, the standard evaluation for 6D object pose estimation spanning seven diverse datasets. It noticeably outperforms competing RGB-based methods for refinement-free pose estimation while remaining computationally efficient, and it handles a broad spectrum of objects, including texture-less and symmetric ones.

Conclusion

FoundPose's results underscore the power of leveraging foundation models, such as DINOv2, in geometric computer vision tasks. It offers an efficient, practical solution for accurate 6D pose estimation of unseen objects: onboarding a new object requires only its 3D model, so the method scales readily to large object sets and suits real-world applications that demand rapid, reliable pose estimation. FoundPose thus represents a significant step forward in model-based pose estimation.
