
FoundPose: Unseen Object Pose Estimation with Foundation Features (2311.18809v2)

Published 30 Nov 2023 in cs.CV and cs.RO

Abstract: We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into the bag-of-words representation and can promptly propose a handful of similarly looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets and can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Project page: evinpinar.github.io/foundpose.

Authors (7)
  1. Evin Pınar Örnek
  2. Yann Labbé
  3. Bugra Tekin
  4. Lingni Ma
  5. Cem Keskin
  6. Christian Forster
  7. Tomas Hodan

Summary

Introduction to FoundPose

Spatial AI hinges on perceiving the environment, and a central part of that perception is 6D pose estimation: recovering the 3D position and orientation of objects. This capability is critical for applications such as robotic manipulation and mixed reality. Traditional methods require ample object-specific training data, which limits their ability to recognize and interact with unseen objects. FoundPose addresses this need for versatility: it estimates the pose of an unseen object from a single RGB image, without any object- or task-specific training.

Core Methodology of FoundPose

FoundPose builds on DINOv2, a vision foundation model trained in a self-supervised manner and known for strong generalization. Given only the 3D model of an unseen object, FoundPose renders templates of the object from sampled viewpoints; these templates form the basis for pose estimation. At test time, given an image crop containing the object, a bag-of-words retrieval scheme built from DINOv2 patch descriptors quickly proposes a small subset of templates that closely resemble the observed object.
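
The retrieval step can be sketched as follows. This is a minimal illustration of the bag-of-words mechanics only: random vectors stand in for DINOv2 patch descriptors, and the codebook would in practice come from clustering (e.g. k-means) over template descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)

def bow_histogram(descriptors, codebook):
    # Assign each patch descriptor to its nearest visual word (hard assignment)
    # and return an L2-normalized word histogram.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Stand-ins: 128-D patch descriptors (real DINOv2 descriptors are higher-dimensional).
codebook = rng.normal(size=(64, 128))                           # visual words
templates = [rng.normal(size=(100, 128)) for _ in range(200)]   # 200 rendered templates
template_bows = np.stack([bow_histogram(t, codebook) for t in templates])

# A query whose descriptors are a noisy version of template 42.
query = templates[42] + 0.02 * rng.normal(size=(100, 128))
scores = template_bows @ bow_histogram(query, codebook)          # cosine similarity
top5 = np.argsort(scores)[::-1][:5]
print(top5[0])
```

Because the normalized histograms are compact, scoring a query against all stored template histograms is a single matrix-vector product, which is what makes the retrieval fast enough to avoid matching against every template.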

Next, for each retrieved template, FoundPose establishes correspondences between 2D patch features observed in the image and their 3D counterparts on the object model, which are known for each template patch from rendering. These 2D-3D correspondences yield pose hypotheses that are further optimized through two refinement stages:

  1. Featuremetric refinement, which iteratively adjusts the pose to align image and template features, compensating for the coarse patch sampling.
  2. Optional MegaPose refinement, a render-and-compare stage that further improves the precision of the initially coarse pose estimates.
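
The correspondence step above can be sketched as follows, again with random stand-ins for the DINOv2 patch descriptors; in the actual method, the resulting 2D-3D pairs are fed to a PnP-RANSAC solver to produce pose hypotheses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for patch descriptors from an intermediate DINOv2 layer.
# The first 30 template patches are exact copies of query patches, so they
# should be recovered as matches; each template patch also carries the 3D
# model point it was rendered from.
query_desc = rng.normal(size=(50, 128))
tmpl_desc = np.concatenate([query_desc[:30], rng.normal(size=(70, 128))])
tmpl_points3d = rng.normal(size=(100, 3))

def knn_match(q, t, ratio=0.9):
    # Match each query patch to its nearest template patch, keeping only
    # matches that pass a ratio test (best distance clearly below second best).
    d2 = ((q[:, None, :] - t[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(q))
    keep = d2[rows, best] < ratio * d2[rows, second]
    return np.stack([rows[keep], best[keep]], axis=1)

matches = knn_match(query_desc, tmpl_desc)
# Each match pairs a 2D image patch with a 3D model point; a PnP-RANSAC
# solver (e.g. OpenCV's solvePnPRansac) would turn these into a pose hypothesis.
points3d_for_pnp = tmpl_points3d[matches[:, 1]]
print(len(matches))
```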

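The featuremetric refinement idea can be illustrated with a toy one-dimensional analogue: recover a continuous shift between two smooth "feature maps" by Gauss-Newton minimization of the feature residual. FoundPose applies the same principle to dense DINOv2 feature maps, optimizing the 6D pose rather than a single scalar shift.

```python
import numpy as np

# Toy featuremetric alignment: find the shift t minimizing ||f(x + t) - g(x)||^2.
x = np.linspace(-5, 5, 201)
f = lambda t: np.exp(-(x + t) ** 2)   # shiftable "feature map" (e.g. rendered template)
g = np.exp(-(x - 0.7) ** 2)           # observed "feature map"; true solution is t = -0.7

t = 0.0
for _ in range(20):
    r = f(t) - g                                   # feature residual
    J = -2.0 * (x + t) * np.exp(-(x + t) ** 2)     # Jacobian d f / d t
    t -= (J @ r) / (J @ J)                         # Gauss-Newton update
print(round(t, 3))                                 # converges to -0.7
```

Because the features vary smoothly, the residual is differentiable in the pose parameters, which is what allows sub-patch alignment even though the descriptors were sampled on a coarse grid.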
Empirical Validation and Insights

FoundPose demonstrates strong performance on the BOP benchmark, the standard evaluation for 6D object pose estimation spanning seven diverse datasets. It noticeably outperforms competing RGB-based methods for refinement-free pose estimation while remaining computationally efficient, and it handles a broad spectrum of objects, including texture-less and symmetric ones.

Conclusion

FoundPose's results underscore the power of leveraging foundation models, such as DINOv2, in geometric computer vision tasks. It offers an efficient, practical solution for accurate 6D pose estimation of unseen objects: onboarding a new object requires only its 3D model, so the method scales readily to large object sets and suits real-world applications that demand rapid, reliable pose estimation. FoundPose thus represents a significant step forward in model-based pose estimation.
