FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (2312.08344v2)

Published 13 Dec 2023 in cs.CV, cs.AI, and cs.RO

Abstract: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given, or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by an LLM, a novel transformer-based architecture, and contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/


Summary

  • The paper introduces a unified framework for 6D pose estimation and tracking that works in both the model-based setup (a CAD model is given) and the model-free setup (a few reference images are captured).
  • It combines a novel transformer-based architecture with a contrastive learning formulation and large-scale synthetic training aided by an LLM, and it outperforms specialized methods on multiple public datasets.
  • The approach applies to novel objects at test time without fine-tuning, which makes it broadly useful in robotics and mixed reality.

FoundationPose: Unified 6D Pose Estimation and Tracking

The paper introduces FoundationPose, a unified framework designed for 6D object pose estimation and tracking of novel objects. This model supports both model-based and model-free approaches, offering instant applicability to novel objects without requiring fine-tuning, provided a CAD model exists or a small set of reference images is available.
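
For readers new to the term, a 6D pose is a rigid transform: a 3D rotation plus a 3D translation that maps points from the object's frame into the camera's frame. The sketch below is illustrative only. The function names and the example values are assumptions for this illustration and are not taken from the paper.

```python
import math

def rotation_z(theta):
    """Return the 3x3 rotation matrix for theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def make_pose(R, t):
    """Pack a 3x3 rotation R and a 3-vector translation t into a 4x4 matrix."""
    return [R[0] + [t[0]],
            R[1] + [t[1]],
            R[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def transform_point(T, p):
    """Apply the rigid transform T to a 3D point p (object frame to camera frame)."""
    return [sum(T[i][j] * p[j] for j in range(3)) + T[i][3] for i in range(3)]

# Hypothetical example: an object rotated 90 degrees about z and
# placed 0.5 m in front of the camera.
pose = make_pose(rotation_z(math.pi / 2), [0.0, 0.0, 0.5])
corner_in_camera = transform_point(pose, [0.1, 0.0, 0.0])
```

Pose estimation recovers such a transform from a single observation, while tracking updates it from frame to frame.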

FoundationPose stands out for its strong generalizability, which stems from large-scale synthetic training aided by an LLM, a novel transformer-based architecture, and a contrastive learning formulation. Evaluations on multiple public datasets show that it outperforms existing methods specialized for each task, and that its results are comparable to those of instance-level methods despite its reduced assumptions.

Methodology

FoundationPose supports both model-based and model-free setups. When no CAD model is available, it uses a neural implicit representation for efficient novel view synthesis, so the downstream pose estimation modules stay the same across the two setups. Strong generalizability comes from large-scale synthetic training aided by an LLM and from diversified texture augmentation, supported by a novel transformer-based architecture and a contrastive learning formulation.
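
This summary does not say which contrastive loss the paper uses. As a point of reference only, the sketch below shows a generic InfoNCE-style contrastive loss over similarity scores, where exactly one candidate is the positive. The function name, the temperature value, and the example scores are assumptions, and this should not be read as the paper's formulation.

```python
import math

def info_nce_loss(scores, positive_index, temperature=0.1):
    """Generic InfoNCE loss: negative log-softmax of the positive's score.

    scores: similarity between an anchor and each candidate.
    positive_index: position of the matching (positive) candidate.
    """
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[positive_index]

# Hypothetical scores: the loss is low when the positive scores highest
# and high when a negative outscores it.
good = info_nce_loss([1.0, 0.0, 0.0], positive_index=0)
bad = info_nce_loss([0.0, 1.0, 0.0], positive_index=0)
```

Minimizing a loss of this kind pushes the positive's score above the negatives' scores, which is the general idea behind contrastive training.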

The system is designed for efficient, smooth tracking and uses temporal cues to improve accuracy over video sequences. For novel view synthesis in the model-free setup, it uses an object-centric neural field, which bridges the gap between the two setups.
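
One common way to exploit temporal cues in tracking is to seed each frame's estimate with the previous frame's result and then refine it. The sketch below shows only that control flow. It assumes a translation-only toy pose, and `toy_refine` is a hypothetical stand-in for a learned refinement step, not the paper's module.

```python
def track(observations, initial_pose, refine):
    """Estimate one pose per frame, seeding each frame with the previous estimate."""
    poses = []
    current = initial_pose
    for obs in observations:
        current = refine(current, obs)  # the previous estimate is the temporal cue
        poses.append(current)
    return poses

def toy_refine(prior, observed, step=0.5, iterations=3):
    """Hypothetical refiner: moves the prior a fraction of the way to the observation."""
    estimate = list(prior)
    for _ in range(iterations):
        estimate = [e + step * (o - e) for e, o in zip(estimate, observed)]
    return estimate

# Toy sequence: an object translating slowly along x (positions in metres).
observations = [[0.00, 0.0, 0.5], [0.02, 0.0, 0.5], [0.04, 0.0, 0.5]]
trajectory = track(observations, initial_pose=[0.0, 0.0, 0.5], refine=toy_refine)
```

Because consecutive frames differ only slightly, a few refinement iterations from the previous estimate are enough to stay close to the true pose in this toy setting.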

Results

The paper reports that FoundationPose outperforms existing specialized methods by a large margin on multiple public datasets, for both pose estimation and tracking. It also achieves results comparable to those of instance-level methods while making fewer assumptions.
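
This summary does not name the metrics behind these comparisons. For orientation, one widely used accuracy measure in 6D pose evaluation is ADD, the mean distance between model points transformed by the ground-truth pose and by the estimated pose. The sketch below is a minimal illustration with hypothetical function names and made-up example values. It is not claimed to match the paper's evaluation protocol.

```python
import math

def apply_pose(R, t, p):
    """Rotate point p by the 3x3 matrix R, then translate by t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """Mean Euclidean distance between ground-truth and estimated model points."""
    total = 0.0
    for p in points:
        total += math.dist(apply_pose(R_gt, t_gt, p), apply_pose(R_est, t_est, p))
    return total / len(points)

# Made-up example: the rotation is correct, the translation is off by 1 cm along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model_points = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
error = add_metric(model_points, identity, [0.0, 0.0, 0.5],
                   identity, [0.01, 0.0, 0.5])
```

A lower value means the estimated pose places the object closer to its true position and orientation.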

Implications and Future Work

The practical implications of FoundationPose are substantial: it addresses the limitations of conventional instance-level and category-level methods by enabling application to arbitrary novel objects, a significant step forward for robotic manipulation and mixed reality applications. More broadly, the work reflects a shift toward more general and flexible models that depend less on extensive instance-specific training data.

Future developments may include the exploration of multi-object pose estimation and further enhancement of the model's ability to handle complex, real-world environments without additional computational cost. Integrating detection into the unified framework could also streamline processes and improve system scalability.

In conclusion, FoundationPose represents a significant development in the field of 6D pose estimation and tracking, presenting a versatile, efficient method applicable across diverse scenarios with reduced prerequisites. The framework not only reinforces the potential of synthetic training environments but also sets a foundation for future innovations in AI-based object manipulation and interaction.