FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (2312.08344v2)
Abstract: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given, or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by an LLM, a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/