
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation (2311.15707v2)

Published 27 Nov 2023 in cs.CV

Abstract: Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

Summary

  • The paper presents SAM-6D, a framework leveraging zero-shot segmentation to estimate 6D object poses without object-specific training.
  • It pairs an Instance Segmentation Model built on SAM with semantic, appearance, and geometric matching scores to select precise, class-agnostic object proposals.
  • The approach uses two-stage Point Transformers with background tokens for coarse-to-fine pose matching, outperforming previous methods on the BOP benchmarks in cluttered scenes.

Overview of "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation"

The paper "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation" introduces SAM-6D, a framework that estimates the 6D poses of novel objects in cluttered environments without requiring prior training on those objects. Recognizing the challenges that zero-shot settings pose for both object detection and pose estimation, the authors build on the Segment Anything Model (SAM) to address them.

Key Components and Methodology

SAM-6D consists of two primary components: the Instance Segmentation Model (ISM) and the Pose Estimation Model (PEM).

  1. Instance Segmentation Model (ISM):
    • ISM uses SAM's zero-shot capabilities to generate class-agnostic object proposals from cluttered RGB images.
    • A novel object matching score, combining semantics, appearance, and geometry, is calculated for each proposal to filter out invalid ones (a sketch of such a score follows this list).
    • Semantic matching uses a DINOv2 ViT to compare each proposal against rendered templates of the target object, measuring semantic similarity.
    • Appearance matching refines this by evaluating patch-wise feature similarities between the proposal and its best-matching template.
    • The geometric score checks consistency with the object's expected shape and size via bounding-box IoU.
  2. Pose Estimation Model (PEM):
    • PEM treats pose estimation as a partial-to-partial point matching problem between the observed points and points sampled from the object model.
    • It introduces background tokens so that points without a valid correspondence, for example due to occlusion, can be assigned cheaply instead of being forced into spurious matches (see the second sketch below).
    • The model operates in two stages: Coarse Point Matching estimates an initial pose from sparse point sets, and Fine Point Matching refines it through dense correspondence.
    • Sparse-to-Dense Point Transformers keep the fine stage efficient by running attention over a sparse point set and propagating the result to the dense set.
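
As a rough illustration of how the three matching cues could be combined, the sketch below scores one SAM proposal against a set of object templates. All function names, feature shapes, and the equal weighting of the three terms are assumptions made for illustration; the paper defines its own matching scores and aggregation.

```python
import numpy as np

def object_matching_score(proposal_cls, template_cls, proposal_patches,
                          template_patches, bbox_iou):
    """Illustrative combination of semantic, appearance and geometric cues.

    proposal_cls:     (D,) class embedding of one SAM proposal (e.g. a DINOv2 CLS token)
    template_cls:     (T, D) class embeddings of T rendered object templates
    proposal_patches: (P, D) patch embeddings of the proposal crop
    template_patches: (Q, D) patch embeddings of the best-matching template
    bbox_iou:         IoU between the proposal box and the box expected for
                      the object under an estimated position and size
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Semantic term: best cosine similarity against any template view.
    s_sem = max(cos(proposal_cls, t) for t in template_cls)

    # Appearance term: for each proposal patch, take its best-matching
    # template patch, then average these patch-wise similarities.
    sims = proposal_patches @ template_patches.T
    norms = (np.linalg.norm(proposal_patches, axis=1, keepdims=True)
             * np.linalg.norm(template_patches, axis=1)[None, :] + 1e-8)
    s_app = (sims / norms).max(axis=1).mean()

    # Geometric term: bounding-box agreement.
    s_geo = bbox_iou

    # Equal-weight average of the three terms (an assumption; the paper's
    # exact aggregation may differ).
    return (s_sem + s_app + s_geo) / 3.0
```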
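
The second sketch makes the background-token idea concrete: each point set is augmented with one extra token, and points whose best match is that token are treated as having no correspondence. A standard weighted Kabsch/Umeyama solve then turns the resulting soft correspondences into a pose. Shapes, names, and the temperature are illustrative assumptions; SAM-6D's actual transformer design and weighting differ in detail.

```python
import torch

def soft_assignment(scene_feats, model_feats, bg_scene, bg_model, tau=0.05):
    """Soft point matching with background tokens.

    scene_feats: (No, D) features of observed points inside the proposal
    model_feats: (Nm, D) features of points sampled from the object model
    bg_scene, bg_model: (D,) learnable background tokens
    Returns an (No+1, Nm+1) soft assignment; row/column 0 absorb points
    with no valid correspondence (occluded or non-object points).
    """
    f_o = torch.cat([bg_scene.unsqueeze(0), scene_feats], dim=0)
    f_m = torch.cat([bg_model.unsqueeze(0), model_feats], dim=0)
    sim = f_o @ f_m.t() / tau
    # Softmax along each axis; the product acts like a doubly-soft
    # assignment in which unmatched points fall into the background slots.
    return torch.softmax(sim, dim=1) * torch.softmax(sim, dim=0)

def pose_from_correspondences(src, dst, w):
    """Weighted Kabsch/Umeyama: the rigid (R, t) minimizing
    sum_i w_i * ||R @ src_i + t - dst_i||^2, a standard way to turn
    weighted 3D-3D correspondences into a pose estimate."""
    w = w / w.sum()
    mu_s = (w.unsqueeze(1) * src).sum(dim=0)
    mu_d = (w.unsqueeze(1) * dst).sum(dim=0)
    H = (w.unsqueeze(1) * (src - mu_s)).t() @ (dst - mu_d)
    U, _, Vt = torch.linalg.svd(H)
    V = Vt.t().clone()
    # Flip the last axis if needed so det(R) = +1 (a proper rotation).
    V[:, -1] *= torch.sign(torch.det(V @ U.t()))
    R = V @ U.t()
    t = mu_d - R @ mu_s
    return R, t
```

In a coarse-to-fine pipeline like the one described above, a step of this form would run once on sparse points to obtain an initial pose and again on dense, pose-conditioned features for refinement.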

Results and Implications

Evaluated on the seven core datasets of the BOP benchmark, SAM-6D demonstrates strong generalization in both instance segmentation and pose estimation of novel objects.

  • Performance Metrics:
    • SAM-6D outperformed prior zero-shot methods in instance segmentation, achieving higher mAP scores across the seven datasets.
    • For pose estimation, SAM-6D achieved high average recall (AR) scores, remaining effective even when relying solely on masks from a generic segmentation model.
  • Significance:
    • The integration of SAM into the pipeline shows that general-purpose segmentation foundation models can serve as a practical basis for robust zero-shot applications.
    • The design of the PEM, particularly the background tokens and the coarse-to-fine two-stage matching, offers an efficient way to estimate poses without extensive computational resources.

Future Directions

The research opens pathways for more refined zero-shot learning strategies, such as integrating more sophisticated data augmentation or extending to broader object categories. The proposed methodology could also motivate work on real-time applications, particularly around computational efficiency and practical pipeline integration in robotics and AR systems.

Conclusion

SAM-6D represents a significant step in leveraging generalized segmentation models for specific, advanced tasks in computer vision. By strategically integrating SAM and devising an innovative pose estimation pipeline, the authors have set a new standard for zero-shot learning applicability in the field of 6D object pose estimation.