SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation (2311.15707v2)

Published 27 Nov 2023 in cs.CV

Abstract: Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

Summary

The paper presents SAM-6D, a framework leveraging zero-shot segmentation to estimate 6D object poses without object-specific training.
It integrates an Instance Segmentation Model with semantic, appearance, and geometric matching to generate precise, class-agnostic object proposals.
The approach performs two-stage point matching with background tokens to estimate poses, outperforming previous methods on the BOP Benchmark.

Overview of "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation"

The paper "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation" introduces SAM-6D, a framework designed to estimate the 6D poses of novel objects in cluttered environments without requiring prior training on those objects. Recognizing the challenges imposed by zero-shot settings for both object detection and pose estimation, the authors leverage the Segment Anything Model (SAM) to address these demands.

Key Components and Methodology

SAM-6D consists of two primary components: the Instance Segmentation Model (ISM) and the Pose Estimation Model (PEM).

Instance Segmentation Model (ISM):
- ISM uses SAM's zero-shot segmentation to generate class-agnostic object proposals from RGB images.
- A novel object matching score is calculated for each proposal, considering semantics, appearance, and geometry to filter and retain valid proposals.
- Semantic matching compares DINOv2 ViT embeddings of each proposal with those of the object templates to measure semantic similarity.
- Appearance matching refines this by comparing patch-level features between the proposal and the templates.
- The geometric score checks whether the proposal's shape and size are plausible for the target object, measured as the IoU between bounding boxes.
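
The three matching terms above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `top_k` parameter, and the unweighted average in `matching_score` are assumptions made for the example, and the embeddings are taken to be precomputed arrays (for instance from a ViT backbone).

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def semantic_score(proposal_cls, template_cls, top_k=5):
    """Mean of the top-k similarities between one proposal embedding (d,)
    and the template embeddings (T, d). top_k is a hypothetical parameter."""
    sims = cosine_sim(proposal_cls[None, :], template_cls)[0]
    k = min(top_k, sims.size)
    return float(np.sort(sims)[-k:].mean())

def appearance_score(proposal_patches, template_patches):
    """For each proposal patch, take its best similarity to any template
    patch, then average over the proposal patches."""
    sims = cosine_sim(proposal_patches, template_patches)
    return float(sims.max(axis=1).mean())

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matching_score(s_sem, s_appe, s_geo):
    """Assumption: a plain average of the three terms. The paper's exact
    weighting is not given in this summary."""
    return (s_sem + s_appe + s_geo) / 3.0
```

Proposals whose combined score falls below a chosen threshold would then be discarded; the threshold itself is a tuning choice not specified here.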

Pose Estimation Model (PEM):
- PEM treats pose estimation as a partial-to-partial point matching problem.
- It introduces background tokens so that points with no counterpart in the other point set (for example, because of occlusion or imperfect segmentation) can be assigned to the background instead of being forced into a wrong match.
- The model operates in two stages: Coarse Point Matching for initial pose estimation using sparse point pairs, and Fine Point Matching for refining poses through dense correspondence.
- Sparse-to-Dense Point Transformers keep the dense matching stage efficient by modeling interactions on a sparse subset of points and propagating them to the dense point set.
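
The role of a background token and the step from correspondences to a pose can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's network: `assign_with_background`, the `temperature` value, and the use of a weighted Kabsch (SVD) solve are choices made for the example, and the point features are assumed to be already computed.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assign_with_background(feat_obs, feat_model, bg_token, temperature=0.1):
    """Match observed points (N, d) to model points (M, d).
    A shared background token (d,) is prepended to both sets, so a point
    without a counterpart can match the background instead.
    Returns an int array of length N: model index, or -1 for background."""
    fo = np.vstack([bg_token, feat_obs])      # (N+1, d)
    fm = np.vstack([bg_token, feat_model])    # (M+1, d)
    logits = fo @ fm.T / temperature
    # Soft assignment: product of row-wise and column-wise softmax.
    soft = softmax(logits, axis=1) * softmax(logits, axis=0)
    match = soft[1:].argmax(axis=1)           # column 0 is the background
    return match - 1

def weighted_kabsch(src, dst, w):
    """Rigid transform (R, t) minimizing sum_i w_i ||R src_i + t - dst_i||^2."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In a pipeline of this kind, only the pairs with a non-negative index from `assign_with_background` would be passed to `weighted_kabsch`, with the soft assignment values serving as weights.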

Results and Implications

On the seven core datasets of the BOP Benchmark, SAM-6D outperforms existing methods for novel objects in both instance segmentation and pose estimation.

Performance Metrics:
- For instance segmentation, SAM-6D outperformed previous methods in mean Average Precision (mAP) across the benchmark datasets.
- For pose estimation, SAM-6D achieved higher Average Recall (AR) than prior methods, showing that proposals from a general-purpose segmentation model are sufficient for accurate pose estimation.
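
For readers unfamiliar with recall-based pose metrics, the idea can be sketched as below. This is a simplified illustration, not the official BOP toolkit: it assumes a single error function whose values are compared against thresholds expressed as fractions of the object diameter, and the function name and default thresholds are hypothetical.

```python
import numpy as np

def average_recall(errors, diameter, rel_thresholds=None):
    """Fraction of pose estimates whose error is below each threshold,
    averaged over all thresholds.
    errors: per-estimate pose errors, in the same unit as diameter.
    rel_thresholds: thresholds as fractions of the object diameter
    (assumed default: 10 values from 0.05 to 0.5)."""
    if rel_thresholds is None:
        rel_thresholds = np.linspace(0.05, 0.5, 10)
    errors = np.asarray(errors, dtype=float)
    recalls = [(errors < thr * diameter).mean() for thr in rel_thresholds]
    return float(np.mean(recalls))
```

BOP reports AR as the mean over three such error functions (VSD, MSSD, and MSPD), each with its own threshold scheme, so a single-error sketch like this only conveys the general shape of the computation.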

Significance:
- The integration of SAM with existing segmentation paradigms provides a viable pathway for robust zero-shot applications.
- The design of the PEM, particularly the background tokens and two-stage matching, offers a new way to build dense 3D-3D correspondences between partial point sets while keeping the computation manageable.

Future Directions

The research opens pathways for developing more refined zero-shot learning strategies, potentially integrating more sophisticated data augmentation techniques or expanding to broader object categories. The methodologies proposed could inspire further investigation into real-time applications, especially with considerations of computational efficiency and the practicality of pipeline integration in robotics and AR systems.

Conclusion

SAM-6D represents a significant step in leveraging generalized segmentation models for specific, advanced tasks in computer vision. By strategically integrating SAM and devising an innovative pose estimation pipeline, the authors have set a new standard for zero-shot learning applicability in the field of 6D object pose estimation.

GitHub

GitHub - JiehongLin/SAM-6D: [CVPR2024] Code for "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation". (539 stars)