SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
Introduction
The Segment Anything Model (SAM) has established itself as a powerful foundation model for interactive image segmentation across a range of visual domains. However, extending SAM to 3D segmentation has remained an open challenge due to the limitations of existing methodologies, including inefficient 2D-3D projection, loss of spatial information, reduced prompting flexibility, and limited domain transferability.
To address these issues, the authors introduce SAM2Point, a novel approach that adapts SAM 2 for efficient, zero-shot, and promptable 3D segmentation. By treating 3D data as a series of multi-directional videos, SAM2Point leverages SAM 2 for 3D-space segmentation without additional training or 2D-3D projection. The paper argues that SAM2Point provides the most faithful implementation of SAM in 3D to date, positioning it as a foundational baseline for future promptable 3D segmentation research.
Methodology
3D Data as Videos
SAM2Point addresses the primary challenges of 3D segmentation by converting 3D data into voxel representations. Each voxelized sample naturally resembles the format of a video: a voxel grid of shape w×h×l×3 is treated as a video of shape w×h×t×3, so SAM 2 can process it frame by frame while spatial information is preserved. This avoids complex 2D-3D projections, minimizing information degradation and cumbersome post-processing.
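A minimal sketch of this idea in Python, assuming a colored point cloud of shape (N, 6) with xyz coordinates and RGB values; the grid resolution and the `voxelize`/`as_video` helper names are illustrative, not the authors' actual implementation:

```python
import numpy as np

def voxelize(points: np.ndarray, resolution: int = 128) -> np.ndarray:
    """Rasterize an (N, 6) xyz+rgb point cloud into a (w, h, l, 3) color grid."""
    xyz, rgb = points[:, :3], points[:, 3:6]
    # Normalize coordinates to [0, 1] and map them to voxel indices.
    mins, maxs = xyz.min(axis=0), xyz.max(axis=0)
    idx = ((xyz - mins) / (maxs - mins + 1e-8) * (resolution - 1)).astype(int)
    grid = np.zeros((resolution, resolution, resolution, 3), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = rgb  # last point wins per voxel
    return grid

def as_video(grid: np.ndarray, axis: int = 2, reverse: bool = False) -> np.ndarray:
    """View the (w, h, l, 3) grid as a stack of (w, h, 3) frames along one axis,
    i.e. a 'video' of shape t x w x h x 3 that SAM 2 can consume frame by frame."""
    frames = np.moveaxis(grid, axis, 0)
    return frames[::-1] if reverse else frames
```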
Promptable Segmentation
SAM2Point supports three types of 3D prompts: points, boxes, and masks. These prompts enable interactive, user-directed segmentation. The specific strategies for handling each type of 3D prompt are:
- 3D Point Prompt: Starting from a user-specified point, the voxel grid is divided into orthogonal 2D sections, yielding six directional videos (two per axis) for SAM 2, with the point's remaining two coordinates serving as the 2D prompt.
- 3D Box Prompt: The 3D box is projected onto the 2D sections of each directional video, and the resulting rectangle prompts SAM 2.
- 3D Mask Prompt: The intersection of the 3D mask prompt with each 2D section serves as the per-frame mask prompt.
These strategies preserve SAM's flexibility of user-guided segmentation while extending it to 3D, as the sketch below illustrates.
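A hedged sketch of how the point and box prompts might map onto SAM 2's video interface, reusing the voxel grid from the previous snippet; the function names and the axis-aligned box assumption are illustrative, not the authors' exact implementation:

```python
import numpy as np

def point_prompt_videos(grid: np.ndarray, point_idx: tuple[int, int, int]):
    """Split the voxel grid at a 3D point into six directional videos.

    For each of the three axes, we start at the section containing the prompt
    and walk outward in both directions, yielding (frames, prompt_2d) pairs:
    prompt_2d is the point's remaining two coordinates, applied to frame 0.
    """
    videos = []
    for axis in range(3):
        prompt_2d = tuple(c for i, c in enumerate(point_idx) if i != axis)
        start = point_idx[axis]
        frames = np.moveaxis(grid, axis, 0)
        videos.append((frames[start:], prompt_2d))     # positive direction
        videos.append((frames[start::-1], prompt_2d))  # negative direction
    return videos

def box_prompt_2d(box3d, axis: int, frame_index: int):
    """Project an axis-aligned 3D box (min_corner, max_corner) onto the 2D
    section at frame_index along `axis`; returns None if the section misses it."""
    mins, maxs = box3d
    if not (mins[axis] <= frame_index <= maxs[axis]):
        return None
    keep = [i for i in range(3) if i != axis]
    return (mins[keep[0]], mins[keep[1]], maxs[keep[0]], maxs[keep[1]])

# Usage (pseudo): run SAM 2 on each directional video, prompted on frame 0,
# then fuse the six per-frame mask sequences back into one 3D voxel mask.
# masks_3d = fuse([segment_video(v, p) for v, p in point_prompt_videos(grid, pt)])
```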
Applicability Across Scenarios
SAM2Point demonstrates robust generalization by successfully segmenting various 3D data types:
- 3D Objects: Handles complex distributions and overlapping components.
- Indoor Scenes: Manages confined space arrangements with multiple objects.
- Outdoor Scenes: Deals with large-scale, diverse object classes in broader environments.
- Raw LiDAR Data: Segments sparse data without RGB information, relying purely on geometric cues (see the sketch after this list).
The architecture is designed to function uniformly across these scenarios, highlighting its potential for comprehensive 3D segmentation tasks.
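For the raw-LiDAR case above, one plausible way to keep the same video pipeline is to fill the three color channels with a geometric cue such as normalized height; this substitution is an assumption for illustration, not necessarily the authors' exact recipe:

```python
import numpy as np

def voxelize_lidar(xyz: np.ndarray, resolution: int = 128) -> np.ndarray:
    """Rasterize (N, 3) LiDAR points into a (w, h, l, 3) grid of geometric cues."""
    mins, maxs = xyz.min(axis=0), xyz.max(axis=0)
    idx = ((xyz - mins) / (maxs - mins + 1e-8) * (resolution - 1)).astype(int)
    # Normalized height stands in for color; any geometric feature would do.
    height = (xyz[:, 2] - mins[2]) / (maxs[2] - mins[2] + 1e-8)
    grid = np.zeros((resolution, resolution, resolution, 3), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = height[:, None]  # replicate to 3 channels
    return grid
```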
Implications and Future Directions
Adaptation of 2D Models to 3D
SAM2Point provides a compelling approach for transferring pre-trained 2D models into the 3D domain. The voxelization step creates a data format suitable for SAM 2 without sacrificing spatial information, representing a favorable trade-off between efficiency and performance. Future research could further validate and optimize this approach to enhance segmentation accuracy and efficiency.
Potential Applications
SAM2Point holds significant potential for advancing various 3D applications:
- Fundamental 3D Understanding: It can serve as a unified backbone for further training and fine-tuning, providing strong initial representations.
- Automatic Data Annotation: Useful in generating large-scale segmentation labels, mitigating data scarcity issues in 3D.
- 3D-Language-Vision Learning: Its promptable segmentation can help align 3D data with joint vision-language embedding spaces for multi-modal learning.
- 3D LLMs: Can serve as a robust 3D encoder, facilitating 3D token generation for LLMs.
Conclusion
The paper presents a novel framework for zero-shot, promptable 3D segmentation that leverages SAM 2. By representing 3D data as multi-directional videos, SAM2Point efficiently preserves spatial information and supports interactive 3D prompts. Demonstrated to be effective across diverse 3D scenarios, SAM2Point lays the groundwork for future research into adaptive and efficient 3D segmentation. The framework holds promise for applications ranging from data annotation to multi-modal learning and 3D understanding, encouraging further exploration of SAM 2 for comprehensive 3D segmentation tasks.