SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
The paper "SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model" introduces an innovative approach to performing zero-shot 3D object detection by leveraging the Segment Anything Model (SAM), originally designed for 2D image segmentation. This research is a significant step towards utilizing vision foundation models for 3D vision tasks, specifically focusing on the challenging domain of 3D object detection.
Methodology
The authors present SAM3D, a method that adapts SAM for 3D object detection by transforming LiDAR point cloud data into Bird's Eye View (BEV) images. These BEV images preserve the scene's spatial layout while presenting the data in a 2D image format that SAM can process. The pipeline for SAM3D consists of five main steps:
- LiDAR-to-BEV Projection: Projects sparse LiDAR points onto the ground plane to form BEV images, encoding reflection intensity as RGB values via a predefined palette. This increases the discriminative power of the BEV images (see the first sketch after this list).
- BEV Post-Processing: Applies morphological dilation to the BEV images so that they better resemble the dense natural images SAM was trained on, improving SAM's ability to segment them.
- Segmentation with SAM: Covers the BEV image with mesh-grid prompts and employs SAM to generate segmentation masks; pruning strategies are applied to keep the computation efficient (see the second sketch after this list).
- Mask Post-Processing: Filters out noisy masks using area and aspect-ratio thresholds derived from prior knowledge about vehicle footprints.
- Mask2Box Translation: Converts the 2D segmentation masks into 3D bounding boxes using both the BEV geometry and the original LiDAR points (the last two steps are covered by the third sketch after this list).
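To make the first two steps concrete, here is a minimal sketch of a LiDAR-to-BEV projection followed by morphological dilation, using NumPy and OpenCV. The pillar size, BEV range, palette, and dilation kernel below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import cv2

def lidar_to_bev(points, intensities, pillar=0.1, bev_range=30.0, palette=None):
    """Project LiDAR points onto a BEV pseudo-RGB image.

    points      : (N, 3) array of x, y, z coordinates in meters
    intensities : (N,) array of reflection intensities in [0, 1]
    pillar      : BEV cell size in meters (0.1 m worked best in the paper)
    bev_range   : half-extent of the BEV window around the ego vehicle
    palette     : optional (256, 3) uint8 lookup table mapping intensity to RGB;
                  the exact palette used in the paper is an assumption here
    """
    size = int(2 * bev_range / pillar)                 # BEV image is size x size pixels
    bev = np.zeros((size, size, 3), dtype=np.uint8)

    # Keep only points inside the square BEV window.
    keep = (np.abs(points[:, 0]) < bev_range) & (np.abs(points[:, 1]) < bev_range)
    pts, inten = points[keep], intensities[keep]

    # Discretize x/y coordinates into pixel indices.
    cols = ((pts[:, 0] + bev_range) / pillar).astype(int).clip(0, size - 1)
    rows = ((pts[:, 1] + bev_range) / pillar).astype(int).clip(0, size - 1)

    # Map intensity to a color via the palette (fallback: grayscale).
    idx = (inten * 255).astype(np.uint8)
    colors = palette[idx] if palette is not None else np.stack([idx] * 3, axis=1)
    bev[rows, cols] = colors

    # BEV post-processing: morphological dilation densifies the sparse projection
    # so it better resembles the natural images SAM was trained on.
    kernel = np.ones((3, 3), np.uint8)                 # kernel size is an assumption
    return cv2.dilate(bev, kernel, iterations=1)
```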
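For the segmentation step, the paper prompts SAM with a mesh grid of points over the BEV image. A convenient stand-in is the `SamAutomaticMaskGenerator` from the official `segment_anything` package, which internally prompts SAM with just such a regular point grid; the checkpoint path, grid density, and quality thresholds below are assumptions, and the generator's built-in filtering only approximates the paper's pruning strategy.

```python
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (model size and checkpoint path are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# The generator prompts SAM with a points_per_side x points_per_side mesh grid,
# mirroring the grid prompting described in the paper; low-quality masks are
# pruned by the IoU and stability thresholds.
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,              # density of the prompt grid (assumption)
    pred_iou_thresh=0.88,
    stability_score_thresh=0.95,
)

masks = mask_generator.generate(bev)  # bev: HxWx3 uint8 BEV image from the previous sketch
# Each entry is a dict with a boolean 'segmentation' map, its 'area',
# and an axis-aligned 'bbox' in XYWH pixel coordinates.
```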
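Finally, a rough sketch of mask post-processing and Mask2Box translation, assuming the BEV geometry of the first sketch and the mask format returned by `SamAutomaticMaskGenerator`. The area and aspect-ratio thresholds are placeholders, and the resulting boxes are axis-aligned for simplicity, whereas the paper's Mask2Box step may produce oriented boxes.

```python
import numpy as np

def masks_to_boxes(masks, points, pillar=0.1, bev_range=30.0,
                   min_area=20, max_area=4000, max_aspect=4.0):
    """Filter noisy SAM masks and lift the survivors to 3D boxes."""
    boxes_3d = []
    for m in masks:
        x, y, w, h = m["bbox"]                       # BEV-pixel bounding box (XYWH)

        # Mask post-processing: drop masks that are too small, too large,
        # or too elongated to be a plausible vehicle footprint.
        aspect = max(w, h) / max(min(w, h), 1)
        if not (min_area <= m["area"] <= max_area) or aspect > max_aspect:
            continue

        # Mask2Box: convert BEV-pixel extents back to metric x/y, then take
        # the z extent from the original LiDAR points inside the footprint.
        x_min = x * pillar - bev_range
        x_max = (x + w) * pillar - bev_range
        y_min = y * pillar - bev_range
        y_max = (y + h) * pillar - bev_range

        inside = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
                  (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
        if not inside.any():
            continue
        z_min, z_max = points[inside, 2].min(), points[inside, 2].max()

        # Axis-aligned box as center (cx, cy, cz) and size (l, w, h).
        boxes_3d.append([
            (x_min + x_max) / 2, (y_min + y_max) / 2, (z_min + z_max) / 2,
            x_max - x_min, y_max - y_min, z_max - z_min,
        ])
    return np.array(boxes_3d)
```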
Experimental Results
The method was evaluated on the Waymo Open Dataset, focusing on the detection of VEHICLES within a 30-meter range. Notable findings include:
- The implementation of intensity-based and RGB palette transformations for BEV image creation led to improved segmentation results.
- BEV post-processing significantly reduced the domain gap between the BEV images and the natural images SAM was trained on, yielding better detection performance.
- A pillar size of $0.1$ m was found to best balance spatial resolution against noise in the segmentation.
Comparison with Fully-Supervised Models
The paper highlighted a substantial performance gap between SAM3D and traditional fully-supervised 3D detectors. SAM3D's AP and APH were considerably lower, which is expected given that SAM was trained only on 2D natural images. However, achieving any level of 3D object detection in a zero-shot setting illustrates SAM's strong generalization capability and the potential for further advancements.
Implications and Future Directions
This research underscores several key implications:
- Practical Value: Demonstrates the feasibility of zero-shot 3D object detection, which could significantly reduce the cost and effort associated with 3D data labeling.
- Foundation Model Adaptability: Provides a framework for adapting 2D vision models to 3D tasks, paving the way for similar adaptations in other vision-related domains.
- Efficiency and Scalability: Shows the potential for using large-scale pre-trained models in real-world applications, even when their original training domains differ from the application domain.
Future Developments
Future research might address several current limitations and expand the capabilities of SAM3D:
- Scene Representation: Investigate alternative representations that preserve more detailed spatial information without compromising compatibility with SAM.
- Multimodal Integration: Utilize complementary modalities such as RGB images to enhance detection accuracy, particularly in occluded or distant scenarios.
- Model Optimization: Apply model compression and distillation techniques to improve inference speed and make the method viable for real-time applications.
- Multi-Class Detection: Incorporate vision-language models to support multi-class object detection, leveraging semantic information for classification.
Overall, SAM3D represents a compelling advancement in the adaptation of vision foundation models for 3D tasks, highlighting the intersection between 2D and 3D vision research and setting the stage for further innovations in the field of AI-driven perception.