SAM 3D: Advancing Volumetric Segmentation
- "SAM 3D" refers to a family of adaptations of the 2D Segment Anything Model that extend promptable segmentation to volumetric and 3D data, with methods ranging from slice-wise inference to native 3D architectures.
- Recent advances incorporate multi-modal guidance and efficient back-projection, achieving competitive Dice scores and robust performance in medical and scene segmentation.
- The framework is applied across 3D medical image segmentation, scene understanding, and generative 3D reconstruction, showcasing practical improvements over traditional methods.
The term "SAM 3D" encompasses a spectrum of methods that extend, adapt, or leverage the Segment Anything Model (SAM)—originally designed for 2D vision tasks—into the 3D domain. SAM 3D approaches comprise both architectural adaptations of SAM for native 3D input, as well as algorithmic pipelines that fuse 2D SAM outputs into 3D representations for segmentation, detection, part decomposition, and generative shape modeling. These advances address medical imaging, scene understanding, robotics, AR/VR content creation, and 3D asset processing.
1. Foundations and Problem Motivation
SAM is a promptable vision foundation model (ViT backbone) trained on the largest available 2D segmentation dataset (SA-1B). It enables high-quality 2D mask generation from user-provided prompts. The central challenge in "SAM 3D" is transferring SAM's prompt-driven or auto-segmentation capabilities to 3D data domains—volumetric medical images (CT, MRI, PET), 3D point clouds, meshes, radiance fields, and multi-view scenes. These domains present inherent obstacles for SAM's 2D design: spatial anisotropy, volumetric context, and the necessity for consistent region identification across 3D space.
Early adaptations of SAM to 3D relied on slice-wise segmentation and recomposition strategies. More recent developments have focused on end-to-end 3D model architectures, prompt/adapter extensions, efficient attention schemes, and open-vocabulary instance labeling.
2. 3D Medical Image Segmentation with SAM
A primary application area for SAM 3D is semi-automated and automated segmentation in volumetric medical images.
2.1. 2D-to-3D Slicing and Interactive Prompting
SAMM adapts 2D SAM inference to 3D medical images within the 3D Slicer platform. It operates as a two-process system: the SAM server performs per-slice embedding and prompt-based inference, while the Slicer plugin manages volumetric I/O, prompt collection, and mask visualization. Each 3D volume is unwrapped into ordered 2D slices, which are encoded only once. Upon user prompt, SAM infers masks on single slices, which are projected back into the 3D scene (RAS space) and assembled to reconstruct full 3D segmentations. Real-time interaction is enabled with per-slice latency of around 0.61 s following a one-time embedding cost of ~163 s for a 352×352×240 volume (Liu et al., 2023).
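For concreteness, the following is a minimal sketch of this slice-wise pattern using the public `segment_anything` package: each prompted slice is embedded and segmented independently, and the resulting 2D masks are written back into a 3D array. The checkpoint path, intensity normalization, and prompt format are illustrative assumptions, not SAMM's actual plugin code.

```python
# Minimal sketch of slice-wise 3D segmentation with a frozen 2D SAM.
# Assumptions: the official `segment_anything` package and a local ViT-H
# checkpoint; volume loading and RAS coordinate handling are simplified.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed local file
predictor = SamPredictor(sam)

def segment_volume(volume, prompts):
    """volume: (D, H, W) float array; prompts: {slice_index: (x, y) foreground point}."""
    mask_3d = np.zeros(volume.shape, dtype=bool)
    for z, (x, y) in prompts.items():
        # Normalize the slice to 8-bit RGB, since SAM expects natural-image input.
        sl = volume[z].astype(np.float32)
        sl = (sl - sl.min()) / (np.ptp(sl) + 1e-6)
        rgb = (np.stack([sl, sl, sl], axis=-1) * 255).astype(np.uint8)
        predictor.set_image(rgb)              # per-slice embedding (the costly step)
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),       # 1 marks a foreground point
            multimask_output=False,
        )
        mask_3d[z] = masks[0]                 # back-project the 2D mask into the volume
    return mask_3d
```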
SAM3D takes a general zero-shot approach, leveraging the pretrained 2D SAM for slice-wise inference. It introduces 3D polyline prompting, volume slicing along multiple axes (octahedral, cubic, icosahedral), and slice aggregation by back-projection. Post-processing includes outlier removal and voxel-level morphological refinement. Empirical evaluation on BTCV and BraTS benchmarks yields Dice scores competitive with fully automated nnU-Net, especially for large structures (liver: 0.95, lungs: 0.98) (Chan et al., 10 May 2024).
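The multi-axis aggregation idea can be sketched as a simple voting scheme over per-axis slice-wise masks. Here `segment_along_axis` is a hypothetical helper (for example, the slice-wise routine above applied to every slice), and the vote threshold is an assumption rather than the paper's exact back-projection rule.

```python
# Hedged sketch of multi-axis slicing and back-projection: run slice-wise SAM
# along each volume axis and merge the resulting 3D masks by majority vote.
import numpy as np

def aggregate_multi_axis(volume, segment_along_axis, threshold=2):
    votes = np.zeros(volume.shape, dtype=np.int32)
    for axis in range(3):
        # Move the chosen axis to the front, segment slice by slice, move it back.
        vol_ax = np.moveaxis(volume, axis, 0)
        mask_ax = segment_along_axis(vol_ax)               # boolean mask, same shape
        votes += np.moveaxis(mask_ax, 0, axis).astype(np.int32)
    return votes >= threshold                              # keep voxels backed by >= 2 axes
```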
2.2. Fully 3D Architectural Adaptations
Multiple works have developed "direct 3D" extensions of SAM's ViT and mask decoder.
- 3DSAM-adapter: Factorizes the patch embedding for 3D inputs, incorporates learnable depthwise convolutions, and interleaves spatial adapters in transformer blocks. Point prompts are encoded by trilinear interpolation in the feature volume. The 3D bottleneck and multi-layer aggregation decoder enable strong performance on small, irregular tumors, with Dice improvements up to +29.87% for pancreas tumors over classic nnU-Net (Gong et al., 2023).
- AutoProSAM: Extends all key SAM components (patch embedding, positional encoding, attention) into 3D using parameter-efficient adapters and LoRA-style layers. Automatic prompt generation replaces manual prompting, utilizing a light 3D FCN to create prompt embeddings. On BTCV, AMOS, CT-ORG, and pelvic datasets, this approach outperformed both nnU-Net and SwinUNETR, achieving mDice scores up to 91.30 on pelvic data (Li et al., 2023). A minimal sketch of this style of 2D-to-3D weight inflation and adapter insertion follows the list.
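The sketch below illustrates the general adaptation pattern these works share: inflating SAM's 2D patch embedding into a 3D convolution and interleaving a small depthwise-convolution adapter into otherwise frozen transformer blocks. Module names, bottleneck sizes, and the inflation rule are illustrative assumptions, not the exact published architectures.

```python
# Hedged sketch of 2D-to-3D adaptation in the spirit of 3DSAM-adapter / AutoProSAM.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """3D patch embedding whose weights can be inflated from SAM's 2D embedding."""
    def __init__(self, in_ch=1, embed_dim=768, patch=(16, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    @torch.no_grad()
    def inflate_from_2d(self, conv2d: nn.Conv2d):
        # Average RGB channels, replicate along depth, and renormalize so the
        # initial 3D response roughly matches the 2D one.
        w2d = conv2d.weight.mean(dim=1, keepdim=True)            # (C, 1, 16, 16)
        d = self.proj.kernel_size[0]
        w3d = w2d.unsqueeze(2).repeat(1, 1, d, 1, 1) / d         # (C, 1, 16, 16, 16)
        self.proj.weight.copy_(w3d)
        self.proj.bias.copy_(conv2d.bias)

    def forward(self, x):                                        # x: (B, 1, D, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)           # (B, N, C) tokens

class DepthwiseAdapter(nn.Module):
    """Tiny residual adapter interleaved into frozen transformer blocks."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.dw = nn.Conv3d(bottleneck, bottleneck, 3, padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, tokens, grid):                             # grid = (D', H', W')
        b, n, c = tokens.shape
        h = self.down(tokens).transpose(1, 2).reshape(b, -1, *grid)
        h = self.dw(h).flatten(2).transpose(1, 2)
        return tokens + self.up(h)                               # residual update
```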
2.3. Prompt Integration and Cross-Modal Designs
Recent models integrate external semantic guidance (text prompts, organ priors, CLIP embeddings):
- TAGS: Combines a 3D-adapted SAM with CLIP-based text encoders and anatomy-specific organ prompts. Multi-modal alignment adapters fuse visual and semantic features. The system yields up to +46.88% average Dice over nnU-Net for 3D tumor segmentation and demonstrates robust multi-prompt performance (Li et al., 21 May 2025).
- RefSAM3D: Extends the ViT backbone to 3D with convolutional adapters, incorporates cross-modal reference prompt generation (CLIP), and hierarchical cross-attention across four scales. Its mask decoder features multi-layer aggregation for direct 3D mask generation. RefSAM3D surpassed nnU-Net and other 3D CNN/transformer baselines on multiple benchmarks, especially in zero- and few-shot scenarios (e.g., BTCV mean Dice: 88.3 vs. nnU-Net 86.3) (Gao et al., 7 Dec 2024).
- Memorizing SAM: Introduces a memorizing Transformer plug-in atop a 3D ViT backbone. A memory bank of high-fidelity key-value pairs is leveraged via kNN soft attention, offering dynamic fusion with local context during inference. On the TotalSegmentator dataset (33 classes), it improves average Dice by 11.36% over FastSAM3D at <5 ms per-volume latency increase (Shao et al., 18 Dec 2024). A simplified sketch of this kNN retrieval appears after the list.
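The memory-bank retrieval can be approximated by kNN soft attention over cached key-value pairs, blended with the block's ordinary local attention output. The cosine-similarity retrieval, top-k size, and fixed gating weight below are simplifying assumptions, not the paper's exact formulation.

```python
# Hedged sketch of kNN soft attention over a fixed key-value memory bank.
import torch
import torch.nn.functional as F

def knn_memory_attention(q, mem_k, mem_v, local_out, k=32, gate=0.5):
    """
    q:         (B, N, C)  queries from the current 3D ViT block
    mem_k/v:   (M, C)     cached high-fidelity keys / values
    local_out: (B, N, C)  output of the block's local attention
    """
    # Retrieve the k nearest memory entries per query (cosine similarity).
    qn = F.normalize(q, dim=-1)
    kn = F.normalize(mem_k, dim=-1)
    sim = qn @ kn.t()                               # (B, N, M)
    topk_sim, topk_idx = sim.topk(k, dim=-1)        # (B, N, k)
    weights = topk_sim.softmax(dim=-1)              # soft attention over the k hits
    gathered = mem_v[topk_idx]                      # (B, N, k, C)
    mem_out = (weights.unsqueeze(-1) * gathered).sum(dim=-2)
    # Blend retrieved context with the local attention output.
    return gate * local_out + (1.0 - gate) * mem_out
```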
2.4. Weak- and Semi-Supervised Extensions
- SFR (Stitching, Fine-tuning, Re-training) converts volumes into large stitched 2D images and fine-tunes SAM with LoRA heads. The fine-tuned model generates pseudo-labels, which are then used to bootstrap 3D U-Net/V-Net segmenters in a semi-supervised loop, yielding major Dice gains in low-label regimes (e.g., LA dataset, Mean Teacher+SAM: 74% with only 1 labeled volume) (Li et al., 17 Mar 2024).
- SAM 2: The video-adapted version of SAM is assessed for 3D segmentation by treating 3D scans as videos, prompting on key slices, and propagating via learned memory attention. Out-of-the-box mean IoU on 3D CT/MRI/PET is low (<0.20), with best results when annotating the central slice and propagating bi-directionally (Dong et al., 1 Aug 2024). A sketch of this propagation protocol follows the list.
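The evaluation protocol can be sketched as follows: prompt the central slice, then propagate slice by slice in both directions while carrying a memory of previous predictions. `segment_with_memory` is a hypothetical stand-in for SAM 2's memory-conditioned per-frame inference, not the real `sam2` API.

```python
# Hedged sketch of the "3D scan as video" protocol: prompt the central slice,
# then propagate forward and backward through the volume.
import numpy as np

def propagate_from_center(volume, center_prompt, segment_with_memory):
    D = volume.shape[0]
    center = D // 2
    mask_3d = np.zeros(volume.shape, dtype=bool)

    # Segment the prompted central slice first; it seeds the memory.
    mask_3d[center], memory = segment_with_memory(volume[center], memory=None,
                                                  prompt=center_prompt)
    # Propagate in both directions, carrying memory of previously predicted masks.
    for direction in (+1, -1):
        mem = memory
        stop = D if direction > 0 else -1
        for z in range(center + direction, stop, direction):
            mask_3d[z], mem = segment_with_memory(volume[z], memory=mem)
    return mask_3d
```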
3. SAM 3D in General and Medical Scenes
Unlike the medical-focused pipelines, several works employ SAM 3D for scene-level 3D object detection and segmentation:
- SAM3D: Zero-Shot 3D Object Detection (Zhang et al., 2023) "renders" LiDAR point clouds into colorful BEV images, uses point prompts to segment objects, and analytically maps 2D masks to 3D boxes. Although the mAP is lower than that of supervised detectors, the approach is notable for being genuinely training-free and requiring only the frozen 2D SAM.
- SAM3D for 3D Scene Segmentation (Yang et al., 2023): Fuses per-frame 2D SAM masks into 3D point clouds, employing bottom-up, bidirectional merging of overlapping masks to form consistent 3D instances. The method is entirely promptless, free of additional 3D training, and modular.
- SAMPro3D (Xu et al., 2023): Prompts are sampled in the 3D point cloud, projected into multiple RGB–D frames, and consensus masks are merged through prompt-guided filtering and instance consolidation; the shared 2D-3D projection step is sketched after this list. The method outperformed the fully supervised Mask3D on ScanNet200 (82.60% vs. 79.03%).
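The 2D-3D bridge shared by these pipelines, projecting 3D points into posed frames to obtain pixel prompts and accumulating per-frame mask votes back onto the point cloud, can be sketched as below. The pinhole projection, frame dictionary layout, and `run_sam_with_point_prompt` helper are assumptions for illustration; depth/visibility testing and the papers' filtering and merging steps are omitted.

```python
# Hedged sketch of projecting 3D points into posed RGB-D frames and
# accumulating per-frame SAM mask votes back onto the point cloud.
import numpy as np

def project_points(points_w, K, T_wc):
    """points_w: (N, 3) world coords; K: (3, 3) intrinsics; T_wc: (4, 4) world-to-camera."""
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    pts_c = (T_wc @ pts_h.T).T[:, :3]                 # camera-frame coordinates
    valid = pts_c[:, 2] > 0                           # keep points in front of the camera
    uvw = (K @ pts_c.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                     # perspective division
    return uv, valid

def accumulate_mask_votes(points_w, frames, run_sam_with_point_prompt):
    """frames: list of dicts with 'K', 'T_wc', 'image'. Returns per-point vote counts."""
    votes = np.zeros(len(points_w), dtype=np.int32)
    for fr in frames:
        uv, valid = project_points(points_w, fr["K"], fr["T_wc"])
        h, w = fr["image"].shape[:2]
        inside = valid & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        # Hypothetical helper: run SAM on this frame with the projected pixel prompts.
        mask = run_sam_with_point_prompt(fr["image"], uv[inside])
        rows, cols = uv[inside, 1].astype(int), uv[inside, 0].astype(int)
        votes[inside] += mask[rows, cols].astype(np.int32)
    return votes
```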
4. Advanced SAM 3D Part and Material Segmentation
Beyond basic scene/object masks, SAM 3D methods have been developed for fine-grained part segmentation and editing.
- SAMPart3D (Yang et al., 11 Nov 2024): Leverages 2D–3D distillation, using DINOv2/FeatUp features to align a 3D part-aware backbone and a scale-conditioned grouping field. Multi-view VLMs yield part semantics, while multi-granularity segmentation supports fine-to-coarse decomposition. It achieves strong zero-shot mIoU (PartObj-Tiny: 53.7%) and supports interactive segmentation; a simplified selection sketch follows this list.
- P3-SAM (Ma et al., 8 Sep 2025): A native 3D, point-promptable mask generator using a PointTransformer backbone with multi-head, two-stage segmentation and IoU prediction. Fully automatic mask merging yields state-of-the-art instance and part segmentation (e.g., PartNetE mIoU: 65.4% vs. SAMPart3D 56.2%).
- SAMa (Fischer et al., 28 Nov 2024): Adapts SAM2’s video memory to produce multi-view–consistent material similarity maps on sparse views, back-projecting into a 3D point cloud for efficient, continuous, and optimization-free material editing across meshes, NeRFs, and 3D Gaussians. Fast (≈2 s to build similarity cloud, 10 ms/view mask), multiview-consistent, and robust to occlusions.
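A heavily simplified version of click-driven, scale-aware part selection over distilled per-point features might look as follows. The threshold schedule standing in for scale conditioning is an illustrative assumption, not SAMPart3D's learned grouping field.

```python
# Hedged sketch of click-driven part selection over distilled per-point features.
import numpy as np

def select_part(point_feats, click_idx, scale=0.5):
    """
    point_feats: (N, C) per-point features (e.g., distilled DINOv2-style embeddings)
    click_idx:   index of the clicked point
    scale:       0 = finest parts (tight threshold), 1 = coarsest (loose threshold)
    """
    f = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f[click_idx]                        # cosine similarity to the clicked point
    threshold = 0.9 - 0.5 * scale                 # looser threshold at coarser scales
    return sim >= threshold                       # boolean part mask over points
```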
5. Training-Free and Open-Vocabulary 3D Scene Understanding
- OV-SAM3D (Tai et al., 24 May 2024): Introduces a fully training-free, open-vocabulary framework uniting SAM masks, RAM (image-level open tag recognition), and CLIP for multimodal 3D labeling. An initial 3D segmentation is generated by over-segmentation and SAM mask refinement, then merged by overlapping score tables and labeled via CLIP-matched RAM tags. State-of-the-art zero-shot AP on ScanNet200 and nuScenes is demonstrated (e.g., ScanNet200, AP=9.0 on val, surpassing SAI3D and OpenMask3D zero-shot baselines).
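The open-vocabulary labeling step, matching mask crops of a 3D instance against RAM-derived tags with CLIP, can be sketched with the public `clip` package. The prompt template, view averaging, and crop preparation are assumptions; the overlapping-score merging that precedes labeling is not shown.

```python
# Hedged sketch of CLIP-based open-vocabulary labeling of a 3D instance.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def label_instance(crops, open_tags):
    """crops: list of PIL images showing one 3D instance; open_tags: list of str."""
    text = clip.tokenize([f"a photo of a {t}" for t in open_tags]).to(device)
    text_feats = model.encode_text(text)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

    images = torch.stack([preprocess(c) for c in crops]).to(device)
    img_feats = model.encode_image(images)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)

    # Average similarity over views, then take the best-scoring tag.
    sims = (img_feats @ text_feats.T).mean(dim=0)     # (num_tags,)
    best = sims.argmax().item()
    return open_tags[best], sims[best].item()
```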
6. Generative 3D Reconstruction: SAM 3D in Single-Image 3D
The SAM 3D system for generative single-image 3D reconstruction (Team et al., 20 Nov 2025) models the conditional distribution of 3D shape, texture, and layout given an input image I and its SAM mask M. Its pipeline (a dataflow skeleton follows the list) involves:
- Geometry estimation via a large latent-flow transformer working on coarsely-voxelized shape and layout (rotation, translation, scale).
- Texture synthesis and geometry refinement via a dedicated transformer, decoded into mesh or 3D Gaussian representations.
- A multi-stage, large-scale training recipe: synthetic pretraining, semi-synthetic augmentations (render-paste, occlusion completion), and supervised plus preference-based refinement (alignment with human/model-in-the-loop selection). Empirical results show clear gains in reconstruction fidelity over the prior state of the art (0.2344 vs. Trellis' 0.1475 on the paper's reported reconstruction metric; >5:1 human preference win rate).
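The two-stage structure described above can be summarized as a dataflow skeleton. All module names (`geometry_flow`, `texture_refiner`, `decoder`) are hypothetical placeholders, not the released system's API.

```python
# Hedged dataflow skeleton: condition on the image and SAM mask, sample coarse
# geometry and layout, then refine geometry and synthesize texture.
from dataclasses import dataclass

@dataclass
class Reconstruction:
    voxels: object        # coarse occupancy / latent shape
    layout: object        # rotation, translation, scale
    asset: object         # decoded mesh or 3D Gaussian set

def reconstruct(image, mask, geometry_flow, texture_refiner, decoder):
    cond = {"image": image, "mask": mask}
    voxels, layout = geometry_flow.sample(cond)           # stage 1: latent-flow transformer
    refined, texture = texture_refiner(voxels, cond)      # stage 2: refinement + texture
    asset = decoder(refined, texture, layout)             # mesh or 3D Gaussians
    return Reconstruction(voxels=voxels, layout=layout, asset=asset)
```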
7. Limitations, Open Challenges, and Future Directions
Observed challenges and current limitations across the SAM 3D literature include:
- Domain Gaps: SAM's base weights are not optimized for the imaging properties of medical or LiDAR data, impacting mask fidelity for small, low-contrast, or anisotropic structures.
- Prompt Burden and Automation: Manual prompt requirements can constrain clinical usability and scalability. Fully automated prompt or label generation (AutoProSAM, OV-SAM3D) and synthesis via external models (CLIP, RAM, TotalSegmentator) partially address this.
- Memory/Speed Tradeoffs: Native 3D extensions incur higher memory use and latency, particularly for large volumes; on-the-fly embedding and memory banks (Memorizing SAM) offer low-overhead, high-fidelity gains.
- Temporal/Volumetric Consistency: Models like SAM 2, SA3D, and SAMa address cross-frame and multi-view consistency in propagation or reconstruction, but mask quality decay and misalignment remain issues over long volumes.
- Open-Vocabulary and Semantic Assignment: Multimodal region labeling (e.g., via CLIP, ChatGPT + RAM) remains brittle for rare or underrepresented categories and non-visual semantics.
- Evaluation: Many published methods lack large-scale quantitative benchmarks and focus on interactive feasibility or a limited set of anatomical classes.
Strong directions for future work include memory-adaptive attention for ultra-long volumes, volumetric prompt and label distillation, closed-loop semi-supervised pipelines that couple segmentation with downstream tasks (registration, diagnosis), and benchmarks for 3D open-vocabulary labeling across perceptual domains.
Key References
- "SAMM (Segment Any Medical Model): A 3D Slicer Integration to SAM" (Liu et al., 2023)
- "RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation" (Gao et al., 7 Dec 2024)
- "SAMPart3D: Segment Any Part in 3D Objects" (Yang et al., 11 Nov 2024)
- "SAM3D: Zero-Shot Semi-Automatic Segmentation in 3D Medical Images with the Segment Anything Model" (Chan et al., 10 May 2024)
- "AutoProSAM: Automated Prompting SAM for 3D Multi-Organ Segmentation" (Li et al., 2023)
- "P3-SAM: Native 3D Part Segmentation" (Ma et al., 8 Sep 2025)
- "SAM 3D: 3Dfy Anything in Images" (Team et al., 20 Nov 2025)
- "Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding" (Tai et al., 24 May 2024)
- "SAM3D: Segment Anything Model in Volumetric Medical Images" (Bui et al., 2023)
- "3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation" (Gong et al., 2023)
- "TAGS: 3D Tumor-Adaptive Guidance for SAM" (Li et al., 21 May 2025)
- "Segment anything model 2: an application to 2D and 3D medical images" (Dong et al., 1 Aug 2024)
- "Memorizing SAM: 3D Medical Segment Anything Model with Memorizing Transformer" (Shao et al., 18 Dec 2024)