SAM2-3dMed: 3D Medical Image Segmentation

Updated 25 March 2026

The paper presents SAM2-3dMed as a framework that adapts a video segmentation model to 3D medical imaging by modeling bidirectional inter-slice dependencies for enhanced performance.
It incorporates a Slice Relative Position Prediction module that enforces self-supervised spatial ordering, significantly improving the capture of anatomical continuity.
A Boundary Detection module is integrated to refine contour delineation, leading to state-of-the-art segmentation metrics on CT datasets.

SAM2-3dMed is a framework for 3D medical image segmentation that adapts the Segment Anything Model 2 (SAM2), originally developed for video object segmentation, to the spatial and anatomical characteristics of medical imaging. By explicitly modeling bidirectional inter-slice dependencies and enhancing boundary precision, it narrows the long-standing performance gap between foundation models trained on natural images or videos and specialized 3D medical segmentation networks.

1. Motivations and Theoretical Basis

Conventional foundation models like SAM2 are optimized for temporally ordered data, such as videos, where unidirectional propagation and temporal-spatial coherence suffice. However, 3D medical images (e.g., CT, MRI) are fundamentally spatial: (1) anatomical structures may appear and disappear non-monotonically across slices; (2) precise boundary delineation is indispensable for clinical interpretation and quantitative analysis. The two primary domain gaps are the need for explicit bidirectional continuity and fine-grained contour modeling, neither of which is directly addressed by off-the-shelf video-centric models. SAM2-3dMed addresses these gaps by introducing the Slice Relative Position Prediction (SRPP) module and an auxiliary Boundary Detection (BD) branch, integrated into a SAM2 backbone framework (Yang et al., 10 Oct 2025).

2. Architectural Innovations

2.1. Main SAM2-based Segmentation Module

The framework processes an input 3D volume $X \in \mathbb{R}^{3 \times D \times H \times W}$, treating it as an ordered sequence of $D$ slices. Each slice is encoded independently by the frozen SAM2 Image Encoder, yielding $Z \in \mathbb{R}^{C \times D \times H' \times W'}$. Prompt information (e.g., bounding box or point, if provided) is incorporated, and the base SAM2 memory-attention and mask-decoder pipeline generates an initial volumetric mask $\mathbf{P}_{seg} \in \mathbb{R}^{K \times D \times H \times W}$.

2.2. Slice Relative Position Prediction (SRPP) Module

SRPP injects bidirectional spatial context by requiring the network to predict the relative positional offset between arbitrary pairs of slices. Given $Z_i, Z_j$ (the features of slices $i$ and $j$), SRPP outputs $(P_{pos})_{i,j}$ as a prediction of the signed offset $(GT_{pos})_{i,j} = j-i$. The prediction head is a lightweight Transformer encoder followed by an MLP, trained with an MSE loss over all ordered slice pairs: $L_{srpp} = \frac{1}{D(D-1)} \sum_{i=1}^D \sum_{j=1, j \neq i}^D \left((P_{pos})_{i,j} - (GT_{pos})_{i,j}\right)^2$. This self-supervised task encourages the feature space to encode inter-slice ordering and directional continuity, rather than only frame-local properties.
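The SRPP objective can be illustrated with a minimal, dependency-free Python sketch. It only reproduces the target construction $j - i$ and the MSE over ordered pairs from the formula above; the function names `srpp_targets` and `srpp_loss` are hypothetical, slices are 0-indexed here, and the predictions are passed in as a plain $D \times D$ table rather than produced by the Transformer and MLP head.

```python
def srpp_targets(num_slices):
    """Ground-truth offsets GT_pos[i][j] = j - i for every slice pair."""
    return [[j - i for j in range(num_slices)] for i in range(num_slices)]


def srpp_loss(pred, num_slices):
    """MSE between predicted and true offsets over all ordered pairs i != j.

    `pred` is a D x D nested list standing in for the module's output;
    the diagonal (i == j) is ignored, matching the 1 / (D * (D - 1)) factor.
    """
    if num_slices < 2:
        raise ValueError("need at least two slices to form a pair")
    gt = srpp_targets(num_slices)
    total = 0.0
    for i in range(num_slices):
        for j in range(num_slices):
            if i != j:
                total += (pred[i][j] - gt[i][j]) ** 2
    return total / (num_slices * (num_slices - 1))
```

Because the target is signed, the pair $(i, j)$ and its reverse $(j, i)$ carry opposite labels, which is what forces the features to encode direction along the scan axis and not just distance.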

2.3. Boundary Detection (BD) Module

The BD module improves mask boundary precision by introducing explicit, supervised learning of contour voxels. Ground-truth boundaries $GT_{bd}$ are generated via morphological gradients. The BD branch fuses main features and boundary-specific queries via cross-attention, with boundary predictions $P_{bd}$ refined through an MLP and upsampled via a convolutional head. The segmentation loss is augmented with a weighted binary cross-entropy over boundary and non-boundary voxels: $L_{bd} = \frac{N_{non-bd}}{N} \sum_{j \in \Omega_{bd}} \mathrm{BCE}(P_{bd_j}, GT_{bd_j}) + \frac{N_{bd}}{N} \sum_{j \in \Omega_{non-bd}} \mathrm{BCE}(P_{bd_j}, GT_{bd_j})$, where $N_{bd}$ and $N_{non-bd}$ denote the number of boundary and non-boundary voxels, respectively, and $N$ is the total number of voxels. Each class is weighted by the relative frequency of the other, so the boundary voxels, which are typically far fewer, receive the larger weight.
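The two ingredients of the BD supervision, boundary targets from a morphological gradient and the class-balanced BCE, can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: it works on a single 2D binary slice instead of a 3D volume, uses a cross-shaped (4-neighbour) structuring element that the source does not specify, and the names `boundary_map` and `weighted_boundary_bce` are hypothetical.

```python
import math


def boundary_map(mask):
    """Morphological gradient (dilation minus erosion) of a 2D binary mask.

    Assumption: cross-shaped 4-neighbour structuring element; neighbours
    that fall outside the image are simply ignored.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nbrs = [mask[r][c]]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    nbrs.append(mask[rr][cc])
            # dilation = max over neighbourhood, erosion = min
            out[r][c] = max(nbrs) - min(nbrs)
    return out


def weighted_boundary_bce(pred, gt, eps=1e-7):
    """Weighted BCE over flattened voxels, following the L_bd formula.

    Boundary voxels are weighted by N_non-bd / N and non-boundary voxels
    by N_bd / N; terms are summed, not averaged, as in the formula.
    """
    n = len(gt)
    n_bd = sum(gt)
    n_non = n - n_bd
    loss = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        bce = -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
        weight = n_non / n if g == 1 else n_bd / n
        loss += weight * bce
    return loss
```

Note that a dilation-minus-erosion gradient marks voxels on both sides of the contour, so the resulting boundary band is two voxels thick with this structuring element.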

2.4. Training Objective

The total loss combines the main segmentation objective (slice-wise cross-entropy), SRPP, and BD losses: $L_{total} = L_{seg} + \lambda_1 L_{srpp} + \lambda_2 L_{bd}$, where the coefficients $\lambda_1$ and $\lambda_2$ are chosen to balance regional overlap against boundary precision.

3. Training Protocol and Datasets

Training uses three CT datasets from the Medical Segmentation Decathlon (lung tumor, spleen, and pancreas), with each dataset split 80%/20% into training and test sets (Yang et al., 10 Oct 2025). Preprocessing includes CT windowing, normalization, organ-focused cropping, and resizing to 512×512 pixels. Data augmentations encompass horizontal flipping, affine deformation, Gaussian blur/noise, and random mosaics. SAM2’s image encoder is frozen during training to preserve pre-trained general image representations, while the segmentation, SRPP, and BD heads are updated. The optimizer is AdamW with weight decay, using a cosine learning rate schedule. Training typically uses a batch size of 4 volumes, with early stopping on validation accuracy.
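As an illustration of the windowing and normalization steps mentioned above, the following minimal sketch clips Hounsfield-unit intensities to a window and rescales them to [0, 1]. The source does not state the window settings used, so the `center` and `width` arguments, like the function name `window_and_normalize`, are assumptions for demonstration only.

```python
def window_and_normalize(hu_values, center, width):
    """Clip HU intensities to [center - width/2, center + width/2]
    and rescale linearly to the range [0, 1]."""
    if width <= 0:
        raise ValueError("window width must be positive")
    lo = center - width / 2
    hi = center + width / 2
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in hu_values]
```

For example, with a hypothetical window of center 40 and width 400, values below -160 HU map to 0 and values above 240 HU map to 1.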

4. Empirical Evaluation and Comparative Performance

The main evaluation metrics are Dice similarity coefficient (DSC), Intersection-over-Union (IoU), Normalized Surface Dice (NSD), and 95th percentile Hausdorff distance (HD95). Quantitative results demonstrate that SAM2-3dMed achieves state-of-the-art performance across all evaluated targets:

Lung (MSD): Dice = 0.7627, IoU = 0.6544, HD95 = 3.51, NSD = 0.8197
Spleen (MSD): Dice = 0.9727, IoU = 0.9471, HD95 = 1.62, NSD = 0.9742
Pancreas (MSD): Dice = 0.7039, IoU = 0.5706, HD95 = 14.92, NSD = 0.6196

These results surpass variants of MedSAM-2, fine-tuned SAM2, nnU-Net, MedNeXt, and nnFormer. Ablation studies reveal that both the SRPP and BD modules are critical for peak performance: omitting SRPP yields notable Dice and HD95 drops, especially for anatomically irregular regions, and disabling BD degrades boundary precision and volume overlap.
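For reference, the two overlap metrics reported above (DSC and IoU) can be computed for a single pair of binary masks as in the sketch below. The function name `dice_and_iou` is hypothetical, masks are given as flattened 0/1 sequences, and returning 1.0 when both masks are empty is a convention assumed here, not one taken from the source.

```python
def dice_and_iou(pred, gt):
    """Dice coefficient and IoU for two flattened binary masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    n_pred = sum(1 for p in pred if p)
    n_gt = sum(1 for g in gt if g)
    union = n_pred + n_gt - inter
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * inter / (n_pred + n_gt)
    iou = inter / union
    return dice, iou
```

For one mask pair the two metrics are tied by IoU = Dice / (2 - Dice); this identity does not generally hold for values averaged over many cases, which is one reason averaged Dice and IoU figures need not satisfy it.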

5. Functional Significance of SRPP and Boundary Modules

SRPP encodes bidirectional anatomical context, which helps resolve ambiguities at object termini and supports robust handling of structures that appear only partially or intermittently along the scan axis. The BD branch directly targets clinical requirements for sharp, morphologically precise segmentation, substantially improving HD95 and NSD. Notably, BD-fused features sharpen mask delineation in anatomically complex areas and reduce spurious outlier voxels in the reconstructed 3D mask stack.

6. Analysis, Limitations, and Future Directions

Experiments confirm that SAM2-3dMed’s design resolves key domain mismatches inherent in video-centric segmentation foundation models: it captures anatomical continuity and achieves boundary-level accuracy required for medical image analysis. The SRPP module’s self-supervised training is low-cost and annotation-free, and the BD module is readily generalizable to new anatomical targets or imaging modalities.

Principal limitations include the computational and memory demands that scale with the number of slices and feature-map size, potential overfitting if the backbone is entirely unfrozen, and model validation thus far being limited to contrast-enhanced CT data. Future enhancements could involve adapting the architecture to MRI or multi-modal fusion scenarios, exploring other volumetric self-supervised tasks (e.g., rotation or landmark prediction), and investigating dynamic attention scaling for ultra-large volumes.

7. Context within the Medical AI Foundation Model Ecosystem

SAM2-3dMed occupies a distinct position among recent domain-adapted foundation models. Unlike adaptations that simply treat a volume as a video (MedSAM-2 (Zhu et al., 2024), BioSAM-2 (Yan et al., 2024)) or focus on prompt-engineered memory (SLM-SAM2 (Chen et al., 3 May 2025)), SAM2-3dMed uses self-supervised spatial ordering (SRPP) and supervised boundary cues to bridge anatomical and morphological modeling gaps (Yang et al., 10 Oct 2025). Its improvements over state-of-the-art approaches highlight both the necessity of domain-aware architectural changes and the potential of lightweight, auxiliary self-supervision in transferring video-centric models to spatially structured medical tasks. This offers a reproducible paradigm for future 3D adaptation of foundation models in the clinical context.