MedSAM-2: Unified 3D Medical Segmentation

Updated 4 June 2026

MedSAM-2 is a foundation model that unifies 3D medical image and video segmentation using a prompt-based transformer architecture and advanced memory mechanisms.
It introduces innovations such as a self-sorting memory bank and a cross-attention fusion block to improve mask decoding and handle complex anatomical variations.
The model achieves a new Pareto optimum by delivering high segmentation accuracy with low computational cost, reducing annotation time by over 85% in clinical workflows.

MedSAM-2 is a foundation model designed to unify 3D medical image and video segmentation by leveraging the prompt-based video segmentation paradigm established by Segment Anything Model 2 (SAM-2). MedSAM-2 incorporates domain-specific architectural modifications and advanced memory mechanisms to achieve state-of-the-art segmentation accuracy and efficiency across a broad spectrum of medical imaging tasks, including complex tasks such as tumor delineation in lung CT and volumetric segmentation in challenging modalities. The model establishes a new accuracy/computational cost Pareto optimum and enables label-efficient workflows adaptable to diverse clinical environments (Ayllón et al., 2 May 2025, Ma et al., 4 Apr 2025, Zhu et al., 2024).

1. Architectural Innovations

MedSAM-2 inherits the core encoder–decoder transformer paradigm of SAM-2 but includes key adaptations for medical volumetric data. The model recasts 3D image segmentation as a video object tracking problem, treating volumetric slices or video frames as temporally ordered inputs to capture both spatial and slice-to-slice (or frame-to-frame) coherence (Ayllón et al., 2 May 2025, Zhu et al., 2024). The principal model advances include:

Self-Sorting Memory Bank: At each decoding step, a dynamic set of learned embeddings $\{E_i\}$ is maintained, where the $k$ most relevant memory entries are selected by similarity to the current slice features. This context-aware conditioning improves mask decoding for structures with substantial anatomical or morphological heterogeneity (Ayllón et al., 2 May 2025, Zhu et al., 2024).
Cross-Attention Fusion Block: The mask decoder integrates the aggregated memory embeddings $E_\mathrm{mem} \in \mathbb{R}^{k \times C}$ with the per-slice features $F \in \mathbb{R}^{C \times H \times W}$ via a cross-attention mechanism:

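$A = \mathrm{softmax}\big((W_q \tilde{F})^\top (W_k E_{\mathrm{mem}}^\top)\big) \in \mathbb{R}^{(HW) \times k}, \qquad O = \mathrm{reshape}\big((A \, E_{\mathrm{mem}} W_v^\top)^\top, C, H, W\big)$

Here $\tilde{F} \in \mathbb{R}^{C \times HW}$ denotes $F$ with its spatial dimensions flattened, $W_q$, $W_k$, $W_v$ are learned projection matrices, and the softmax is taken over the $k$ memory entries; the transposes are written out so that every product is dimensionally consistent with the stated shapes.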

The resulting output $O$ is added to $F$ prior to mask prediction (Ayllón et al., 2 May 2025).

Prompt Integration and Propagation: MedSAM-2 accommodates bounding box and point prompts, propagates prompt information across slices, and maintains context via the memory bank during mask propagation (Ma et al., 4 Apr 2025).
No Novel Convolutional Block: Beyond the cross-attention and memory mechanisms, no new convolutional architectures are introduced (Ayllón et al., 2 May 2025).
Highly Modular Implementation: The memory bank, cross-attention fusion, and prompt encoder are modular and extensible, supporting integration into platforms such as 3D Slicer or Gradio (Yildiz et al., 2024, Ma et al., 4 Apr 2025).
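
The memory selection and cross-attention fusion described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the function names `select_top_k` and `cross_attention_fusion` are hypothetical, cosine similarity against a globally pooled slice feature is assumed as the relevance measure (the source only says "similarity"), and the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_top_k(memory, query, k):
    """Return the k memory rows most similar to `query`.

    memory: (N, C) bank of embeddings; query: (C,) summary of the current slice.
    Cosine similarity is an assumption made for this sketch.
    """
    mem_n = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    q_n = query / (np.linalg.norm(query) + 1e-8)
    scores = mem_n @ q_n                 # (N,) similarity per memory entry
    top = np.argsort(-scores)[:k]        # indices in descending similarity
    return memory[top]                   # (k, C)

def cross_attention_fusion(F, E_mem, W_q, W_k, W_v):
    """Fuse per-slice features F (C, H, W) with memory E_mem (k, C)."""
    C, H, W = F.shape
    F_flat = F.reshape(C, H * W)         # (C, HW)
    Q = W_q @ F_flat                     # (C, HW) queries, one per pixel
    K = W_k @ E_mem.T                    # (C, k) keys, one per memory entry
    V = E_mem @ W_v.T                    # (k, C) values
    A = softmax(Q.T @ K, axis=1)         # (HW, k), each row sums to 1
    O = (A @ V).T.reshape(C, H, W)       # back to feature-map layout
    return F + O, A                      # residual addition before mask prediction

# Toy usage with random tensors (all sizes are arbitrary).
rng = np.random.default_rng(0)
C, H, W, N, k = 8, 4, 4, 16, 3
F = rng.normal(size=(C, H, W))
memory = rng.normal(size=(N, C))
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
E_mem = select_top_k(memory, F.mean(axis=(1, 2)), k)
fused, A = cross_attention_fusion(F, E_mem, W_q, W_k, W_v)
print(fused.shape, A.shape)
```

Each row of `A` is a distribution over the `k` selected memory entries, so every pixel receives a convex combination of projected memory values before the residual addition.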

2. Training Paradigm and Loss Functions

MedSAM-2 training employs full fine-tuning, with no parameter freezing, on large-scale, curated medical datasets. The standard optimization pipeline comprises:

Loss Combination: The principal objective combines pixel-level binary cross-entropy and Dice loss:

$L_{\mathrm{BCE}}(p, y) = -[y\log p + (1-y)\log(1-p)], \qquad L_{\mathrm{Dice}}(p, y) = 1 - \frac{2\sum_i p_i y_i}{\sum_i p_i + \sum_i y_i}$

with global loss $L = \lambda L_{\mathrm{BCE}} + (1-\lambda) L_{\mathrm{Dice}}$ , $\lambda \approx 0.5$ (Ayllón et al., 2 May 2025).
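
As a minimal sketch, the combined objective can be written directly from the two formulas above. The function names are hypothetical, the BCE term is averaged over pixels (an assumption, since the formula above is stated per pixel), and a small `eps` is added for numerical stability; neither detail is taken from the cited papers.

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    # Pixel-wise binary cross-entropy, averaged over all pixels (assumed reduction).
    p = np.clip(p, eps, 1.0 - eps)       # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def dice_loss(p, y, eps=1e-7):
    # Soft Dice loss: 1 - 2*sum(p*y) / (sum(p) + sum(y)).
    inter = np.sum(p * y)
    return float(1.0 - 2.0 * inter / (np.sum(p) + np.sum(y) + eps))

def combined_loss(p, y, lam=0.5):
    # L = lam * L_BCE + (1 - lam) * L_Dice, with lam around 0.5 as stated above.
    return lam * bce_loss(p, y) + (1.0 - lam) * dice_loss(p, y)

# Toy check: an uninformative prediction of 0.5 everywhere.
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.full(4, 0.5)
print(combined_loss(p, y))
```

For this toy input the BCE term equals $\log 2$ and the Dice term equals 0.5, so the printed value is their average.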

Optimizer: AdamW with default SAM-2 scheduler, no custom schedule (Ayllón et al., 2 May 2025, Ma et al., 4 Apr 2025).
Epochs and Augmentation: Trained for up to 1,000 epochs using random flips, crops, intensity jitter, and standard modality-specific normalization (e.g., CT windowing to [-1000, 1000] HU followed by rescaling to [0, 1]) (Ayllón et al., 2 May 2025). Augmentation for balancing modalities is also applied in the large MedSAM2 cohort (Ma et al., 4 Apr 2025).
Prompt Variants: Both bounding box and point prompting are supported during training and fine-tuning, enabling flexible semi-automated annotation scenarios (Ayllón et al., 2 May 2025).
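
The CT normalization and flip augmentation mentioned above might look as follows. This is a sketch only: `window_ct` and `random_flip` are hypothetical helper names, the flip probability of 0.5 per axis is an assumption, and crops and intensity jitter are omitted.

```python
import numpy as np

def window_ct(volume_hu, lo=-1000.0, hi=1000.0):
    # Clip intensities to the [lo, hi] HU window, then rescale linearly to [0, 1].
    clipped = np.clip(volume_hu.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)

def random_flip(volume, rng, axes=(1, 2)):
    # Flip the volume along each in-plane axis with probability 0.5 (assumed).
    for ax in axes:
        if rng.random() < 0.5:
            volume = np.flip(volume, axis=ax)
    return volume

# Toy usage: five HU values spanning below, inside, and above the window.
hu = np.array([-2000, -1000, 0, 1000, 3000])
print(window_ct(hu))
```

Values outside the window saturate at 0 or 1, so air and dense bone no longer dominate the intensity range seen by the encoder.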

3. Datasets, Evaluation Protocols, and Metrics

MedSAM-2 has been benchmarked on diverse datasets and segmentation tasks:

Datasets:
- Lung1 (NSCLC-Radiomics): 304 CT volumes, preprocessed with isotropic resampling and intensity clipping (Ayllón et al., 2 May 2025).
- Task06 (MSD): 63 CT volumes, identically preprocessed (Ayllón et al., 2 May 2025).
- Extensive volumes for organs, lesions, and multiple modalities in the large MedSAM2 study (363k CT, 14.8k PET, 77k MRI volumes; 76k ultrasound and endoscopy frames) (Ma et al., 4 Apr 2025).
Protocol: Fine-tuning at fractions 0%, 25%, 50%, 75%, 100% of the training set; no freezing; evaluation on held-out test sets; use of both bounding box and point prompting modes (Ayllón et al., 2 May 2025).
Metrics:
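- Dice Similarity Coefficient (DSC): $\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|}$, where $P$ is the predicted mask and $G$ the ground-truth mask
- Intersection over Union (IoU): $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$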
- Precision/Recall and others: Extended in broader MedSAM2 evaluation (Ma et al., 4 Apr 2025).
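
For binary masks, the two overlap metrics listed above can be computed with the standard set-overlap definitions, as in the sketch below. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example, not one taken from the cited evaluations.

```python
import numpy as np

def dice_coefficient(pred, gt):
    # DSC = 2 * |P ∩ G| / (|P| + |G|) for binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0                       # both masks empty (assumed convention)
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

def iou(pred, gt):
    # IoU = |P ∩ G| / |P ∪ G| for binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                       # both masks empty (assumed convention)
    return float(np.logical_and(pred, gt).sum() / union)

# Toy usage: the masks share one of three foreground pixels.
pred = np.array([1, 1, 0, 0])
gt = np.array([1, 0, 1, 0])
print(dice_coefficient(pred, gt), iou(pred, gt))
```

The two metrics are monotonically related (DSC = 2·IoU / (1 + IoU)), so they rank methods identically on a single case but can differ once averaged over a test set.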

4. Quantitative Performance and Computational Efficiency

MedSAM-2 improves upon conventional and contemporary segmentation baselines in both accuracy and efficiency:

| Method | Dice (Lung1) | Dice (Task06) | GMACs | Parameters | Notes |
|---|---|---|---|---|---|
| U-Net | 0.053 | 0.0087 | < 10 | ~30M | Non-competitive |
| DeepLabV3 | 0.0532 | 0.0087 | < 10 | ~30M | Non-competitive |
| nnU-Net 2D | 0.9039 | 0.8736 | 24,062 | ~40M | High cost |
| nnU-Net 3D full-res | 0.7023 | 0.8487 | 118,194 | ~50M | Highest cost |
| MedSAM | 0.6441 | 0.7230 | – | – | Prior foundation |
| MedSAM-2 (point) | 0.7974 | 0.7974 | 226 | ~64M | Single point prompt |
| MedSAM-2 (bbox) | 0.9091 | 0.8770 | 226 | ~64M | SOTA, best trade-off |

MedSAM-2 (bounding box) thus reaches a DSC of up to 0.91 at only 226 GMACs with a moderate parameter count (~64M). It scores higher than every baseline in the comparison while requiring roughly 500 times fewer GMACs than nnU-Net 3D full-res (118,194) and about 100 times fewer than nnU-Net 2D (24,062) (Ayllón et al., 2 May 2025).
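
The size of the efficiency gap follows from the GMAC column of the table above. A quick check, using only the figures reported there (the dictionary and the function name `cost_ratio` are illustrative):

```python
import math

# GMACs as reported in the comparison table.
gmacs = {
    "nnU-Net 2D": 24062,
    "nnU-Net 3D full-res": 118194,
    "MedSAM-2 (bbox)": 226,
}

def cost_ratio(baseline, reference="MedSAM-2 (bbox)"):
    # How many times more GMACs the baseline needs than the reference model.
    return gmacs[baseline] / gmacs[reference]

for name in ("nnU-Net 2D", "nnU-Net 3D full-res"):
    r = cost_ratio(name)
    print(f"{name}: {r:.0f}x the GMACs, about {math.log10(r):.1f} orders of magnitude")
```

The ratios come out at roughly 106 and 523, meaning between two and three orders of magnitude of compute separate MedSAM-2 from the nnU-Net variants.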

5. Prompting Mechanisms and Annotation Efficiency

MedSAM-2 generalizes prompt types and enables significant reduction in manual effort:

Prompt Types: Supports bounding boxes and single-point prompts as first-class prompt modalities (Ayllón et al., 2 May 2025).
Annotation Workflows: Empirical human-in-the-loop pipelines demonstrate that bounding box prompting with MedSAM2 (in 3D) can reduce annotation time by over 85% for large-scale lesion, organ, and clinical video segmentation, outperforming both baseline models and earlier manual or interactive annotation schemes (Ma et al., 4 Apr 2025).
Adaptability to Novel Scenes: MedSAM-2 maintains robust performance when fine-tuned on as little as 25–50% of site-specific data, facilitating rapid deployment with minimal annotation burden (Ayllón et al., 2 May 2025).
Interactive Integration: The model is deployed within 3D Slicer extensions and Gradio interfaces, supporting point-and-click and bounding box interaction on volumetric scans as well as interactive correction (Yildiz et al., 2024).
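
To make the two prompt types concrete, the sketch below derives a bounding-box prompt and a single-point prompt from a reference mask on one slice, as one might do when simulating prompts for fine-tuning or evaluation. The helpers `mask_to_bbox` and `mask_to_point` are hypothetical and are not part of any MedSAM-2 API; the `(x_min, y_min, x_max, y_max)` box layout and the centroid-nearest point rule are assumptions made for illustration.

```python
import numpy as np

def mask_to_bbox(mask, margin=0):
    # Tight box around the foreground, optionally padded by `margin` pixels.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                      # no foreground on this slice
    h, w = mask.shape
    x0 = max(int(xs.min()) - margin, 0)
    y0 = max(int(ys.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin, w - 1)
    y1 = min(int(ys.max()) + margin, h - 1)
    return (x0, y0, x1, y1)

def mask_to_point(mask):
    # Foreground pixel closest to the centroid, so the point lies inside the object.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    cy, cx = ys.mean(), xs.mean()
    i = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    return (int(xs[i]), int(ys[i]))      # (x, y) order

# Toy usage: a 2x4 rectangle inside a 6x6 slice.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:5] = 1
print(mask_to_bbox(mask), mask_to_bbox(mask, margin=1), mask_to_point(mask))
```

Snapping the point to the nearest foreground pixel matters for concave or ring-shaped structures, where the raw centroid can fall outside the object.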

6. Clinical and Practical Implications

MedSAM-2’s improvements in segmentation accuracy and interaction efficiency have direct impacts:

Tumor Tracking and Volume Estimation: Enhanced accuracy supports precise volumetric tumor measurements for growth monitoring and treatment planning, enabling reduction in radiotherapy margins and sparing healthy tissue (Ayllón et al., 2 May 2025).
Workflow Integration: Low computational requirements (single A100 GPU, real-time response) render MedSAM-2 feasible for direct PACS integration and semi-automated contouring by radiologists, reducing inter-observer variability (Ayllón et al., 2 May 2025).
Label-Efficiency and Transferability: The ability to reach near-optimal performance when fine-tuned on only a fraction of the site-specific data supports deployment in resource-limited environments or for rare anatomical/oncological cases (Ayllón et al., 2 May 2025).
Scaling to Clinical Annotation: Demonstrated >85% reduction in annotation time across multiple modalities and use cases, with several hundred thousand frames labeled in iterative human-in-the-loop refinement cycles (Ma et al., 4 Apr 2025).
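
Volumetric tumor measurement from a predicted mask reduces to counting foreground voxels and multiplying by the physical voxel size. The sketch below assumes the voxel spacing is known in millimetres (for example from the image header); the function names are hypothetical.

```python
import numpy as np

def mask_volume_ml(mask, spacing_mm):
    # Volume = number of foreground voxels * voxel volume; 1 mL = 1000 mm^3.
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.astype(bool).sum()) * voxel_mm3 / 1000.0

def relative_change(v_baseline, v_followup):
    # Fractional volume change between two time points (positive means growth).
    return (v_followup - v_baseline) / v_baseline

# Toy usage: a 10x10x10 voxel block at two different voxel spacings.
mask = np.ones((10, 10, 10), dtype=np.uint8)
v_iso = mask_volume_ml(mask, (1.0, 1.0, 1.0))
v_aniso = mask_volume_ml(mask, (2.0, 0.5, 0.5))
print(v_iso, v_aniso, relative_change(v_iso, v_aniso))
```

Because the estimate scales linearly with the voxel count, small boundary errors on large, thin, or irregular lesions translate directly into volume error, which is why segmentation accuracy matters for longitudinal tracking.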

7. Limitations and Future Directions

Despite strong performance, MedSAM-2 presents practical and open research challenges:

Prompt Type Limitations: The current pipeline relies primarily on bounding boxes; further extension to support fine-scale prompts such as negative points, scribbles, or multi-instance selection could enhance performance on complex, branching, or elongated structures (Ma et al., 4 Apr 2025).
Fixed Memory Size: The default memory bank, fixed at eight frames or slices, may struggle with highly non-linear motion or highly variable anatomy (as in certain videos), motivating research into adaptive or longer memory strategies (Ma et al., 4 Apr 2025).
Boundary and Label Uncertainty: No explicit uncertainty quantification is provided in the output, and boundary delineation in highly variable regions is not always robust without further interactive correction (Ma et al., 4 Apr 2025).
Real-time CPU Deployment: While GPU efficiency is high, extension to real-time CPU-based workflows will require further model compression, quantization, or distillation (Ma et al., 4 Apr 2025).

MedSAM-2 exemplifies the integration of prompt-based, memory-augmented video transformer architectures into the domain of volumetric medical image segmentation. Extensive benchmarking establishes its superiority in segmentation accuracy and efficiency for lung tumor CT, organs across modalities, and clinical videos, providing a foundation for accelerated clinical annotation, improved longitudinal tracking, and scalable deployment in multi-institutional workflows (Ayllón et al., 2 May 2025, Ma et al., 4 Apr 2025, Zhu et al., 2024).