
MedSAM2: Prompt-Driven Medical Segmentation

Updated 12 November 2025
  • MedSAM2 is a suite of methods that adapts SAM2 for medical imaging by integrating prompt-based propagation and memory attention for volumetric segmentation.
  • It leverages a hierarchical Vision Transformer with a prompt encoder and mask decoder to accurately propagate a single annotated mask across 3D image volumes.
  • Domain-adaptive training with adapter fine-tuning and large-scale multi-modal pretraining significantly reduces annotation time while achieving state-of-the-art DSC and IoU.

MedSAM2 is a term applied to a suite of methods and models that adapt the Segment Anything Model 2 (SAM2)—originally developed for zero-shot image and video segmentation—to the rigorous demands of multi-modal, volumetric, and highly data-diverse medical image segmentation. These models leverage SAM2’s memory-based propagation and promptable inference to address annotation bottlenecks, few-shot settings, and domain generalization across modalities such as MRI, CT, ultrasound, and specialized medical videos. Significant MedSAM2 variants include nnSAM2, MedSAM2 foundation models, SAMed-2, and closely related prompt-refinement and memory-adaptive pipelines.

1. Foundational Model Architecture and Prompt Propagation

MedSAM2 variants uniformly inherit the core architectural split of SAM2: a hierarchical Vision Transformer (Hiera-based) image encoder, a prompt encoder (handling points, boxes, scribbles, or masks), a memory attention mechanism with temporal/volumetric context, and a lightweight transformer-based mask decoder. For volumetric or 3D data, each 2D medical slice is treated as a “frame,” enabling prompt-based mask propagation by modeling anatomical continuity.
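
To make this split concrete, the following minimal, self-contained sketch wires toy stand-in modules together in the same pattern (the real models use a Hiera encoder and a transformer mask decoder; every name, size, and module choice here is illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

class SliceSegmenter(nn.Module):
    """Toy stand-in for the SAM2-style split used by MedSAM2 variants:
    image encoder + prompt encoder + memory attention + mask decoder.
    All module choices and sizes are illustrative, not the real models."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.image_encoder = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # Hiera stand-in
        self.prompt_encoder = nn.Linear(4, dim)                  # encodes one box prompt
        self.memory_attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_decoder = nn.Linear(dim, 1)                    # per-token mask logits

    def forward(self, slice_2d, box=None, memory=None):
        tokens = self.image_encoder(slice_2d).flatten(2).transpose(1, 2)  # (B, N, C)
        if memory is not None:                                   # volumetric context
            ctx, _ = self.memory_attention(tokens, memory, memory)
            tokens = tokens + ctx
        if box is not None:                                      # prompt conditioning
            tokens = tokens + self.prompt_encoder(box).unsqueeze(1)
        return self.mask_decoder(tokens).squeeze(-1), tokens     # logits, features

# Treat each 2D slice as a "frame": prompt once, then propagate via memory.
model = SliceSegmenter()
volume = torch.randn(40, 1, 1, 256, 256)                         # 40 axial slices
box = torch.tensor([[64.0, 64.0, 192.0, 192.0]])                 # prompt on slice 0 only
memory, masks = None, []
for z, s in enumerate(volume):
    logits, tokens = model(s, box=box if z == 0 else None, memory=memory)
    masks.append((logits.sigmoid() > 0.5).view(16, 16))          # coarse token-grid mask
    memory = tokens if memory is None else torch.cat([memory, tokens], 1)[:, -2048:]  # ~8 slices
```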

A defining capability is “one-prompt few-shot” segmentation: a single annotated mask or prompt provided on a reference slice can be propagated throughout the volume using the memory attention pipeline. For example, in nnSAM2, one expert-annotated 2D mask per dataset is used as the prompt for SAM2, which then generates volumetric pseudo-labels with confidence scores. These pseudo-labels are crucial for downstream filtering and refinement operations (Zhang et al., 7 Oct 2025).
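
A plausible shape for that downstream filtering step is sketched below; the thresholds, confidence source, and function names are assumptions for illustration, not nnSAM2's actual criteria:

```python
import numpy as np

def filter_pseudo_labels(masks, confidences, min_conf=0.8, min_area=50):
    """Keep only propagated slice masks that are confident and large enough to
    serve as training pseudo-labels. Thresholds are illustrative assumptions.

    masks: (Z, H, W) binary array of propagated masks
    confidences: (Z,) per-slice confidence (e.g., a predicted IoU)
    Returns a dict mapping slice index -> mask for the retained slices.
    """
    kept = {}
    for z, (m, c) in enumerate(zip(masks, confidences)):
        if c >= min_conf and m.sum() >= min_area:  # drop low-confidence / spurious slices
            kept[z] = m
    return kept

# Example: a 30-slice volume with random masks and confidences.
rng = np.random.default_rng(0)
masks = rng.random((30, 64, 64)) > 0.7
conf = rng.random(30)
pseudo = filter_pseudo_labels(masks, conf)
print(f"kept {len(pseudo)}/30 slices as pseudo-labels")
```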

Advanced MedSAM2 frameworks also introduce specialized memory or tracking heads. The SLM-SAM2 variant incorporates both short-term (nearest previous slice) and long-term (sequence of most confident predictions) memory banks, each attending independently to promote both smooth global context and rapid adaptation to abrupt anatomical changes, suppressing the principal error mode of over-propagation into target-absent slices (Chen et al., 3 May 2025). SAMed-2 further introduces a confidence-driven memory bank with selective update and retrieval, alongside temporal adapters in each transformer block to reinforce inter-slice correlations specific to medical imagery (Yan et al., 4 Jul 2025).
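
The dual-bank idea can be sketched as follows; the capacities and the confidence measure are illustrative assumptions rather than SLM-SAM2's exact design:

```python
class DualMemoryBank:
    """Sketch of the short-/long-term memory split described for SLM-SAM2;
    capacities and the confidence score are illustrative assumptions."""
    def __init__(self, long_capacity=8):
        self.long_capacity = long_capacity
        self.short = None      # features of the nearest previous slice
        self.long = []         # (confidence, features) for the best slices so far

    def update(self, features, confidence):
        self.short = features  # short-term bank: always the last slice
        self.long.append((confidence, features))
        # Long-term bank: retain only the most confident predictions seen so far.
        self.long.sort(key=lambda t: t[0], reverse=True)
        del self.long[self.long_capacity:]

    def read(self):
        # Downstream, the two banks are attended to independently: the short-term
        # entry adapts to abrupt change; the long-term entries anchor global context.
        return self.short, [f for _, f in self.long]

bank = DualMemoryBank()
for conf, feats in [(0.9, "f0"), (0.4, "f1"), (0.95, "f2")]:
    bank.update(feats, conf)
short, long_term = bank.read()
print(short, long_term)   # 'f2' plus the most confident entries so far
```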

2. Domain-Specific Adaptation and Training Strategies

While the default SAM2 pipeline provides strong zero-shot generalization on “natural” domains, MedSAM2 variants universally demonstrate the benefit of domain-adaptive training. Adaptations include:

  • Adapter-Augmented Fine-Tuning: Lightweight adapters (DWConvAdapter, CNN-Adapter, or channel/spatial adapters) are inserted into the transformer blocks and mask decoder and trained on large-scale, multimodal medical datasets. Only the adapter weights (typically 2–5% of backbone parameters) are updated, preventing catastrophic forgetting and preserving generality (Cheng et al., 2023, Xie et al., 4 Feb 2025, Yan et al., 4 Jul 2025); a sketch of this pattern follows the list.
  • Prompt Diversity and Simulation: MedSAM2 models are trained on a curriculum of prompt types—single/box/point/mask—often simulating multi-step interactive segmentation, which matches medical annotation workflows.
  • Massive, Multi-Modal Pretraining: Datasets such as MedBank-100k, containing over 100,000 image–mask pairs across seven modalities (CT, MR, US, x-ray, fundus, dermoscopy, echocardiography), drive multi-task and continual learning required for medical deployment (Yan et al., 4 Jul 2025), while SAM-Med2D leverages over 4.6M images distributed across 31 organ/structure types (Cheng et al., 2023).
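
A minimal sketch of the adapter pattern from the first bullet, assuming a bottleneck adapter with a residual connection (module names and sizes are illustrative, not the cited papers' exact designs):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim=256, hidden=16):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the pretrained path intact

class AdaptedBlock(nn.Module):
    """A (frozen) pretrained block followed by a trainable adapter."""
    def __init__(self, block, dim=256):
        super().__init__()
        self.block, self.adapter = block, Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Freeze everything except the adapter weights before fine-tuning.
backbone = nn.Sequential(*[AdaptedBlock(nn.Linear(256, 256)) for _ in range(4)])
for name, p in backbone.named_parameters():
    p.requires_grad = "adapter" in name   # only the adapters train
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

In the real models the frozen block is a full transformer layer, so the trainable fraction is far smaller than in this toy.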

Key training losses include Dice, binary cross-entropy, focal loss, and explicit calibration losses on predicted mask confidence or IoU, which are essential for memory filtering and robust test-time behavior.
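
A hedged sketch of how such a composite objective might be assembled, including a calibration term that regresses the predicted IoU toward the realized mask IoU (the weights and details are illustrative, not taken from any specific paper):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    p = torch.sigmoid(logits)
    inter = (p * target).sum()
    return 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)

def focal_loss(logits, target, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pt = torch.exp(-bce)                        # probability of the correct class
    return ((1 - pt) ** gamma * bce).mean()

def iou_calibration_loss(pred_iou, logits, target, thr=0.5):
    """Regress the head's predicted IoU toward the actual mask IoU, so the
    confidence used for memory filtering is calibrated (illustrative sketch)."""
    mask = (torch.sigmoid(logits) > thr).float()
    inter = (mask * target).sum()
    union = mask.sum() + target.sum() - inter
    true_iou = inter / union.clamp(min=1.0)
    return F.mse_loss(pred_iou, true_iou.detach())

def total_loss(logits, pred_iou, target, w=(1.0, 1.0, 1.0, 1.0)):
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return (w[0] * dice_loss(logits, target) + w[1] * bce
            + w[2] * focal_loss(logits, target)
            + w[3] * iou_calibration_loss(pred_iou, logits, target))

# Toy check on random data.
logits = torch.randn(1, 64, 64)
target = (torch.rand(1, 64, 64) > 0.5).float()
print(total_loss(logits, torch.tensor(0.5), target))
```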

3. Prompt, Tracking, and Memory Mechanisms

The MedSAM2 ecosystem is characterized by architectural innovations in prompt encoding, memory management, and mask tracking:

  • Bidirectional Memory Attention: Memory banks, originally designed for unidirectional temporal propagation in video, are reengineered to accommodate bidirectional anatomical coherence between slices; models such as SAM2-3dMed add explicit modules for bidirectional Slice Relative Position Prediction, directly optimizing for anatomical adjacency (Yang et al., 10 Oct 2025).
  • Confidence-Driven Memory Management: Confidence metrics, derived from mask probability or IoU predictions, are integral for selective storage, retrieval, and update in memory banks, as in SAMed-2 and MedSAM-2 (Yan et al., 4 Jul 2025, Zhu et al., 1 Aug 2024).
  • Boundary Detection Enhancements: To address a frequent failure mode of over-segmentation or under-segmentation at indistinct anatomical boundaries, dedicated boundary detection (BD) modules or explicit loss terms are introduced, fusing memory-augmented feature maps for localized boundary sharpening (Yang et al., 10 Oct 2025).
  • Prompt Generation and Refinement Loops: Automated UNet-based prompt generators supply initial prompts for SAM2, with subsequent cycles of memory-based mask refinement, yielding superior accuracy even in minimal-supervision settings (e.g., one annotated slice per volume) (Xie et al., 4 Feb 2025, Zhang et al., 7 Oct 2025); a sketch of such a loop follows the list.
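
A loop of this shape can be sketched as follows; both segmenters are stand-ins for real networks, and the box-derivation helper is a hypothetical convenience:

```python
import numpy as np

def bbox_from_mask(mask):
    """Derive a box prompt from a coarse mask (hypothetical helper)."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

def refine_loop(image, coarse_segmenter, promptable_segmenter, rounds=3):
    """A coarse model (e.g., a UNet) proposes a mask; a box prompt derived from
    it drives the promptable model; the output seeds the next round.
    Sketch only: both callables stand in for real networks."""
    mask = coarse_segmenter(image)
    for _ in range(rounds):
        box = bbox_from_mask(mask)
        if box is None:
            break
        mask = promptable_segmenter(image, box)   # refined prediction
    return mask

# Toy stand-ins: a threshold as the "UNet", a percentile cut as the promptable model.
coarse = lambda img: img > img.mean()
refine = lambda img, box: img > np.percentile(img, 75)
out = refine_loop(np.random.default_rng(1).random((64, 64)), coarse, refine)
print(out.sum(), "foreground pixels after refinement")
```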

4. Quantitative Evaluation and Statistical Equivalence

Robust performance assessment is a hallmark of MedSAM2 research. Common evaluation metrics across works include the Dice Similarity Coefficient (DSC),

\mathrm{DSC}(G, P) = \frac{2\,|G \cap P|}{|G| + |P|}

and Intersection-over-Union (IoU), Hausdorff Distance, and Normalized Surface Dice (NSD).
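
DSC and IoU follow directly from the definition above; a straightforward NumPy implementation (the sample masks are arbitrary toy data):

```python
import numpy as np

def dsc(g, p):
    """Dice Similarity Coefficient between binary masks G and P."""
    g, p = g.astype(bool), p.astype(bool)
    denom = g.sum() + p.sum()
    return 2.0 * np.logical_and(g, p).sum() / denom if denom else 1.0

def iou(g, p):
    """Intersection-over-Union between binary masks G and P."""
    g, p = g.astype(bool), p.astype(bool)
    union = np.logical_or(g, p).sum()
    return np.logical_and(g, p).sum() / union if union else 1.0

g = np.zeros((8, 8), bool); g[2:6, 2:6] = True
p = np.zeros((8, 8), bool); p[3:7, 3:7] = True
print(dsc(g, p), iou(g, p))   # 0.5625 and ~0.391 for this toy pair
```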

Advanced works perform rigorous statistical equivalence testing using Two One-Sided Tests (TOST), assessing whether automated measurements (e.g., muscle volume, fat ratio, CT attenuation) are clinically indistinguishable from expert manual segmentation—a crucial property for actual deployment (Zhang et al., 7 Oct 2025). The Intraclass Correlation Coefficient (ICC) is reported to quantify measurement reliability, reaching 0.86–1.00 in leading pipelines.
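
A minimal sketch of a paired TOST using SciPy, with an illustrative equivalence margin (actual margins are study-specific and clinically motivated, and the cited works' exact test setups may differ):

```python
import numpy as np
from scipy import stats

def paired_tost(auto, manual, margin):
    """Two One-Sided Tests on paired differences: the methods are declared
    equivalent if the mean difference lies within (-margin, +margin).
    Rejecting both one-sided nulls (p < alpha) supports equivalence."""
    d = np.asarray(auto) - np.asarray(manual)
    p_lower = stats.ttest_1samp(d, -margin, alternative="greater").pvalue
    p_upper = stats.ttest_1samp(d, margin, alternative="less").pvalue
    return max(p_lower, p_upper)

# Toy data: simulated manual vs. automated muscle-volume measurements (cm^3).
rng = np.random.default_rng(0)
manual = rng.normal(100.0, 5.0, size=30)
auto = manual + rng.normal(0.2, 1.0, size=30)
p = paired_tost(auto, manual, margin=2.0)   # margin of 2 cm^3 is purely illustrative
print(f"TOST p = {p:.4f}  (p < 0.05 supports equivalence)")
```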

Performance highlights include:

| Model/Setting | Modality/Task | DSC (%) | Notable Finding |
| --- | --- | --- | --- |
| nnSAM2 (Zhang et al., 7 Oct 2025) | Lumbar muscle (MR/CT) | 94–96 | ICC 0.86–1.00; statistically equivalent measurements |
| MedSAM2 (Ma et al., 4 Apr 2025) | Multi-organ CT/MR/PET | 87–89 | >85% reduction in annotation effort |
| SAMed-2 (Yan et al., 4 Jul 2025) | Internal/external average | 69–89 | Large gains from confidence-driven memory |
| SAM2-3dMed (Yang et al., 10 Oct 2025) | Spleen (MSD task) | 97.3 | Outperforms prior SOTA by 0.5–2% Dice |

A consistent theme is narrowing or surpassing the accuracy gap to nnUNet and specialist architectures, especially as propagation and refinement techniques become more sophisticated.

5. Open-Source Tooling, Practical Workflows, and Deployment

MedSAM2 models are integrated into widely used clinical and research platforms via dedicated plug-ins and code releases. 3D Slicer extensions such as SegmentWithSAM enable interactive annotation of 3D volumes by placing point prompts on one or a few slices, with downstream propagation using the video (memory) mode of SAM2 (Yildiz et al., 27 Aug 2024, Ma et al., 4 Apr 2025). The workflow (sketched in code below) covers:

  1. Volume loading and pre-embedding computation.
  2. User placement of point or box prompts on selected slices.
  3. Single-slice (2D) segmentation or bi-/uni-directional propagation throughout the volume.
  4. Iterative correction and re-propagation as needed.
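
The four steps above map naturally onto the upstream SAM2 video-predictor API (facebookresearch/sam2). The sketch below uses that public API; the config, checkpoint, slice-directory paths, and prompt coordinates are placeholders, and MedSAM2 forks may expose slightly different entry points:

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor  # upstream SAM2 package

# Placeholder config/checkpoint; substitute the MedSAM2 weights you actually use.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_t.yaml",
                                       "checkpoints/sam2.1_hiera_tiny.pt")

# Steps 1-2: load the "video" (a directory of slice images) and prompt one slice.
state = predictor.init_state(video_path="slices/")       # one image per axial slice
predictor.add_new_points_or_box(
    state, frame_idx=20, obj_id=1,
    points=np.array([[128, 128]], dtype=np.float32),     # one foreground click
    labels=np.array([1], dtype=np.int32),
)

# Step 3: propagate the prompted mask through the remaining slices.
masks = {}
for frame_idx, obj_ids, logits in predictor.propagate_in_video(state):
    masks[frame_idx] = (logits[0] > 0.0).cpu().numpy()   # binarize the mask logits

# Step 4: inspect the result, add corrective prompts on failing slices, re-propagate.
```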

Human-in-the-loop pipelines incorporating MedSAM2 have demonstrated annotation time reductions of over 85%. For example, when annotating CT lesions in DeepLesion, three rounds of MedSAM2-assisted refinement reduced annotation time from a manual baseline of roughly 526 s per lesion to 74 s, a reduction of about 86% (Ma et al., 4 Apr 2025).

Python APIs, command-line scripts, and cloud-hosted inference endpoints (e.g., based on Gradio) extend accessibility for both batch and interactive segmentation scenarios.
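
As an illustration of such a Gradio-style endpoint, here is a minimal sketch with a toy thresholding backend standing in for real MedSAM2 inference (the interface layout and the `segment` function are assumptions, not a published demo):

```python
import gradio as gr
import numpy as np

def segment(image, x, y):
    """Stand-in for a point-prompted MedSAM2 inference call.
    Toy backend: threshold relative to the clicked pixel's intensity."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    mask = (gray > gray[int(y), int(x)] * 0.8).astype(np.uint8) * 255
    return mask

demo = gr.Interface(
    fn=segment,
    inputs=[gr.Image(), gr.Number(label="prompt x"), gr.Number(label="prompt y")],
    outputs=gr.Image(label="mask"),
    title="Point-prompted segmentation (toy backend)",
)

if __name__ == "__main__":
    demo.launch()   # serves an interactive endpoint; swap in real model inference
```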

6. Limitations, Challenges, and Future Directions

Despite substantial advances, current MedSAM2 solutions face characteristic limitations:

  • Prompt Type Constraints: Some models are limited to bounding-box prompts, underperforming on thin/branching targets or highly irregular morphology.
  • Memory Length and Drift: Fixed memory-bank lengths (e.g., 8 slices/frames) may not capture extended anatomy or abrupt transitions, leading to error accumulation or over-propagation.
  • Low-Contrast and Noisy Data: Both domain gap (e.g., between natural and medical image texture) and extreme intensity variation challenge prompt propagation and fine-tuning. Failure modes include drift into adjacent anatomy and inconsistent boundary detection—typically addressed by negative prompts or boundary-specific modules.
  • Semantic Segmentation Limitation: Most models still output binary masks, lacking native support for semantic labels without additional customization.
  • Computational Demands: Although small (Tiny) backbones and adapter-based updates minimize overhead, advanced memory mechanisms (dual bank, confidence filtering) increase inference complexity.

Continued directions include:

  • General-purpose prompt encoders supporting boxes, points, and scribbles.
  • Dynamic/adaptive memory management and learning-based memory gating.
  • Explicit semantic heads enabling multi-class segmentation.
  • Domain adaptation for seldom-seen modalities or pathologies.
  • Support for lightweight deployment (quantization, CPU/edge inference).
  • Interactive pipelines for on-the-fly prompt refinement.
  • Advanced calibration and uncertainty estimation for error flagging.

7. Significance for Medical Image Analysis

MedSAM2 marks a pivotal transition from task-specific supervised approaches to highly flexible, prompt-driven, and memory-augmented segmentation paradigms. By framing medical volumes as videos and exploiting memory-based context, these models offer expert-level accuracy from as little as a single prompt per volume, broad multi-modality generalization, and dramatic efficiency improvements for annotators. Rigorous, statistically-grounded validation demonstrates their suitability for large-scale, reproducible studies in radiomics, morphometry, and computer-aided diagnosis, with open code bases accelerating field-wide adoption and further research (Zhang et al., 7 Oct 2025, Ma et al., 4 Apr 2025, Yan et al., 4 Jul 2025).
