MedSAM-2: Transformer for Medical Segmentation

Updated 23 September 2025
  • MedSAM-2 is a family of transformer-based models that reinterprets 2D and 3D medical image segmentation through architectural innovations, prompt automation, and enhanced memory mechanisms.
  • It integrates a hierarchical image encoder, advanced prompt encoder, and self-sorting memory attention module to achieve improved segmentation accuracy across various medical imaging modalities.
  • MedSAM-2 employs adapted fine-tuning, test-time adaptation, and human-in-the-loop workflows to ensure robust, efficient performance in both research and clinical applications.

Medical SAM 2 (MedSAM-2) is a family of transformer-based foundation models for medical image segmentation, designed to extend the flexibility, scalability, and prompt-driven paradigm of the Segment Anything Model 2 (SAM-2) to the medical imaging domain. MedSAM-2 and its derivatives reinterpret both 2D and 3D (volumetric) medical image segmentation through architectural innovations, prompt automation, memory mechanisms, and domain adaptation strategies, resulting in significant advances in universal, interactive, and automated medical segmentation across diverse imaging modalities and clinical tasks.

1. Architectural Foundations and Innovations

MedSAM-2 is fundamentally constructed on the SAM-2 architecture, which consists of a hierarchical vision transformer image encoder, a prompt encoder, a dynamic memory attention mechanism, and a mask decoder (Zhu et al., 1 Aug 2024, Ma et al., 4 Apr 2025). Key architectural principles and variant enhancements include the following (a minimal forward-pass sketch appears after the list):

  • Image Encoder: Based on hierarchical transformers (e.g., Hiera), adapted for medical image sizes (downscaling to 3×512×512 or similar). Supports both 2D images/slices and 3D video-like sequences, yielding multi-scale feature maps suitable for volumetric or temporally coherent tasks (Ma et al., 4 Apr 2025).
  • Prompt Encoder: Extends standard point and box prompts to support scribbles, learned prompt embeddings, and class-based semantic prompts. Several MedSAM-2 variants replace manual prompt encoding with automated or learnable prompt generation modules to eliminate or minimize user interaction (Gaillochet et al., 30 Sep 2024, Huang et al., 5 Feb 2025, Xie et al., 4 Feb 2025, Xing et al., 24 Jun 2025).
  • Memory Attention Module: Inherits the memory bank and memory attention design of SAM-2, enabling propagation of segmentation information across frames or slices. Enhanced variants include confidence-driven, self-sorting, or dual-branch memory (short-/long-term) modules to improve robustness to noisy frames, uncertain boundaries, and error accumulation in volumetric data (Zhu et al., 1 Aug 2024, Yan et al., 4 Jul 2025, Chen et al., 3 May 2025).
  • Mask Decoder: Fuses image, prompt, and memory-conditioned features to generate segmentation masks via cross-attention, upsampling, and (in advanced variants) additional calibration or semantic labeling heads.
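
To make the data flow concrete, here is a minimal forward-pass sketch of how the four components compose. The module stand-ins (a single convolution for the Hiera encoder, a linear layer for the point-prompt encoder, and so on) and all shapes are illustrative assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class MedSAM2Sketch(nn.Module):
    """Toy stand-in for the encoder -> prompt -> memory -> decoder pipeline."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Hierarchical image encoder: image -> downscaled feature map.
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Prompt encoder: normalized (x, y) point prompts -> sparse embeddings.
        self.prompt_encoder = nn.Linear(2, embed_dim)
        # Memory attention: conditions current features on stored memories.
        self.memory_attention = nn.MultiheadAttention(embed_dim, num_heads=8,
                                                      batch_first=True)
        # Mask decoder: fused features -> per-pixel mask logits.
        self.mask_decoder = nn.Sequential(
            nn.Conv2d(embed_dim, 1, kernel_size=1),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
        )

    def forward(self, image, point_prompt, memory_bank):
        feats = self.image_encoder(image)                  # (B, C, H/16, W/16)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)          # (B, HW, C)
        prompt = self.prompt_encoder(point_prompt)         # (B, 1, C)
        # Attend over the memory bank (e.g. features of previous slices).
        if memory_bank is not None:
            tokens, _ = self.memory_attention(tokens, memory_bank, memory_bank)
        fused = (tokens + prompt).transpose(1, 2).reshape(b, c, h, w)
        return self.mask_decoder(fused)                    # (B, 1, H, W) logits

model = MedSAM2Sketch()
image = torch.randn(1, 3, 512, 512)    # downscaled medical slice
prompt = torch.rand(1, 1, 2)           # one normalized point prompt
memory = torch.randn(1, 16, 256)       # 16 stored memory tokens
logits = model(image, prompt, memory)  # (1, 1, 512, 512)
```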

These components are trained with combinations of loss functions such as Dice, binary cross-entropy, and focal losses, sometimes augmented by uncertainty-aware optimization and multi-scale consistency objectives (Huang et al., 5 Feb 2025, Ma et al., 4 Apr 2025, Wu et al., 5 Jun 2025).
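
As a concrete illustration, a combined objective of this kind might look as follows; the weighting coefficients and focusing parameter are placeholder assumptions rather than values reported in the cited papers:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target, w_dice=1.0, w_bce=1.0, w_focal=1.0,
                      gamma=2.0, eps=1e-6):
    """Dice + BCE + focal loss on mask logits; weights are illustrative."""
    prob = torch.sigmoid(logits)
    # Soft Dice loss, computed per sample and averaged over the batch.
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    # Pixel-wise binary cross-entropy.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    # Focal loss: down-weight easy pixels by (1 - p_t)^gamma.
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    focal = ((1.0 - p_t) ** gamma) * bce
    return w_dice * dice.mean() + w_bce * bce.mean() + w_focal * focal.mean()

logits = torch.randn(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = segmentation_loss(logits, target)
```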

2. Memory Mechanisms and Propagation Strategies

MedSAM-2 exploits video-object-tracking analogies, treating volumetric or batched image segmentation as a frame-wise propagation problem. Central innovations include:

  • Self-sorting (Confidence-first) Memory Bank: Rather than FIFO storage, the templates with the highest confidence and maximal feature dissimilarity are preserved, ensuring the memory is both reliable and diverse (Zhu et al., 1 Aug 2024, Yan et al., 4 Jul 2025); a minimal sketch follows this list.
  • Short-Long Memory: Introduction of two separate banks—short-term (adapts to abrupt changes, e.g. object boundary appearances/disappearances) and long-term (preserves stable anatomical context)—with dedicated attention modules and fusion mechanisms (Chen et al., 3 May 2025).
  • Temporal Adapter and Confidence-driven Retrieval: Temporal adapters in the image encoder efficiently aggregate context across contiguous slices or frames, while confidence-driven retrieval and selective memory replacement retain only high-certainty exemplars, which helps mitigate noise and continual learning challenges (Yan et al., 4 Jul 2025).
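
The following is a minimal sketch of a confidence-first memory bank, assuming a cosine-similarity criterion for diversity and a fixed capacity; the thresholds are illustrative, not values from the cited papers:

```python
import torch
import torch.nn.functional as F

class ConfidenceFirstMemoryBank:
    """Self-sorting memory: keep high-confidence, mutually dissimilar
    templates instead of evicting in FIFO order."""

    def __init__(self, capacity: int = 8, sim_threshold: float = 0.9):
        self.capacity = capacity
        self.sim_threshold = sim_threshold
        self.entries = []  # list of (confidence, feature) tuples

    def add(self, feature: torch.Tensor, confidence: float) -> None:
        # Reject templates too similar to ones already stored (diversity).
        for _, stored in self.entries:
            if F.cosine_similarity(feature, stored, dim=0) > self.sim_threshold:
                return
        self.entries.append((confidence, feature))
        # Sort by confidence and drop the least reliable entry if over capacity.
        self.entries.sort(key=lambda e: e[0], reverse=True)
        self.entries = self.entries[: self.capacity]

    def as_tokens(self) -> torch.Tensor:
        # Stack stored features for use as keys/values in memory attention.
        return torch.stack([f for _, f in self.entries])

bank = ConfidenceFirstMemoryBank(capacity=4)
for conf in (0.95, 0.60, 0.88, 0.99, 0.40):
    bank.add(torch.randn(256), conf)
memory_tokens = bank.as_tokens()  # (n_kept, 256), highest confidence first
```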

These memory-based modules are particularly critical in 3D medical segmentation, where the differences between adjacent slices can be substantial, and anatomical boundaries may be ambiguous or absent in some frames.

3. Prompt Automation, Semantic Labeling, and Supervision

MedSAM-2 tackles the user interaction and semantic labeling bottleneck via multiple mechanisms:

  • Learnable Prompt Embeddings: Lightweight modules trained on weak, few-shot supervision (e.g., tight bounding boxes) generate dense and sparse prompt embeddings from image features in a plug-and-play fashion, enabling segmentation automation with minimal annotation (Gaillochet et al., 30 Sep 2024, Xie et al., 4 Feb 2025); a sketch of such a module follows this list.
  • Support-Set Guided Prompting and Pseudo-mask Generation: The support set approach retrieves similar annotated examples, generates pseudo-masks via attention alignment, and auto-derives bounding box prompts, eliminating the need for explicit human annotation at inference (Xing et al., 24 Jun 2025).
  • Diffusion-based Class Prompt Encoders: Advanced models such as AutoMedSAM use diffusion processes to encode class labels as global and local prompt embeddings, enabling fully automated segmentation with direct organ or lesion semantic labeling (Huang et al., 5 Feb 2025). These approaches decouple mask generation from low-level annotation and produce task-relevant, interpretable results.
  • Semi-supervised Prompt Generation: Physical constraints with sliding window (PCSW) and spatial continuity rules are used to extract pseudo-prompts from unlabeled data, allowing semi-supervised learning regimes that combine limited annotation with large unlabeled datasets (Zhu et al., 10 Jun 2025).
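
A minimal sketch of a learnable prompt module is shown below. It predicts both dense and sparse prompt embeddings directly from encoder features, so no clicks or boxes are needed at inference; the specific layer choices are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class LearnablePromptModule(nn.Module):
    """Plug-and-play prompt generator: image features in, prompt
    embeddings out, replacing manual point/box encoding."""

    def __init__(self, feat_dim: int = 256, n_sparse_tokens: int = 4):
        super().__init__()
        # Dense prompt: a per-location embedding map on the feature grid.
        self.dense_head = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # Sparse prompt: a few learned tokens refined by pooled image context.
        self.sparse_tokens = nn.Parameter(torch.randn(n_sparse_tokens, feat_dim))
        self.context_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, image_feats):
        dense = self.dense_head(image_feats)                # (B, C, H, W)
        context = image_feats.mean(dim=(2, 3))              # (B, C) pooled
        sparse = self.sparse_tokens.unsqueeze(0) + \
                 self.context_proj(context).unsqueeze(1)    # (B, T, C)
        return sparse, dense

prompt_module = LearnablePromptModule()
feats = torch.randn(2, 256, 32, 32)   # encoder features for 2 slices
sparse, dense = prompt_module(feats)  # drop-in for manual prompt encodings
```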

These advances allow MedSAM-2 to operate at scale with dramatically reduced annotation cost, facilitating universal and adaptive segmentation pipelines.

4. Performance Evaluation and Benchmarking

MedSAM-2 and its variants have been extensively benchmarked on challenging medical datasets, spanning modalities (CT, MRI, PET, X-ray, ultrasound, fundus, endoscopy, echocardiography, and microscopy), anatomical sites, and pathology types (Zhu et al., 1 Aug 2024, Ma et al., 4 Apr 2025, Yan et al., 4 Jul 2025). Key quantitative metrics include:

  • Dice Similarity Coefficient (DSC):

\text{DSC} = \frac{2|A \cap B|}{|A| + |B|}

where A is the predicted segmentation mask and B is the ground-truth mask; a direct transcription into code follows this list.

  • Hausdorff Distance and Normalized Surface Distance (NSD): boundary-focused metrics that quantify worst-case contour deviation and surface agreement within a tolerance, complementing region-overlap measures such as DSC.
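
For binary masks, the DSC formula above translates directly into code:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray,
                     eps: float = 1e-8) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for boolean masks A (pred) and B (truth);
    eps guards against division by zero when both masks are empty."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)
```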

Representative results:

| Model    | Dataset         | DSC           | Annotation Reduction |
|----------|-----------------|---------------|----------------------|
| MedSAM-2 | BTCV            | 0.8857        | –                    |
| MedSAM-2 | LA segmentation | 0.81 ± 0.05   | >85% cost reduction  |
| SAMed-2  | MedBank-100k    | 0.6938–0.7118 | 87.6% time saved     |
| SAM2-SGP | AMOS22 (CT)     | 0.903         | –                    |

5. Adaptation to Medical Domain and Generalization

Because of the domain shift between natural and medical images, MedSAM-2 employs several adaptation strategies:

  • Fine-tuning and Low-Rank Adaptation (LoRA): Instead of retraining the entire model, LoRA modules are integrated to enable efficient, stable adaptation of the image encoder to medical texture, channel, and contrast distributions (Xing et al., 24 Jun 2025, Huang et al., 5 Feb 2025); a minimal LoRA sketch follows this list.
  • Test-Time Adaptation (SAM-TTA): Unsupervised, inference-time adaptation modules—such as Self-adaptive Bezier Curve-based Transformation (SBCT) and Dual-scale Uncertainty-driven Mean Teacher (DUMT)—mitigate both input-level and semantic-level discrepancies, improving performance on previously unseen or single-channel modalities (Wu et al., 5 Jun 2025).
  • Confidence-driven Memory and Continual Learning: Memory pruning and selective updating help avoid catastrophic forgetting and preserve accumulated knowledge as the model is exposed to new data or tasks (Yan et al., 4 Jul 2025).
  • Semi-supervised and Few-shot Learning: SSS and related models leverage weak supervision, prompt synthesis, and discriminative feature enhancement to support robust segmentation in limited-label, high-unlabeled-data regimes (Zhu et al., 10 Jun 2025).
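
A minimal LoRA sketch is shown below: the frozen pretrained weight W is augmented by a trainable low-rank update scaled by alpha/r, so only the small A and B matrices receive gradients. The rank and scaling values here are placeholder assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual:
    y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the low-rank update (zero at initialization,
        # since lora_b starts at zero).
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Wrap, e.g., an attention projection inside the image encoder.
proj = nn.Linear(256, 256)
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(4, 256))  # only lora_a / lora_b receive gradients
```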

6. Interactive, Automated, and Human-in-the-loop Workflows

MedSAM-2’s prompt and memory infrastructure facilitates a range of practical workflows for research and clinical settings:

  • One-Prompt and Eye Gaze-based Segmentation: One-prompt mode enables propagation of segmentation across large datasets from a single user interaction, while gaze-based interfaces exploit eye-tracking to generate prompts rapidly and non-intrusively, offering a trade-off between efficiency and peak accuracy (Shmykova et al., 21 May 2025, Zhu et al., 1 Aug 2024); a sketch of one-prompt propagation follows this list.
  • Human-in-the-loop Annotation: MedSAM-2 pipelines integrated with platforms like 3D Slicer, Gradio, and JupyterLab support iterative annotation, user correction, and batch processing, resulting in annotation time reductions exceeding 85% in large-scale studies (e.g., CT lesions, MRI liver tumors, echocardiography videos) (Ma et al., 4 Apr 2025).
  • Fully Automated Pipelines: With learned or class-based prompt modules and pseudo-mask attention, MedSAM-2 can produce semantic segmentations with no human interaction, broadening accessibility for non-expert deployment (Huang et al., 5 Feb 2025, Xing et al., 24 Jun 2025).
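
The one-prompt workflow can be sketched as a simple propagation loop: the user prompts only the first slice, and every later slice is conditioned on a memory of confident earlier predictions. The `segment_fn` and `confidence_fn` callables below stand in for the model and its confidence estimate and are assumptions of this sketch:

```python
import torch

def propagate_volume(slices, first_prompt, segment_fn, confidence_fn,
                     memory_limit=8, conf_threshold=0.5):
    """One-prompt propagation through a volume: prompt slice 0 only,
    then carry confident masks forward as memory."""
    memory, masks = [], []
    prompt = first_prompt
    for img in slices:
        mask = segment_fn(img, prompt, memory)    # prompt is None after slice 0
        if confidence_fn(mask) > conf_threshold:  # keep only confident results
            memory = (memory + [mask])[-memory_limit:]
        masks.append(mask)
        prompt = None                             # no further user interaction
    return masks

# Toy stand-ins: a thresholding "model" and mean-foreground confidence.
volume = [torch.rand(1, 64, 64) for _ in range(10)]
segment = lambda img, prompt, mem: (img > 0.5).float()
confidence = lambda m: float(m.mean())
masks = propagate_volume(volume, first_prompt=(32, 32), segment_fn=segment,
                         confidence_fn=confidence, conf_threshold=0.3)
```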

7. Future Directions and Open Challenges

Current and anticipated developments for MedSAM-2 focus on:

  • Enhanced Prompt Diversity: Research is investigating alternate prompt modalities (text, scribble, lasso, multi-point), improved 3D/4D spatial modeling, and adaptively expanding memory capacity (Ma et al., 4 Apr 2025, Chen et al., 3 May 2025).
  • Robustness and Generalization: Strategic adoption of test-time adaptation, domain-invariant training, self-supervision, and multi-task continual learning for translational robustness across institutions and imaging protocols (Wu et al., 5 Jun 2025, Yan et al., 4 Jul 2025).
  • Efficiency and Edge Deployment: Ongoing optimization for inference on limited hardware through model compression, quantization, and knowledge distillation.
  • Automated and Semantic Annotation: Broader adoption of diffusion-based class prompt learning, automated pseudo-label generation, and hybrid prompt strategies for truly universal, interpretable clinical segmentation workflows (Huang et al., 5 Feb 2025, Xing et al., 24 Jun 2025).
  • Integration with Clinical Workflows: Expansion into assisted diagnosis, surgical planning, and real-time guidance via user-friendly, validated tools integrated with radiological and diagnostic systems (Ma et al., 4 Apr 2025, Mehrnia et al., 8 Nov 2024).

These directions highlight MedSAM-2 as a central component of the emerging ecosystem of medical segmentation foundation models, with active code repositories and datasets supporting open science and collaborative development across the medical AI community.
