
MedSAM-3: Promptable Medical Segmentation Model

Updated 26 November 2025
  • MedSAM-3 is a promptable text-guided segmentation model designed for accurate delineation in diverse medical images and videos.
  • It fine-tunes the SAM 3 architecture with a dedicated PCS head to integrate semantic and geometric prompts, enhancing precision and generalizability.
  • The model employs an agentic workflow with multimodal LLMs for iterative mask refinement, significantly boosting segmentation performance in clinical tasks.

MedSAM-3 is a promptable text-guided medical image and video segmentation model developed by fine-tuning the Segment Anything Model (SAM) 3 architecture on medical datasets with semantic concept supervision. It introduces a Promptable Concept Segmentation (PCS) head for open-vocabulary guidance and extends to an agentic framework via multimodal LLMs (MLLMs), enabling an agent-in-the-loop workflow with iterative mask refinement. MedSAM-3 achieves substantial gains in accuracy and generalizability across 14 diverse medical imaging modalities and establishes the feasibility of integrating multimodal reasoning into clinical segmentation tasks (Liu et al., 24 Nov 2025).

1. Model Architecture and Methodological Framework

MedSAM-3 is architecturally derived from SAM 3, which consists of a frozen Perception Encoder (PE) composed of a ViT‐B/16 visual backbone and a CLIP-style text encoder, a Detector Transformer that fuses image features with geometric and conceptual prompt embeddings, and a Mask Decoder employing cross-attention to produce segmentation masks. For video, a Tracker with streaming memory blocks supports temporal mask tracking.

The adaptation to MedSAM-3 involves "lightweight" fine-tuning: only the Detector and Mask Decoder are updated on medical data, paired with semantic clinical concept labels, while the PE remains frozen. This preserves general vision–language features while specializing the network to medical semantic segmentation (Liu et al., 24 Nov 2025).
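
A minimal PyTorch-style sketch of this selective fine-tuning is given below; the submodule names (perception_encoder, detector, mask_decoder) and the optimizer choice are illustrative assumptions rather than the released training code.

import torch

def configure_lightweight_finetuning(model, lr=1e-4):
    """Freeze the Perception Encoder; train only the Detector and Mask Decoder."""
    # Keep the general vision-language features of the PE fixed.
    for p in model.perception_encoder.parameters():
        p.requires_grad = False
    # Only Detector and Mask Decoder parameters are updated on medical data.
    trainable = list(model.detector.parameters()) + list(model.mask_decoder.parameters())
    return torch.optim.AdamW(trainable, lr=lr)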

A major architectural innovation is the PCS head, which comprises the following components (a minimal code sketch follows the list):

  • A 2-layer MLP that projects the concept embedding from the frozen text encoder ($\mathbb{R}^{D_{\text{text}}} \to \mathbb{R}^{D_{\text{proj}}}$, with $D_{\text{proj}} \approx 256$)
  • Insertion of the projected embedding as a special “CLS-mask” token within the set of $K$ learnable mask queries
  • Cross-attention layers that enable joint attention over image tokens and concept + geometric prompt tokens
  • A convolutional mask head for upsampling to final mask resolution
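
A compact PyTorch sketch of such a head is given below; the module sizes, the feature-conditioning step, and the omission of geometric prompt tokens are simplifying assumptions for exposition, not the released implementation.

import torch
import torch.nn as nn

class PCSHead(nn.Module):
    """Illustrative PCS head sketch (not the official MedSAM-3 code)."""

    def __init__(self, d_text=512, d_proj=256, num_queries=8, num_heads=8):
        super().__init__()
        # 2-layer MLP projecting the frozen text-encoder embedding.
        self.concept_proj = nn.Sequential(
            nn.Linear(d_text, d_proj), nn.GELU(), nn.Linear(d_proj, d_proj))
        # K learnable mask queries; the projected concept acts as an extra "CLS-mask" token.
        self.mask_queries = nn.Parameter(torch.randn(num_queries, d_proj))
        self.cross_attn = nn.MultiheadAttention(d_proj, num_heads, batch_first=True)
        # Convolutional head upsampling to the output mask resolution.
        self.mask_head = nn.Sequential(
            nn.Conv2d(d_proj, d_proj // 2, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(d_proj // 2, 1, 1))

    def forward(self, image_tokens, concept_embedding, h, w):
        # image_tokens: (B, h*w, d_proj); concept_embedding: (B, d_text)
        b, n, d = image_tokens.shape
        cls_mask = self.concept_proj(concept_embedding).unsqueeze(1)   # (B, 1, d)
        queries = self.mask_queries.unsqueeze(0).expand(b, -1, -1)
        queries = torch.cat([cls_mask, queries], dim=1)                # insert CLS-mask token
        attended, _ = self.cross_attn(queries, image_tokens, image_tokens)
        # Condition the image feature map on the attended CLS-mask token.
        feat = (image_tokens * attended[:, :1]).transpose(1, 2).reshape(b, d, h, w)
        return self.mask_head(feat)                                    # (B, 1, 4h, 4w) mask logits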

The training objective combines pixelwise binary cross-entropy (BCE) and Dice losses:

\mathcal{L}_{\text{BCE}}(P, G) = -\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl[G_{ij}\log P_{ij} + (1-G_{ij})\log(1-P_{ij})\bigr]

\text{Dice}(P, G) = \frac{2\sum_{i,j} P_{ij} G_{ij}}{\sum_{i,j} P_{ij} + \sum_{i,j} G_{ij} + \varepsilon}, \qquad \mathcal{L}_{\text{Dice}} = 1 - \text{Dice}(P, G)

\mathcal{L} = \lambda_{\text{BCE}}\,\mathcal{L}_{\text{BCE}} + \lambda_{\text{Dice}}\,\mathcal{L}_{\text{Dice}}, \qquad \lambda_{\text{BCE}} = \lambda_{\text{Dice}} = 1

This combination balances pixel-wise classification accuracy (BCE) with region-level overlap (Dice) between prediction and ground truth (Liu et al., 24 Nov 2025).
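
A compact PyTorch rendering of this objective, assuming sigmoid-activated logits and binary ground-truth masks of shape (B, 1, H, W):

import torch
import torch.nn.functional as F

def medsam3_loss(logits, target, eps=1e-6, w_bce=1.0, w_dice=1.0):
    """Pixel-wise BCE plus Dice loss, with lambda_BCE = lambda_Dice = 1 by default."""
    prob = torch.sigmoid(logits)                                   # P in the equations above
    bce = F.binary_cross_entropy(prob, target)                     # mean over all pixels
    inter = (prob * target).sum(dim=(1, 2, 3))
    dice = 2 * inter / (prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)
    return w_bce * bce + w_dice * (1.0 - dice).mean()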

2. Text Promptable Concept Segmentation

MedSAM-3 enables open-vocabulary segmentation via integration of text and geometric prompts. The model leverages the CLIP-style text encoder, wherein any medical phrase (e.g., “optic disc”, “lung tumor”) is projected and treated as a prompt token alongside geometric inputs (points/boxes). Geometric prompts use spatial positional encodings to define a region-of-interest, while text prompts convey semantic intent; both are fused within the cross-attention mechanism of the PCS head.

MedSAM-3 supports:

  • Text-only inference (MedSAM-3 T): segmentation is guided exclusively by semantic grounding in the input phrase.
  • Text+geometric inference (MedSAM-3 T+I): combines semantic and spatial priors, yielding robust and precise mask delineation.

Empirical findings demonstrate that the text+box (T+I) mode substantially outperforms text-only and geometric-only inference, especially for structures with ambiguous boundaries or varying appearances (Liu et al., 24 Nov 2025).
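
The snippet below sketches how the two modes might be invoked; the segment wrapper, the model.predict signature, and the example names (medsam3, ultrasound_image) are hypothetical, not the published interface.

def segment(model, image, text, box=None, points=None):
    """Run a single inference in text-only (T) or text+geometric (T+I) mode."""
    prompts = {"text": text}                  # semantic intent, e.g. "optic disc"
    if box is not None:
        prompts["box"] = box                  # (x0, y0, x1, y1) region of interest
    if points is not None:
        prompts["points"] = points            # [(x, y, fg/bg label), ...] clicks
    return model.predict(image, **prompts)

# MedSAM-3 T: semantic grounding alone.
# mask_t = segment(medsam3, ultrasound_image, text="breast tumor")
# MedSAM-3 T+I: semantic intent plus a spatial prior.
# mask_ti = segment(medsam3, ultrasound_image, text="breast tumor", box=(120, 80, 260, 210))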

3. Agentic Workflow with Multimodal LLMs

The MedSAM-3 Agent framework integrates an MLLM (e.g., Gemini 3 Pro) with MedSAM-3 for agent-in-the-loop segmentation. The agent receives a clinical instruction and image input, reasons about segmentation subtasks, and generates prompts for MedSAM-3. The output masks are iteratively evaluated and refined through agent reasoning:

Input: Image I, Query Q
Initialize context C ← [I, Q]
Masks ← []
repeat T times:
  action ← MultimodalLLM.plan(C)
  if action.type == "segment":
    prompt ← action.prompt    # e.g. "segment the liver tumor"
    M ← MedSAM3.segment(I, prompt)
    Masks.append(M)
    feedback ← evaluate_mask(M, ground_truth=None)
    C.append((M, feedback))
  else if action.type == "finish":
    break
return Masks

Typically, 2–3 agentic refinement loops achieve marked improvement (e.g., Dice coefficient on BUSI from 0.7772 to 0.8064). The MLLM enables reasoning over complex structures, iterative error diagnosis, and refinement strategies by fusing multimodal feedback (Liu et al., 24 Nov 2025).

4. Experimental Design and Benchmarking

MedSAM-3 was extensively validated across 14 public datasets and multiple medical image modalities, including:

  • 2D X-ray (COVID-QU-Ex)
  • Ultrasound (BUSI)
  • OCT (GOALS)
  • Fundus images (RIM-ONE)
  • Dermoscopy (ISIC 2018)
  • Histopathology (MoNuSeg)
  • Fluorescence microscopy (DSB 2018)
  • Infrared (RAVIR)
  • Endoscopy (Kvasir-SEG)
  • Video sequences (PolypGen)
  • 3D CT (Parse2022, LiTS)
  • 3D MRI (PROMISE12, ISLES 2024)

Training and evaluation followed official challenge splits where available, otherwise utilizing an 80/20 stratified split. Standard image augmentations (random flip, ±15° rotation, intensity jitter, cropping) and domain-specific fine-tuning datasets were applied to simulate few-shot generalization to new medical contexts (Liu et al., 24 Nov 2025).
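
An illustrative data-preparation sketch of the stratified fallback split and the listed augmentations, using albumentations and scikit-learn as assumed tooling choices (the paper does not specify libraries):

import albumentations as A
from sklearn.model_selection import train_test_split

def make_split_and_augment(case_ids, case_labels, crop=512):
    """80/20 stratified split plus the listed train-time augmentations."""
    train_ids, val_ids = train_test_split(
        case_ids, test_size=0.2, stratify=case_labels, random_state=0)
    augment = A.Compose([
        A.HorizontalFlip(p=0.5),                   # random flip
        A.Rotate(limit=15, p=0.5),                 # rotation within ±15°
        A.RandomBrightnessContrast(p=0.5),         # intensity jitter
        A.RandomCrop(height=crop, width=crop),     # cropping
    ])
    return train_ids, val_ids, augment

# Applied jointly to image and mask, e.g.:
# out = augment(image=image, mask=mask); image, mask = out["image"], out["mask"]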

5. Quantitative and Qualitative Performance

The benchmark results summarized below (Dice coefficients) show that MedSAM-3 establishes state-of-the-art performance among generalist and specialist models across core 2D benchmarks:

Method         BUSI     RIM-ONE (cup)  ISIC 2018  Kvasir-SEG
U-Net          0.7618   0.8480         0.8760     0.8244
MedSAM         0.7514   0.8479         0.9177     0.7657
SAM 3 T        0.0000   0.0000         0.2189     0.0000
SAM 3 T+I      0.7110   0.8303         0.8178     0.7671
MedSAM-3 T     0.2674   0.0826         0.5687     0.1441
MedSAM-3 T+I   0.7772   0.8977         0.9058     0.8831

Inclusion of the agentic MLLM loop further improves performance (BUSI Dice: 0.8064 with MedSAM-3 Agent and Gemini 3 Pro).

3D segmentation tasks remain challenging in text-only regimes; on LiTS, PROMISE12, and ISLES 2024, MedSAM-3 T scores well below leading 3D specialist models (e.g., nn-U-Net Dice 0.7714 vs. SAM 3 T 0.1374).

Qualitative analysis highlights that MedSAM-3 T+I recovers irregular, low-contrast, or small targets, outperforming SAM 3. However, failures persist in text-only mode (empty or imprecise masks) and with over-general or semantically ambiguous prompts (Liu et al., 24 Nov 2025).

6. Prompt Robustness in Preceding MedSAM Variants

Preceding MedSAM variants exhibit limitations in segmenting small tissues and in robustness to tight bounding-box prompts. An adaptive perturbation strategy, involving parameterized shrinking and expanding of box prompts during training, was shown to reduce tiny-tissue inclusion errors from 18% to 2% and to improve Dice coefficients by ~17 points under reduced bounding-box conditions (Li et al., 25 Mar 2025). The perturbation is governed by a global similarity $\lambda$ and schedulable factors $E_{\text{shrink}}$ and $S_{\text{expand}}$ conditioned on object size, yielding robust scale- and location-invariant performance. This informs continued development of prompt-tolerant segmentation architectures.
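
The sketch below shows box-prompt perturbation in the spirit of that strategy; the simple uniform jitter range stands in for the paper's $\lambda$, $E_{\text{shrink}}$, and $S_{\text{expand}}$ schedules, which are not reproduced here.

import random

def perturb_box(box, shrink_max=0.1, expand_max=0.1):
    """Randomly tighten or loosen an (x0, y0, x1, y1) box prompt by a fraction
    of the object's extent, so the model learns tolerance to imperfect boxes."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    def jitter(extent):
        # Negative draws shrink toward the object; positive draws expand outward.
        return random.uniform(-shrink_max, expand_max) * extent
    return (x0 - jitter(w), y0 - jitter(h), x1 + jitter(w), y1 + jitter(h))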

7. Contributions, Limitations, and Future Directions

MedSAM-3 introduces the first promptable concept segmentation model for medical images, obtained by fine-tuning SAM 3 and adding a dedicated PCS head for open-vocabulary prompts. The MedSAM-3 Agent exemplifies the synergistic integration of segmentation backbones with MLLMs for agentic, multimodal clinical workflows. Evaluated across a broad spectrum of modalities, MedSAM-3 achieves superior accuracy and generalization.

Key limitations include restricted granularity due to concept label coverage, weaker performance for text-only prompts compared to text+box input, and agentic workflow latency or LLM dependency. Future efforts are proposed in scaling concept supervision (mining reports, extended ontologies), uncertainty-aware agentic loops, and holistic 3D promptable segmentation (Liu et al., 24 Nov 2025).

MedSAM-3 represents a convergence of foundation vision–LLMs and clinical precision, with implications for data-efficient, flexible, and semantically faithful medical segmentation pipelines.
