MedSAM-3 Agent: LLM-Enhanced Segmentation
- MedSAM-3 Agent is an agent-in-the-loop segmentation framework that integrates a medically fine-tuned SAM 3 backbone with a Multimodal LLM for text-driven iterative segmentation.
- It employs a dual-encoder–decoder architecture with a detector, tracker, and memory module to achieve precise segmentation across 2D and video modalities.
- The system leverages open-vocabulary clinical prompts and iterative mask refinement to overcome annotation challenges and enhance performance in diverse medical imaging tasks.
MedSAM-3 Agent is an agent-in-the-loop segmentation framework integrating a medically fine-tuned Segment Anything Model (SAM 3) backbone with a Multimodal LLM (MLLM) for iterative, text-driven segmentation of medical images and video. The agent achieves medical Promptable Concept Segmentation (PCS), leveraging open-vocabulary clinical prompts and iterative mask refinement to address generalizability and annotation barriers endemic to medical image analysis (Liu et al., 24 Nov 2025).
1. System Architecture
MedSAM-3 Agent builds on a dual-encoder–decoder paradigm with an MLLM-driven agent loop. The modular architecture includes:
- Perception Encoder (PE): Employs a frozen vision backbone (ViT/ResNet) and text encoder to produce aligned image ($z_I$) and prompt ($z_p$) embeddings.
- Detector Head: A transformer decoder stack that fuses the PE outputs into segmentation mask logits $\hat{M}$ and soft masks $\sigma(\hat{M})$, fine-tuned on paired medical image–concept data.
- Tracker (+Memory): For video segmentation, a memory bank maintains temporal frame features with self- and cross-attention, inherited from MedSAM2.
- MLLM Reasoning Module: An external MLLM (e.g., Gemini 3 Pro) parses high-level queries, formulates or refines prompts, evaluates returned masks, and iteratively interacts with MedSAM-3 until task completion.
The agent pipeline, as diagrammed in Figure 1 and Figure 2 of the paper, orchestrates data flow as follows: an input image or video frame is embedded into $z_I$; a text prompt $p_t$ yields embedding $z_p$; combined via the detector head, mask logits $\hat{M}$ are predicted; the resulting mask $m_t$ is returned to the agent; the MLLM inspects $m_t$, updates its context, and produces a refined prompt $p_{t+1}$, looping until convergence (Liu et al., 24 Nov 2025).
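A minimal sketch of this forward data flow, assuming hypothetical `encoder` and `detector` callables as stand-ins for the frozen PE and fine-tuned detector head (the paper does not publish this API):

```python
import torch

def segment_once(image: torch.Tensor, prompt: str, encoder, detector) -> torch.Tensor:
    """One PCS forward pass: image + text prompt -> soft mask.
    `encoder` and `detector` are illustrative stand-ins, not released components."""
    z_i = encoder.embed_image(image)   # z_I: frozen vision embedding
    z_p = encoder.embed_text(prompt)   # z_p: frozen text embedding
    logits = detector(z_i, z_p)        # M-hat: mask logits from the detector head
    return torch.sigmoid(logits)       # soft mask in [0, 1]
```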
2. Promptable Concept Segmentation (PCS) Formulation
PCS represents the shift from geometric (point/box) to open-vocabulary text prompts for region-of-interest targeting:
- $I$: input medical image/frame sequence.
- $p$: short noun-phrase text prompt (a few words).
- $E_I$, $E_T$: frozen image/text encoders.
- $z_I = E_I(I)$, $z_p = E_T(p)$.
- Detector Head: $D_\theta$ with parameters $\theta$ fuses $(z_I, z_p)$, yielding mask logits $\hat{M} = D_\theta(z_I, z_p)$.
- Mask Prediction: $\hat{Y} = \sigma(\hat{M})$, where $\sigma$ is the sigmoid.

Supervision uses a composite loss

$$\mathcal{L}(\hat{Y}, Y) = \mathcal{L}_{\mathrm{Dice}}(\hat{Y}, Y) + \lambda\,\mathcal{L}_{\mathrm{BCE}}(\hat{Y}, Y),$$

with $\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i \hat{Y}_i Y_i}{\sum_i \hat{Y}_i + \sum_i Y_i}$ and $\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_i \big[Y_i \log \hat{Y}_i + (1 - Y_i)\log(1 - \hat{Y}_i)\big]$.

The fine-tuning objective is

$$\theta^\ast = \arg\min_\theta\; \mathbb{E}_{(I, p, Y)}\big[\mathcal{L}\big(\sigma(D_\theta(E_I(I), E_T(p))), Y\big)\big].$$
This approach decouples conceptual segmentation from pure geometry, enabling direct mapping from free-form clinical phrases to anatomical mask predictions (Liu et al., 24 Nov 2025).
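A minimal PyTorch rendering of this composite objective, assuming the standard soft-Dice and binary cross-entropy forms (the weighting $\lambda$ is an assumed hyperparameter, not a value reported in the paper):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on sigmoid probabilities; `target` is a binary mask."""
    probs = torch.sigmoid(logits).flatten(1)
    target = target.float().flatten(1)
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def pcs_loss(logits: torch.Tensor, target: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Composite PCS objective L = L_Dice + lam * L_BCE; `lam` is an assumed weight."""
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    return dice_loss(logits, target) + lam * bce
```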
3. Agent Integration and Reasoning Loop
MedSAM-3 Agent operationalizes agentic refinement through tightly coupled LLM–segmenter interaction, formalized as:
Algorithm 1: MedSAM-3 Agent Workflow
- Input: image/video $I$, clinical query $q$
- Initialize memory $\mathcal{M} \leftarrow \emptyset$, round $t \leftarrow 0$, initial prompt $p_0 \leftarrow \mathrm{MLLM.plan}(q)$
- Repeat (until converged or $t = T_{\max}$):
  - $m_t \leftarrow \mathrm{MedSAM\text{-}3.segment}(I, p_t)$
  - $(f_t, c_t) \leftarrow \mathrm{MLLM.evaluate}(I, m_t, q, \mathcal{M})$, yielding feedback and confidence
  - If $c_t \geq \tau$, break
  - $p_{t+1} \leftarrow \mathrm{MLLM.refine}(p_t, f_t)$; update $\mathcal{M}$; $t \leftarrow t + 1$
- Return final set of masks $\{m_0, \dots, m_t\}$
The MLLM sequentially plans prompts, evaluates segmentation outputs, provides feedback (e.g., mask coverage, semantic fit), and iteratively sharpens task execution. Termination occurs upon reaching an internal confidence threshold $\tau$, a plateau in predicted IoU change below $\epsilon$, or $T_{\max}$ rounds (usually 3). This loop enhances open-vocabulary alignment and reduces systematic segmentation errors (Liu et al., 24 Nov 2025).
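A runnable sketch of this workflow under stated assumptions: the `mllm` object (with `plan`, `evaluate`, and `refine` methods) and the `segment` callable are hypothetical stand-ins for the reasoning module and the MedSAM-3 backbone:

```python
def agent_loop(image, query, mllm, segment, tau: float = 0.9, t_max: int = 3):
    """Agent-in-the-loop refinement: plan -> segment -> evaluate -> refine.
    `tau` and `t_max` mirror the defaults reported in the paper; the
    `mllm`/`segment` interfaces are illustrative assumptions."""
    prompt = mllm.plan(query)                    # p_0 from the clinical query
    masks = []
    for t in range(t_max):
        mask = segment(image, prompt)            # m_t <- MedSAM-3.segment(I, p_t)
        masks.append(mask)
        feedback, confidence = mllm.evaluate(image, mask, query)
        if confidence >= tau:                    # internal confidence threshold met
            break
        prompt = mllm.refine(prompt, feedback)   # p_{t+1}
    return masks
```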
4. Training Procedures and Modalities
MedSAM-3 Agent’s components are trained with the following regimen:
- Encoder Freezing: The PE (vision and text encoders) is kept frozen; only the detector head is fine-tuned (see the sketch below).
- Prompt Regimes: Two paradigms—MedSAM-3 T (text-only prompts), MedSAM-3 T+I (text plus GT bounding box prompts).
- Modalities: Evaluation spans BUSI Ultrasound (US), RIM-ONE fundus, ISIC skin, and Kvasir endoscopy (all 2D). Video tracking leverages the memory module.
- Optimization: AdamW optimizer, batch size 16, on A100 GPUs.
- Losses: as in the PCS formulation, $\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda\,\mathcal{L}_{\mathrm{BCE}}$.
- A plausible implication is that decoupling visual backbone and detector head adaptation facilitates domain transfer while minimizing computational overhead.
This paradigm establishes robust medical segmentation in low-data or domain-shifted contexts (Liu et al., 24 Nov 2025).
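A PyTorch sketch of this freezing-and-fine-tuning regimen; the module attribute names (`image_encoder`, `text_encoder`, `detector_head`) and the learning rate are illustrative assumptions, not the paper's released configuration:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module,
                             lr: float = 1e-4,
                             weight_decay: float = 0.01) -> torch.optim.AdamW:
    """Freeze the perception encoder; hand only detector-head parameters to AdamW.
    Attribute names and hyperparameter values are assumptions for illustration."""
    for module in (model.image_encoder, model.text_encoder):
        for p in module.parameters():
            p.requires_grad = False              # PE stays fixed
    head_params = list(model.detector_head.parameters())
    return torch.optim.AdamW(head_params, lr=lr, weight_decay=weight_decay)
```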
5. Agent-in-the-Loop Refinement and Convergence
The agent refines segmentation via an MLLM-driven workflow:
- Post-mask evaluation includes approximated IoU, boundary checks, and semantic region analysis.
- If essential regions are missed or the mask over-covers, the agent suggests prompt edits (e.g., “focus on the left lobe,” “shrink the boundary by 3 pixels”).
- Convergence is declared when: (a) the predicted IoU change falls below $\epsilon$ over two consecutive rounds; (b) MLLM confidence exceeds $\tau$ (e.g., $\tau = 0.9$); or (c) the round count reaches $T_{\max}$ (typically 3); see the stopping-rule sketch after this list.
- This suggests that dynamic prompt refinement, guided by both semantic (language) and geometric (mask-evaluation) feedback, enables behaviors for complex anatomical structures that are classically hard to script.
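These criteria can be collapsed into a single stopping predicate; `tau` and `t_max` mirror the defaults above, while the plateau tolerance `eps` is an assumed value:

```python
def converged(iou_history: list[float], confidence: float, t: int,
              eps: float = 0.01, tau: float = 0.9, t_max: int = 3) -> bool:
    """Stopping rule per criteria (a)-(c); `eps` is an assumed tolerance."""
    plateau = len(iou_history) >= 2 and abs(iou_history[-1] - iou_history[-2]) < eps
    return plateau or confidence >= tau or t >= t_max
```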
6. Quantitative Performance and Comparative Metrics
MedSAM-3 Agent outperforms both specialist architectures and generalist models across diverse medical imaging modalities. Representative Dice similarity coefficients (higher is better; the metric is sketched in code at the end of this section):
| Modality | Model | Dice Score |
|---|---|---|
| X-ray (COVID-QU-Ex) | U-Net | 0.7880 |
| X-ray (COVID-QU-Ex) | UNet3+ | 0.7928 |
| X-ray (COVID-QU-Ex) | SAM 3 T+I | 0.7405 (↓0.05) |
| Ultrasound (BUSI) | U-Net | 0.7618 |
| Ultrasound (BUSI) | UNet3+ | 0.7782 |
| Ultrasound (BUSI) | MedSAM-3 T+I | 0.8831 (↑0.10) |
| Ultrasound (BUSI) | MedSAM-3 Agent | 0.8064 (↑0.06) |
| MRI (PROMISE12) | nnU-Net | 0.9011 |
| MRI (PROMISE12) | Swin UNETR | 0.8934 |
| MRI (PROMISE12) | U-Mamba | 0.9002 |
| MRI (PROMISE12) | SAM 3 T | 0.6110 (↓0.29) |
| CT (LiTS) | nnU-Net | 0.7714 |
| CT (LiTS) | Swin UNETR | 0.7425 |
| CT (LiTS) | U-Mamba | 0.7910 |
| CT (LiTS) | SAM 3 T | 0.1374 (↓0.65) |
| Video (PolypGen) | Polyp-PVT | 0.6205 |
| Video (PolypGen) | SAM 3 T+I | 0.6903 (↑0.07) |
Key findings include:
- Off-the-shelf SAM 3 fails on domain-shifted medical tasks without geometric prompts (Dice deficits of roughly $0.3$–$0.65$ versus specialist baselines).
- MedSAM-3 T+I recovers or exceeds domain-specialist benchmarks in 2D segmentation.
- Agentic refinement with the Gemini 3 Pro MLLM adds Dice improvements of up to roughly $0.05$ in complex instances.
- A plausible implication is that LLM-guided agentic loops systematically mitigate hard cases where direct prompt-to-mask mappings are insufficient (Liu et al., 24 Nov 2025).
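For reference, a minimal NumPy implementation of the Dice similarity coefficient used in the table above:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Dice similarity coefficient between binary masks (higher is better)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```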
7. Pipeline Visualization and Operational Summary
The pipeline is structured as depicted in Figure 1 (MedSAM-3 blocks: PE, Detector, Tracker+Memory) and Figure 2 (agent-in-the-loop: LLM Planning → MedSAM-3 Segment → LLM Evaluation → Prompt Refinement ...). The integrated agent loop extends MedSAM-3 to:
- Ground open-vocabulary clinical prompts into anatomically precise masks.
- Reduce annotation burden and tune segmentation for evolving clinical goals.
- Achieve high sample efficiency and adaptability for previously unsupported modalities.
As summarized in the originating paper, MedSAM-3 Agent constitutes a medically adapted, concept-driven segmentation system that leverages an agentic LLM-vision loop to raise performance and flexibility ceilings in biomedical image analysis (Liu et al., 24 Nov 2025).