MedSAM-3 Agent: LLM-Enhanced Segmentation

Updated 26 November 2025
  • MedSAM-3 Agent is an agent-in-the-loop segmentation framework that integrates a medically fine-tuned SAM 3 backbone with a Multimodal LLM for text-driven iterative segmentation.
  • It employs a dual-encoder–decoder architecture with a detector, tracker, and memory module to achieve precise segmentation across 2D and video modalities.
  • The system leverages open-vocabulary clinical prompts and iterative mask refinement to overcome annotation challenges and enhance performance in diverse medical imaging tasks.

MedSAM-3 Agent is an agent-in-the-loop segmentation framework integrating a medically fine-tuned Segment Anything Model (SAM 3) backbone with a Multimodal LLM (MLLM) for iterative, text-driven segmentation of medical images and video. The agent achieves medical Promptable Concept Segmentation (PCS), leveraging open-vocabulary clinical prompts and iterative mask refinement to address generalizability and annotation barriers endemic to medical image analysis (Liu et al., 24 Nov 2025).

1. System Architecture

MedSAM-3 Agent builds on a dual-encoder–decoder paradigm with an MLLM-driven agent loop. The modular architecture includes:

  • Perception Encoder (PE): Employs a frozen vision backbone (ViT/ResNet) and text encoder to produce aligned image ($F$) and prompt ($e$) embeddings.
  • Detector Head: A transformer decoder stack that fuses PE outputs, creating segmentation mask logits ($S$) and soft masks $M(x,p) = \sigma(S)$, fine-tuned on paired medical image–concept data.
  • Tracker (+Memory): For video segmentation, a memory bank maintains temporal frame features with self- and cross-attention, inherited from MedSAM2.
  • MLLM Reasoning Module: An external MLLM (e.g., Gemini 3 Pro) parses high-level queries, formulates or refines prompts, evaluates returned masks, and iteratively interacts with MedSAM-3 until task completion.

The agent pipeline, as diagrammed in Figure 1 and Figure 2 of the paper, orchestrates data flow as follows: an input image or video frame $x$ is embedded into $F$; a text prompt $p$ yields embedding $e$; combined via the detector head, $S$ is predicted; mask $M$ is returned to the agent; the MLLM inspects $M$, updates context, and produces a refined prompt $p'$, looping until convergence (Liu et al., 24 Nov 2025).
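
A minimal sketch of this single detector pass may help fix the data flow; the `model` attributes (`image_encoder`, `text_encoder`, `detector_head`) are illustrative stand-ins, not the paper's released API:

```python
import torch

def pcs_forward(model, image: torch.Tensor, prompt: str) -> torch.Tensor:
    """One PCS detector pass: x -> F, p -> e, (F, e) -> S, M = sigmoid(S).

    `model` is assumed to expose frozen `image_encoder` / `text_encoder`
    modules and a fine-tuned `detector_head`; these names are hypothetical.
    """
    with torch.no_grad():                # Perception Encoder is frozen
        F = model.image_encoder(image)   # image embedding F
        e = model.text_encoder(prompt)   # prompt embedding e
    S = model.detector_head(F, e)        # mask logits S
    return torch.sigmoid(S)              # soft mask M(x, p) in [0, 1]^{h x w}
```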

2. Promptable Concept Segmentation (PCS) Formulation

PCS represents the shift from geometric (point/box) to open-vocabulary text prompts for region-of-interest targeting:

  • $x \in \mathbb{R}^{H \times W \times C}$: input medical image/frame sequence.
  • $p$: short noun-phrase text prompt ($\leq 3$ words).
  • $E_{\mathrm{img}}, E_{\mathrm{txt}}$: frozen image/text encoders.
  • $F = E_{\mathrm{img}}(x) \in \mathbb{R}^{h \times w \times d_v}$, $e = E_{\mathrm{txt}}(p) \in \mathbb{R}^{d_t}$.
  • Detector Head: $\phi_{\mathrm{det}}$ with parameters $\theta_{\mathrm{det}}$ fuses $F, e$, yielding mask logits $S \in \mathbb{R}^{h \times w}$.
  • Mask Prediction: $M(x,p) = \sigma(S) \in [0,1]^{h \times w}$, where $\sigma$ is the sigmoid.

Supervision uses a composite loss:

$$L_{\mathrm{seg}} = L_{\mathrm{BCE}}(M(x,p), y) + \lambda\, L_{\mathrm{Dice}}(M(x,p), y)$$

with

$$L_{\mathrm{BCE}} = -\sum_i \left[ y_i \log m_i + (1 - y_i) \log(1 - m_i) \right]$$

$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_i m_i y_i}{\sum_i m_i + \sum_i y_i}$$

The fine-tuning objective is

$$\theta_{\mathrm{det}}^{*} = \operatorname{argmin}_{\theta_{\mathrm{det}}} \sum_{i=1}^{N} L_{\mathrm{seg}}\left( M_{\theta_{\mathrm{det}}}(x_i, p_i),\, y_i \right)$$

This approach decouples conceptual segmentation from pure geometry, enabling direct mapping from free-form clinical phrases to anatomical mask predictions (Liu et al., 24 Nov 2025).
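
The composite supervision translates directly into a few lines of PyTorch. A minimal sketch: the mean-reduced BCE (versus the sum in the equation above) and the `eps` stabilizer are common implementation conveniences, and the `lam` default is an assumption, not the paper's reported $\lambda$:

```python
import torch
import torch.nn.functional as Fn

def seg_loss(logits: torch.Tensor, y: torch.Tensor,
             lam: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """L_seg = L_BCE + lambda * L_Dice over mask logits S and binary target y.

    `lam` and `eps` are assumed values; the paper's lambda is not given here.
    """
    # BCE computed on logits for numerical stability (mean reduction).
    l_bce = Fn.binary_cross_entropy_with_logits(logits, y)
    m = torch.sigmoid(logits)  # soft mask M(x, p)
    l_dice = 1.0 - (2.0 * (m * y).sum()) / (m.sum() + y.sum() + eps)
    return l_bce + lam * l_dice
```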

3. Agent Integration and Reasoning Loop

MedSAM-3 Agent operationalizes agentic refinement through tightly coupled LLM–segmenter interaction, formalized as:

Algorithm 1: MedSAM-3 Agent Workflow

  • Input: image/video $I$, clinical query $Q$
  • Initialize memory $M_0 \gets \varnothing$, $t \gets 1$
  • Repeat (until $\mathrm{done}$ or $t > T_{\mathrm{max}}$):

    1. $p_t \leftarrow \mathrm{MLLM.plan}(I, Q, M_{t-1})$
    2. $m_t \leftarrow \mathrm{MedSAM\text{-}3.segment}(I, p_t)$
    3. $f_t \leftarrow \mathrm{MLLM.evaluate}(m_t, Q)$
    4. $M_t \leftarrow M_{t-1} \cup \{(p_t, m_t, f_t)\}$
    5. If $\mathrm{MLLM.confidence}(f_t) \geq \tau$, break
    6. $t \gets t + 1$
  • Return final set of masks $\{m_1, \ldots, m_t\}$

The MLLM sequentially plans prompts, evaluates segmentation outputs, provides feedback (e.g., mask coverage, semantic fit), and iteratively sharpens task execution. Termination occurs upon reaching the internal confidence threshold $\tau$, a plateau in predicted IoU change ($<\epsilon$), or $T_{\mathrm{max}}$ rounds (usually 3). This loop enhances open-vocabulary alignment and reduces systematic segmentation errors (Liu et al., 24 Nov 2025).
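
Algorithm 1 maps directly onto a short control loop. A minimal sketch, assuming `mllm` and `segmenter` objects whose `plan`, `segment`, `evaluate`, and `confidence` methods mirror the calls in the pseudocode (the actual interfaces are not specified in the paper):

```python
def medsam3_agent_loop(image, query, mllm, segmenter, tau=0.9, t_max=3):
    """Agent-in-the-loop workflow: plan -> segment -> evaluate -> refine."""
    memory, masks = [], []
    for t in range(1, t_max + 1):
        prompt = mllm.plan(image, query, memory)  # step 1: propose/refine prompt p_t
        mask = segmenter.segment(image, prompt)   # step 2: MedSAM-3 PCS call -> m_t
        feedback = mllm.evaluate(mask, query)     # step 3: critique coverage / semantic fit
        memory.append((prompt, mask, feedback))   # step 4: M_t = M_{t-1} U {(p_t, m_t, f_t)}
        masks.append(mask)
        if mllm.confidence(feedback) >= tau:      # step 5: confident enough, stop early
            break
    return masks                                  # {m_1, ..., m_t}
```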

4. Training Procedures and Modalities

MedSAM-3 Agent’s components are trained with the following regimen:

  • Encoder Freezing: The PE (vision/text encoders) are fixed; only the detector head is fine-tuned.
  • Prompt Regimes: Two paradigms: MedSAM-3 T (text-only prompts) and MedSAM-3 T+I (text plus ground-truth bounding-box prompts).
  • Modalities: Evaluation spans BUSI Ultrasound (US), RIM-ONE fundus, ISIC skin, and Kvasir endoscopy (all 2D). Video tracking leverages the memory module.
  • Optimization: AdamW, learning rate $\sim 1 \times 10^{-4}$, batch size $\sim 16$, on A100 GPUs.
  • Losses: $L_{\mathrm{seg}}$, as defined in the PCS formulation.
  • A plausible implication is that decoupling visual backbone and detector head adaptation facilitates domain transfer while minimizing computational overhead.

This paradigm establishes robust medical segmentation in low-data or domain-shifted contexts (Liu et al., 24 Nov 2025).
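
A hedged sketch of this regimen, reusing the hypothetical module names from Section 1 and the `seg_loss` sketch from Section 2; the epoch count and data-loader format are assumptions for illustration:

```python
from torch.optim import AdamW

def finetune_detector(model, loader, epochs: int = 10, lr: float = 1e-4):
    """Fine-tune only the detector head; the Perception Encoder stays frozen.

    `model` submodule names, `epochs`, and the loader format are assumed.
    """
    for p in model.image_encoder.parameters():
        p.requires_grad = False  # freeze vision backbone
    for p in model.text_encoder.parameters():
        p.requires_grad = False  # freeze text encoder
    opt = AdamW(model.detector_head.parameters(), lr=lr)
    for _ in range(epochs):
        for image, prompt, y in loader:  # paired image-concept-mask data
            logits = model.detector_head(model.image_encoder(image),
                                         model.text_encoder(prompt))
            loss = seg_loss(logits, y)   # composite BCE + Dice (Section 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
```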

5. Agent-in-the-Loop Refinement and Convergence

The agent refines segmentation via an MLLM-driven workflow:

  • Post-mask evaluation includes approximated IoU, boundary checks, and semantic region analysis.
  • If essential regions are absent or masks are excessive, the agent suggests prompt edits (e.g., “focus on the left lobe,” “shrink boundary by 3 pixels”).
  • Convergence is declared when: (a) predicted IoU change is $<\epsilon$ in two consecutive rounds; (b) MLLM confidence exceeds $\tau$ (e.g., 0.9); or (c) round $t$ reaches $T_{\mathrm{max}}$ (typically 3); these criteria combine into the stopping predicate sketched below.
  • This suggests that dynamic prompt refinement, guided by both semantic (language) and geometric (mask-evaluation) feedback, enables intelligent behaviors for complex anatomical structures that are classically hard to script.
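
An illustrative stopping predicate combining the three criteria; the $\epsilon$ default is an assumed value, while $\tau \approx 0.9$ and $T_{\mathrm{max}} = 3$ follow the text:

```python
def should_stop(iou_history, confidence, t, eps=0.01, tau=0.9, t_max=3):
    """True when any convergence criterion fires.

    (a) predicted IoU changed by < eps in each of the last two rounds,
    (b) MLLM confidence exceeds tau, or (c) the round budget t_max is spent.
    `eps` is an assumed value; the paper gives tau ~ 0.9 and t_max ~ 3.
    """
    plateau = (len(iou_history) >= 3
               and abs(iou_history[-1] - iou_history[-2]) < eps
               and abs(iou_history[-2] - iou_history[-3]) < eps)
    return plateau or confidence > tau or t >= t_max
```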

6. Quantitative Performance and Comparative Metrics

MedSAM-3 Agent outperforms both specialist architectures and generalist models across diverse medical imaging modalities. Representative Dice similarity coefficients (higher is better):

| Modality | Baseline/Model | Dice Score |
|---|---|---|
| X-ray (COVID-QU-Ex) | U-Net | 0.7880 |
| | Unet3+ | 0.7928 |
| | SAM 3 T+I | 0.7405 (↓0.05) |
| Ultrasound (BUSI) | U-Net | 0.7618 |
| | Unet3+ | 0.7782 |
| | MedSAM-3 T+I | 0.8831 (+0.10) |
| | MedSAM-3 Agent | 0.8064 (+0.06) |
| MRI (PROMISE12) | nn-U-Net | 0.9011 |
| | Swin UNETR | 0.8934 |
| | U-Mamba | 0.9002 |
| | SAM 3 T | 0.6110 (↓0.29) |
| CT (LiTS) | nn-U-Net | 0.7714 |
| | Swin UNETR | 0.7425 |
| | U-Mamba | 0.7910 |
| | SAM 3 T | 0.1374 (↓0.65) |
| Video (PolypGen) | Polyp-PVT | 0.6205 |
| | SAM 3 T+I | 0.6903 (↑0.07) |

Key findings include:

  • Off-the-shelf SAM 3 fails on domain-shifted medical tasks without geometric prompts (Dice $\approx 0$–$0.3$).
  • MedSAM-3 T+I recovers or exceeds domain-specialist benchmarks in 2D segmentation.
  • Agentic refinement with the Gemini 3 Pro LLM adds Dice improvements of $+0.03$–$+0.05$ in complex instances.
  • A plausible implication is that LLM-guided agentic loops systematically mitigate hard cases where direct prompt-to-mask mappings are insufficient (Liu et al., 24 Nov 2025).

7. Pipeline Visualization and Operational Summary

The pipeline is structured as depicted in Figure 1 (MedSAM-3 blocks: PE, Detector, Tracker+Memory) and Figure 2 (agent-in-the-loop: LLM Planning → MedSAM-3 Segment → LLM Evaluation → Prompt Refinement ...). The integrated agent loop extends MedSAM-3 to:

  • Ground open-vocabulary clinical prompts into anatomically precise masks.
  • Reduce annotation burden and tune segmentation for evolving clinical goals.
  • Achieve high sample efficiency and adaptability for previously unsupported modalities.

As summarized in the originating paper, MedSAM-3 Agent constitutes a medically adapted, concept-driven segmentation system that leverages an agentic LLM-vision loop to raise performance and flexibility ceilings in biomedical image analysis (Liu et al., 24 Nov 2025).
