MedSAM-3 Agent: LLM-Enhanced Segmentation
- MedSAM-3 Agent is an agent-in-the-loop segmentation framework that integrates a medically fine-tuned SAM 3 backbone with a Multimodal LLM for text-driven iterative segmentation.
- It employs a dual-encoder–decoder architecture with a detector, tracker, and memory module to achieve precise segmentation across 2D and video modalities.
- The system leverages open-vocabulary clinical prompts and iterative mask refinement to overcome annotation challenges and enhance performance in diverse medical imaging tasks.
MedSAM-3 Agent is an agent-in-the-loop segmentation framework integrating a medically fine-tuned Segment Anything Model (SAM 3) backbone with a Multimodal LLM (MLLM) for iterative, text-driven segmentation of medical images and video. The agent achieves medical Promptable Concept Segmentation (PCS), leveraging open-vocabulary clinical prompts and iterative mask refinement to address generalizability and annotation barriers endemic to medical image analysis (Liu et al., 24 Nov 2025).
1. System Architecture
MedSAM-3 Agent builds on a dual-encoder–decoder paradigm with an MLLM-driven agent loop. The modular architecture includes:
- Perception Encoder (PE): Employs a frozen vision backbone (ViT/ResNet) and text encoder to produce aligned image ($z_I$) and prompt ($z_p$) embeddings.
- Detector Head: A transformer decoder stack that fuses the PE outputs into segmentation mask logits $\hat{M}$ and soft masks $\sigma(\hat{M})$, fine-tuned on paired medical image–concept data.
- Tracker (+Memory): For video segmentation, a memory bank maintains temporal frame features with self- and cross-attention, inherited from MedSAM2.
- MLLM Reasoning Module: An external MLLM (e.g., Gemini 3 Pro) parses high-level queries, formulates or refines prompts, evaluates returned masks, and iteratively interacts with MedSAM-3 until task completion.
The agent pipeline, as diagrammed in Figure 1 and Figure 2 of the paper, orchestrates data flow as follows: an input image or video frame is embedded into $z_I$; a text prompt $p_t$ yields embedding $z_p$; combined via the detector head, mask logits $\hat{M}$ are predicted; the resulting mask $m_t$ is returned to the agent; the MLLM inspects $m_t$, updates its context, and produces a refined prompt $p_{t+1}$, looping until convergence (Liu et al., 24 Nov 2025).
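A minimal sketch of this forward data flow, assuming hypothetical `encoder` and `detector` callables as stand-ins for the frozen PE and fine-tuned detector head (the paper does not publish this API):

```python
import torch

def segment_once(image: torch.Tensor, prompt: str, encoder, detector) -> torch.Tensor:
    """One PCS forward pass: image + text prompt -> soft mask.
    `encoder` and `detector` are illustrative stand-ins, not released components."""
    z_i = encoder.embed_image(image)   # z_I: frozen vision embedding
    z_p = encoder.embed_text(prompt)   # z_p: frozen text embedding
    logits = detector(z_i, z_p)        # M-hat: mask logits from the detector head
    return torch.sigmoid(logits)       # soft mask in [0, 1]
```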
2. Promptable Concept Segmentation (PCS) Formulation
PCS represents the shift from geometric (point/box) to open-vocabulary text prompts for region-of-interest targeting:
- $I$: input medical image/frame sequence.
- $p$: short noun-phrase text prompt (a few words).
- $E_I$, $E_T$: frozen image/text encoders.
- $z_I = E_I(I)$, $z_p = E_T(p)$.
- Detector Head: $D_\theta$ with parameters $\theta$ fuses $(z_I, z_p)$, yielding mask logits $\hat{M} = D_\theta(z_I, z_p)$.
- Mask Prediction: $\hat{Y} = \sigma(\hat{M})$, where $\sigma$ is the sigmoid.

Supervision uses a composite loss

$$\mathcal{L}(\hat{Y}, Y) = \mathcal{L}_{\mathrm{Dice}}(\hat{Y}, Y) + \lambda\,\mathcal{L}_{\mathrm{BCE}}(\hat{Y}, Y),$$

with $\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i \hat{Y}_i Y_i}{\sum_i \hat{Y}_i + \sum_i Y_i}$ and $\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_i \big[Y_i \log \hat{Y}_i + (1 - Y_i)\log(1 - \hat{Y}_i)\big]$.

The fine-tuning objective is

$$\theta^\ast = \arg\min_\theta\; \mathbb{E}_{(I, p, Y)}\big[\mathcal{L}\big(\sigma(D_\theta(E_I(I), E_T(p))), Y\big)\big].$$
This approach decouples conceptual segmentation from pure geometry, enabling direct mapping from free-form clinical phrases to anatomical mask predictions (Liu et al., 24 Nov 2025).
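A minimal PyTorch rendering of this composite objective, assuming the standard soft-Dice and binary cross-entropy forms (the weighting $\lambda$ is an assumed hyperparameter, not a value reported in the paper):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on sigmoid probabilities; `target` is a binary mask."""
    probs = torch.sigmoid(logits).flatten(1)
    target = target.float().flatten(1)
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def pcs_loss(logits: torch.Tensor, target: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Composite PCS objective L = L_Dice + lam * L_BCE; `lam` is an assumed weight."""
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    return dice_loss(logits, target) + lam * bce
```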
3. Agent Integration and Reasoning Loop
MedSAM-3 Agent operationalizes agentic refinement through tightly coupled LLM–segmenter interaction, formalized as:
Algorithm 1: MedSAM-3 Agent Workflow
- Input: image/video $I$, clinical query $q$
- Initialize memory $\mathcal{M} \leftarrow \emptyset$, round $t \leftarrow 0$, initial prompt $p_0 \leftarrow \mathrm{MLLM.plan}(q)$
- Repeat (until converged or $t = T_{\max}$):
  - $m_t \leftarrow \mathrm{MedSAM\text{-}3.segment}(I, p_t)$
  - $(f_t, c_t) \leftarrow \mathrm{MLLM.evaluate}(I, m_t, q, \mathcal{M})$, yielding feedback and confidence
  - If $c_t \geq \tau$, break
  - $p_{t+1} \leftarrow \mathrm{MLLM.refine}(p_t, f_t)$; update $\mathcal{M}$; $t \leftarrow t + 1$
- Return final set of masks $\{m_0, \dots, m_t\}$
The MLLM sequentially plans prompts, evaluates segmentation outputs, provides feedback (e.g., mask coverage, semantic fit), and iteratively sharpens task execution. Termination occurs upon reaching an internal confidence threshold $\tau$, a plateau in predicted IoU change below $\epsilon$, or $T_{\max}$ rounds (usually 3). This loop enhances open-vocabulary alignment and reduces systematic segmentation errors (Liu et al., 24 Nov 2025).
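A runnable sketch of this workflow under stated assumptions: the `mllm` object (with `plan`, `evaluate`, and `refine` methods) and the `segment` callable are hypothetical stand-ins for the reasoning module and the MedSAM-3 backbone:

```python
def agent_loop(image, query, mllm, segment, tau: float = 0.9, t_max: int = 3):
    """Agent-in-the-loop refinement: plan -> segment -> evaluate -> refine.
    `tau` and `t_max` mirror the defaults reported in the paper; the
    `mllm`/`segment` interfaces are illustrative assumptions."""
    prompt = mllm.plan(query)                    # p_0 from the clinical query
    masks = []
    for t in range(t_max):
        mask = segment(image, prompt)            # m_t <- MedSAM-3.segment(I, p_t)
        masks.append(mask)
        feedback, confidence = mllm.evaluate(image, mask, query)
        if confidence >= tau:                    # internal confidence threshold met
            break
        prompt = mllm.refine(prompt, feedback)   # p_{t+1}
    return masks
```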
4. Training Procedures and Modalities
MedSAM-3 Agent’s components are trained with the following regimen:
- Encoder Freezing: The PE (vision and text encoders) is kept frozen; only the detector head is fine-tuned (see the sketch below).
- Prompt Regimes: Two paradigms—MedSAM-3 T (text-only prompts), MedSAM-3 T+I (text plus GT bounding box prompts).
- Modalities: Evaluation spans BUSI Ultrasound (US), RIM-ONE fundus, ISIC skin, and Kvasir endoscopy (all 2D). Video tracking leverages the memory module.
- Optimization: AdamW optimizer, batch size 16, on A100 GPUs.
- Losses: as in the PCS formulation, $\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda\,\mathcal{L}_{\mathrm{BCE}}$.
- A plausible implication is that decoupling visual backbone and detector head adaptation facilitates domain transfer while minimizing computational overhead.
This paradigm establishes robust medical segmentation in low-data or domain-shifted contexts (Liu et al., 24 Nov 2025).
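A PyTorch sketch of this freezing-and-fine-tuning regimen; the module attribute names (`image_encoder`, `text_encoder`, `detector_head`) and the learning rate are illustrative assumptions, not the paper's released configuration:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module,
                             lr: float = 1e-4,
                             weight_decay: float = 0.01) -> torch.optim.AdamW:
    """Freeze the perception encoder; hand only detector-head parameters to AdamW.
    Attribute names and hyperparameter values are assumptions for illustration."""
    for module in (model.image_encoder, model.text_encoder):
        for p in module.parameters():
            p.requires_grad = False              # PE stays fixed
    head_params = list(model.detector_head.parameters())
    return torch.optim.AdamW(head_params, lr=lr, weight_decay=weight_decay)
```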
5. Agent-in-the-Loop Refinement and Convergence
The agent refines segmentation via an MLLM-driven workflow:
- Post-mask evaluation includes approximated IoU, boundary checks, and semantic region analysis.
- If essential regions are missed or the mask over-covers, the agent suggests prompt edits (e.g., “focus on the left lobe,” “shrink the boundary by 3 pixels”).
- Convergence is declared when: (a) the predicted IoU change falls below $\epsilon$ over two consecutive rounds; (b) MLLM confidence exceeds $\tau$ (e.g., $\tau = 0.9$); or (c) the round count reaches $T_{\max}$ (typically 3); see the stopping-rule sketch after this list.
- This suggests that dynamic prompt refinement, guided by both semantic (language) and geometric (mask-evaluation) feedback, enables behaviors for complex anatomical structures that are classically hard to script.
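These criteria can be collapsed into a single stopping predicate; `tau` and `t_max` mirror the defaults above, while the plateau tolerance `eps` is an assumed value:

```python
def converged(iou_history: list[float], confidence: float, t: int,
              eps: float = 0.01, tau: float = 0.9, t_max: int = 3) -> bool:
    """Stopping rule per criteria (a)-(c); `eps` is an assumed tolerance."""
    plateau = len(iou_history) >= 2 and abs(iou_history[-1] - iou_history[-2]) < eps
    return plateau or confidence >= tau or t >= t_max
```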
6. Quantitative Performance and Comparative Metrics
MedSAM-3 Agent outperforms both specialist architectures and generalist models across diverse medical imaging modalities. Representative Dice similarity coefficients (higher is better; the metric is sketched in code at the end of this section):
| Modality | Model | Dice Score |
|---|---|---|
| X-ray (COVID-QU-Ex) | U-Net | 0.7880 |
| X-ray (COVID-QU-Ex) | UNet3+ | 0.7928 |
| X-ray (COVID-QU-Ex) | SAM 3 T+I | 0.7405 (↓0.05) |
| Ultrasound (BUSI) | U-Net | 0.7618 |
| Ultrasound (BUSI) | UNet3+ | 0.7782 |
| Ultrasound (BUSI) | MedSAM-3 T+I | 0.8831 (↑0.10) |
| Ultrasound (BUSI) | MedSAM-3 Agent | 0.8064 (↑0.06) |
| MRI (PROMISE12) | nnU-Net | 0.9011 |
| MRI (PROMISE12) | Swin UNETR | 0.8934 |
| MRI (PROMISE12) | U-Mamba | 0.9002 |
| MRI (PROMISE12) | SAM 3 T | 0.6110 (↓0.29) |
| CT (LiTS) | nnU-Net | 0.7714 |
| CT (LiTS) | Swin UNETR | 0.7425 |
| CT (LiTS) | U-Mamba | 0.7910 |
| CT (LiTS) | SAM 3 T | 0.1374 (↓0.65) |
| Video (PolypGen) | Polyp-PVT | 0.6205 |
| Video (PolypGen) | SAM 3 T+I | 0.6903 (↑0.07) |
Key findings include:
- Off-the-shelf SAM 3 fails on domain-shifted medical tasks without geometric prompts (Dice deficits of roughly $0.3$–$0.65$ versus specialist baselines).
- MedSAM-3 T+I recovers or exceeds domain-specialist benchmarks in 2D segmentation.
- Agentic refinement with the Gemini 3 Pro MLLM adds Dice improvements of up to roughly $0.05$ in complex instances.
- A plausible implication is that LLM-guided agentic loops systematically mitigate hard cases where direct prompt-to-mask mappings are insufficient (Liu et al., 24 Nov 2025).
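For reference, a minimal NumPy implementation of the Dice similarity coefficient used in the table above:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Dice similarity coefficient between binary masks (higher is better)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```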
7. Pipeline Visualization and Operational Summary
The pipeline is structured as depicted in Figure 1 (MedSAM-3 blocks: PE, Detector, Tracker+Memory) and Figure 2 (agent-in-the-loop: LLM Planning → MedSAM-3 Segment → LLM Evaluation → Prompt Refinement ...). The integrated agent loop extends MedSAM-3 to:
- Ground open-vocabulary clinical prompts into anatomically precise masks.
- Reduce annotation burden and tune segmentation for evolving clinical goals.
- Achieve high sample efficiency and adaptability for previously unsupported modalities.
As summarized in the originating paper, MedSAM-3 Agent constitutes a medically adapted, concept-driven segmentation system that leverages an agentic LLM-vision loop to raise performance and flexibility ceilings in biomedical image analysis (Liu et al., 24 Nov 2025).