MedSAM-Agent: Autonomous Medical Segmentation
- MedSAM-Agent is a framework for interactive, autonomous segmentation that integrates foundation models, MLLMs, and agentic reinforcement learning to refine medical image segmentations stepwise.
- It utilizes multimodal inputs such as text, geometric cues, and weak supervision to iteratively improve segmentation accuracy across diverse medical imaging modalities.
- The system combines agent-in-the-loop strategies with RL-based optimization, achieving robust metrics in multicenter, multimodal benchmarks while reducing annotation effort.
MedSAM-Agent is a framework for interactive and autonomous medical image segmentation that leverages advances in foundation segmentation models, multimodal LLMs (MLLMs), agentic reinforcement learning, and weak supervision. MedSAM-Agent orchestrates stepwise segmentation through iterative decision-making, process-level feedback, and high-level clinical reasoning, integrating domain-specialized vision backbones and flexible prompting. Multiple independently developed agents under the “MedSAM-Agent” designation advance medical image segmentation beyond single-turn, rigid interaction paradigms by enabling efficient, robust, and generalizable mask refinement across a wide range of medical imaging modalities (Liu et al., 24 Nov 2025, Ma et al., 2023, Liu et al., 2024, Gaillochet et al., 2024, Liu et al., 3 Feb 2026).
1. Core Architectures and Computational Paradigms
MedSAM-Agent refers to agentic systems built around the MedSAM or related foundation models, capable of stepwise, feedback-driven segmentation via explicit interaction policies. Architectures fall into several categories:
- Agent-in-the-loop with MLLMs: A reasoning module (e.g., Gemini 3 Pro, Qwen3-VL-8B) plans sub-tasks (text/geometry prompts) for the segmentation backbone. It observes images, textual instructions, and previously produced segmentations, emitting prompts that are executed by a masked-segmentation tool (MedSAM-3, SAM2.1-Base, etc.). Iterative feedback—including segmentation confidence or mask diagnostics—enables dynamic prompt refinement (Liu et al., 24 Nov 2025, Liu et al., 3 Feb 2026).
- Reinforcement Learning with Verifiable Rewards (RLVR): MedSAM-Agent reformulates segmentation as a Markov Decision Process (MDP), with multi-turn interactions governed by a trainable agent policy. The agent receives reward signals for process efficiency and outcome quality, and is trained via supervised trajectory imitation followed by RL with process- and outcome-level rewards (Liu et al., 3 Feb 2026).
- Prompt Learning and Automated Agents: Other variants replace user/LLM prompts with lightweight learned prompt modules. These can be trained using weak, few-shot (e.g., 10 annotated boxes) supervision, replacing interactive prompts by image-derived embeddings and enabling automatic segmentation (Gaillochet et al., 2024).
- Iterative Point-to-Box and Proposal Loops: Some MedSAM-Agents use a semantic box-prompt generator to convert sparse cues (points) into box proposals, then apply segmentation and update proposals iteratively, refining the mask selection with minimal annotation (Liu et al., 2024).
Key architectural elements include a perception backbone (ViT or ViT-derived encoder), flexible prompt encoding (boxes, points, text), mask decoding, and orchestration/planning modules (MLLMs or trainable policy networks).
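These roles can be summarized as minimal interfaces. The sketch below is illustrative only: the names `Prompt`, `Planner`, `Segmenter`, `plan`, and `segment` are assumptions for exposition, not identifiers from the cited papers.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol

import numpy as np

@dataclass
class Prompt:
    """Hypothetical prompt container: text plus optional geometric cues."""
    text: Optional[str] = None
    box: Optional[tuple] = None                  # (x0, y0, x1, y1)
    points: list = field(default_factory=list)   # [(x, y, label), ...]

class Planner(Protocol):
    """MLLM or policy network: maps image, instruction, and history to the next prompt."""
    def plan(self, image: np.ndarray, instruction: str, history: list) -> Prompt: ...

class Segmenter(Protocol):
    """Frozen foundation backbone (e.g. a MedSAM-style model): prompt -> binary mask."""
    def segment(self, image: np.ndarray, prompt: Prompt) -> np.ndarray: ...
```

Separating the planner and segmenter behind such interfaces is what lets agent policies be swapped across segmentation backends, as the papers report.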
2. Interactive Segmentation and Reasoning Loops
MedSAM-Agent formalizes segmentation as a sequence of planning, execution, and feedback:
- Planning: At each iteration t, the agent (MLLM or policy network) receives the user instruction U, the current image I, and the segmentation/mask history Hₜ₋₁, and emits a sub-prompt pₜ along with geometric cues gₜ (a bounding box, point set, or textual phrase).
- Execution: The MedSAM-based segmentation backbone S (with weights θ) takes (I, pₜ, gₜ) and returns a binary mask Mₜ = S(I; θ, pₜ, gₜ).
- Feedback: A feedback and memory module computes segmentation confidence cₜ (e.g., via Dice or IoU against the previous mask), stores the interaction history Hₜ, and determines convergence (typically cₜ ≥ τ or t ≥ T).
- Algorithmic Loop:
```
Initialize history H₀ ← ∅
For t = 1 to T:
    1. Plan: MLLM(Hₜ₋₁, U) → (pₜ, gₜ)
    2. Segment: Mₜ = S(I; θ, pₜ, gₜ)
    3. Compute confidence cₜ
    4. Append interaction to Hₜ
    5. If cₜ ≥ τ or t ≥ T, break
Return final mask(s)
```
The loop enables multi-step, targeted refinement of segmentation boundaries, correction of early missteps, and robust handling of complex, multi-structure instructions.
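The loop above can be sketched as runnable Python, with the planner and segmenter passed in as callables. The Dice-based stability check and every name below are illustrative assumptions, not the papers' implementation:

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def agent_loop(image, instruction, planner, segmenter, tau=0.99, max_turns=5):
    """Plan -> segment -> feedback loop; stops on mask stability or turn budget."""
    history, prev_mask = [], None
    for t in range(1, max_turns + 1):
        prompt = planner(image, instruction, history)    # 1. plan next sub-prompt
        mask = segmenter(image, prompt)                  # 2. execute segmentation
        # 3. confidence: Dice against the previous mask (0 on the first turn)
        conf = dice(mask, prev_mask) if prev_mask is not None else 0.0
        history.append((prompt, mask, conf))             # 4. append to memory
        if conf >= tau:                                  # 5. converged: mask stable
            break
        prev_mask = mask
    return history[-1][1], history
```

With a segmenter that returns the same mask twice in a row, the loop terminates on the second turn, which is exactly the stability criterion cₜ ≥ τ in the pseudocode.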
In RLVR-based MedSAM-Agents, the action space consists of tool calls (e.g., add_bbox, add_point, stop_action), and the agent is trained to act parsimoniously through process-level rewards (format compliance, improvement, parsimony) and outcome-level rewards (final Dice/IoU), using group-rewarded policy gradient methods (Liu et al., 3 Feb 2026).
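As an illustration of such a tool-call action space, a minimal parser for textual agent emissions might look like the following. Only the tool names add_bbox, add_point, and stop_action come from the source; the wire format and regex grammar are assumptions:

```python
import re

# Assumed textual protocol: tool calls written as name(arg, arg, ...).
ACTION_RE = re.compile(r"(add_bbox|add_point|stop_action)\(([^)]*)\)")

def parse_actions(text: str):
    """Extract (tool, numeric_args) tuples from an agent's raw output string."""
    actions = []
    for tool, args in ACTION_RE.findall(text):
        nums = [float(x) for x in args.split(",")] if args.strip() else []
        actions.append((tool, nums))
    return actions
```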
3. Mathematical Formulations and Optimization
MedSAM-Agent frameworks involve several intertwined optimization objectives:
- Prompt Optimization: Minimize the total loss across prompts,

  min over p₁,…,p_T of  Σₜ ℒ(S(I; θ, pₜ), Y) + λ·R(pₜ),

  where S is the segmentation model, ℒ is a segmentation loss (Dice or cross-entropy), Y is the reference mask, and R is a regularizer (Liu et al., 24 Nov 2025).
- Agent Loop Convergence: Terminate when cₜ ≥ τ or t ≥ T, i.e., when the confidence between successive masks exceeds the threshold τ or the turn budget T is exhausted.
- RL Objective: Unified total reward per rollout i,

  Rᵢ = R_outcome + R_process,

  where R_outcome scores the final mask (Dice/IoU) and R_process aggregates format-, improvement-, and parsimony-level terms, with policy-gradient training via Group Relative Policy Optimization (GRPO):

  J(θ) = E[(1/G) Σᵢ min(rᵢ Âᵢ, clip(rᵢ, 1−ε, 1+ε) Âᵢ)],

  where Âᵢ = (Rᵢ − mean(R₁…R_G)) / std(R₁…R_G) is the normalized reward advantage and rᵢ = π_θ(aᵢ|s) / π_old(aᵢ|s) is the importance weight (Liu et al., 3 Feb 2026).
For point-supervised agents, candidate box proposals generated from sparse point cues are scored by their agreement with the predicted mask, and the highest-scoring proposal drives the next refinement step. Backpropagation is limited to the proposal refiner; the segmenter is kept frozen (Liu et al., 2024).
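The GRPO group normalization and clipped surrogate above can be sketched numerically. Function names, the clipping value, and log-probability inputs are illustrative:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each rollout's total reward
    within its sampled group of G rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(logp_new, logp_old, adv, clip=0.2):
    """Clipped surrogate: importance ratio times advantage, PPO-style."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1 - clip, 1 + clip)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```

Because advantages are normalized within the group, rollouts that beat their siblings get positive advantage regardless of the absolute reward scale, which is why GRPO needs no learned value baseline.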
4. Experimental Results and Benchmarks
Empirical evaluation of MedSAM-Agent comprises multicenter, multimodal segmentation across CT, MRI, Ultrasound, X-ray, fundus, and endoscopy data. Results highlight:
- BUSI Breast Ultrasound (Liu et al., 24 Nov 2025):
- MedSAM-3 T+I (baseline): Dice = 0.7772
- MedSAM-3 Agent (1 round): Dice = 0.7925
- MedSAM-3 Agent (3 rounds): Dice = 0.8064
- Disabling geometric cues in the agent loop degrades performance (Dice = 0.7748).
- Multi-modality, 21 dataset average (MedSAM-Agent RL) (Liu et al., 3 Feb 2026):
| Method | CT | MRI | X-Ray | US | Fundus | Endoscopy | Overall |
|--------------|------|------|-------|------|--------|-----------|---------|
| SAM2-Box | .863 | .818 | .887 | .808 | .946 | .934 | .876 |
| Ours-MedSAM2 | .836 | .890 | .809 | .889 | .876 | .803 | .843 |
| Ours-IMISNet | .848 | .911 | .947 | .733 | .976 | .924 | .888 |
- Full SFT+RL agent achieves Dice = .794, IoU = .705, mean 2.11 turns.
- Point-supervised brain tumor segmentation (Liu et al., 2024):
- Point-based agent (T=5): Dice = 65.17% (cf. MedSAM box-supervision Dice = 68.74%)
- Automated few-shot prompt learning (Gaillochet et al., 2024):
- With 10-shot training, Dice = 85.2–88.4% on ultrasound/MRI, outperforming baselines.
Ablation studies demonstrate efficiency of hybrid prompting, reduction of redundant actions and “click spam,” and resilience of agent policies across different segmentation backends (Liu et al., 24 Nov 2025, Liu et al., 3 Feb 2026).
5. Agentic Strategies, Prompts, and Reward Design
MedSAM-Agent incorporates advanced strategies to maximize segmentation fidelity and efficiency:
- Hybrid Prompting: Combines global context (bounding box) and local detail (point corrections), mimicking human annotator heuristics. Sequence generation in expert-guided training uses progress constraints (IoU improvements per interaction) and global quality filtering (Liu et al., 3 Feb 2026).
- Process-level Supervision: In addition to final metric rewards (Dice, IoU), agents are trained with process-centric rewards:
- (improvement): Sum of monotonic IoU improvements per step
- (overshoot penalty): Difference between maximum and final IoU
- (interaction cost): Penalizes excessive tool calls
- (format): Enforces agent action protocol (initiate, terminate)
- Prompt selection and optimization: Includes offline prompt-tuning, iterative gradient refinement of prompt embeddings, and prototype-based selection in point-supervised agents. Prompt design flexibly encompasses medical concept phrases, geometric hints, and feedback-serialized representations (e.g., polygon coordinates, numerical mask confidences) (Liu et al., 24 Nov 2025, Liu et al., 2024).
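A minimal sketch of how the process-level terms above could be computed from a per-turn IoU trace. The weights and exact functional forms are assumptions; only the three reward concepts (improvement, overshoot penalty, interaction cost) come from the source:

```python
def process_reward(ious, n_calls, max_calls=8, w=(1.0, 1.0, 0.1)):
    """Combine process-level reward terms from a per-turn IoU trace.

    ious:    IoU after each interaction, e.g. [0.3, 0.6, 0.55].
    n_calls: number of tool calls the agent issued.
    """
    # improvement: sum of positive (monotonic) IoU gains per step
    r_imp = sum(max(b - a, 0.0) for a, b in zip(ious, ious[1:]))
    # overshoot: gap between the best IoU reached and the final IoU
    r_over = max(ious) - ious[-1]
    # interaction cost: penalize each tool call relative to the budget
    r_cost = n_calls / max_calls
    return w[0] * r_imp - w[1] * r_over - w[2] * r_cost
```

Note how a trajectory that peaks and then degrades (0.3 → 0.6 → 0.55) is penalized by the overshoot term, nudging the policy to stop at the best mask rather than keep clicking.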
6. Limitations and Future Directions
MedSAM-Agent, while delivering significant accuracy advancements and streamlined annotation workflows, has notable constraints:
- Inference and Latency: Multi-turn agentic loops can increase inference latency by 2–3× compared to static segmentation models, due to serial MLLM/tool invocation (Liu et al., 24 Nov 2025).
- Domain Dependency: Performance of the reasoning/planner (MLLM) is sensitive to medical-domain reasoning and can degrade under hallucinations or weak prior knowledge (Liu et al., 24 Nov 2025).
- Convergence and Stop Criteria: No theoretical guarantee on iteration counts; thresholds (e.g., τ for mask stability) require empirical tuning (Liu et al., 24 Nov 2025, Liu et al., 3 Feb 2026).
- Supervision and Modality Imbalance: For foundation models pretrained on imbalanced data, rare modalities may underperform. Interactive agents partially mitigate this via feedback, but fundamental representation limits remain (Ma et al., 2023).
Proposed extensions include reinforcement-learning-based joint planner/segmenter training (Liu et al., 24 Nov 2025), direct uncertainty quantification, video temporal reasoning, 3D mask integration, and continual domain adaptation through active online learning. The framework’s modularity allows agent policies to generalize across segmentation backends and to bridge sparse annotation (points, boxes) with high-fidelity clinical segmentation using minimal supervision (Liu et al., 3 Feb 2026, Liu et al., 2024, Gaillochet et al., 2024).
7. Clinical and Research Impact
MedSAM-Agent provides a systematic solution for interactive annotation, annotation parsimony, and automated segmentation in diverse biomedical imaging contexts. Clinical workflow integration is facilitated through user interfaces supporting bounding boxes, clicks, scribbles, textual requests, and on-the-fly result inspection (Ma et al., 2023). Automated and prompt-learned agents make high-accuracy segmentation feasible with few labels, broadening applicability to rare or under-resourced tasks (Gaillochet et al., 2024, Liu et al., 2024). The framework’s emphasis on principled, multi-turn interaction and reinforcement learning establishes a new paradigm for clinical-grade, autonomous annotation tools that explicitly balance accuracy, efficiency, and human-in-the-loop or fully automated operation (Liu et al., 24 Nov 2025, Liu et al., 3 Feb 2026).