IBISAgent: Multimodal Segmentation in Biomedicine
- IBISAgent is an agentic multimodal LLM framework that reformulates biomedical segmentation as an iterative, vision-centric decision process.
- It leverages an overlay mechanism and external segmentation tools to iteratively refine masks, yielding robust generalization and high accuracy.
- The design decouples base LLM capabilities from segmentation training, preserving linguistic reasoning and mitigating catastrophic forgetting.
IBISAgent is an agentic multimodal LLM (MLLM) framework designed to reinforce pixel-level visual reasoning for universal biomedical object referring and segmentation. Departing from prior approaches that integrate segmentation capabilities via implicit tokens or trainable vision-language fusion modules, IBISAgent reformulates segmentation as a vision-centric, multi-step decision process, engaging off-the-shelf segmentation tools in a loop of interleaved reasoning and action outputs. The architecture achieves robust cross-domain generalization and state-of-the-art segmentation accuracy while mitigating the catastrophic forgetting inherent in tightly coupled LLM-decoder training paradigms (Jiang et al., 6 Jan 2026).
1. Architectural Overview
IBISAgent operates as a control layer above a (frozen or lightly fine-tuned) base MLLM such as Qwen2.5-VL-7B, with no architectural modifications to the underlying model. It introduces no new fusion layers or implicit segmentation tokens, instead leveraging the original next-token generation interface:
- Interaction Loop: At each decision step $t$, the model receives an overlayed image $o_t$, a semi-transparent blend of the raw image $I$ and the current mask $M_t$. The MLLM outputs:
  - `<think> ... r_t ... </think>`: internal, text-based reasoning for visual focus.
  - `<action> ... a_t ... </action>`: a click action $a_t$, an image coordinate paired with a positive/negative label.
- Tool Integration: The click sequence and prior mask are passed to a segmentation model (e.g., MedSAM2), which produces an updated segmentation mask $M_{t+1}$.
- Iteration: The updated overlay observation $o_{t+1}$ is returned to the MLLM, repeating the reasoning-action cycle until a terminal `<answer>...</answer>` is emitted.

This vision-centric agentic approach avoids tightly coupling external decoders to LLM weights, preserving the text output space and enabling off-the-shelf tool integration.

2. Formalization as a Multi-Step Markov Decision Process

Segmentation is cast as a $T$-step Markov Decision Process (MDP), with the agentic state at step $t$ defined as

$$s_t = (q, h_t, M_t),$$

where $q$ is the input query, $h_t$ is the full action-reasoning-observation history up to step $t$, and $M_t$ is the current segmentation mask. The action space consists of either a final-answer output or a click action. Transitions are induced by the segmentation tool, which consumes the accumulated clicks and the prior mask and returns the updated mask $M_{t+1}$, from which the next overlay observation $o_{t+1}$ is rendered. The agent samples its next internal reasoning and action from its policy

$$(r_t, a_t) \sim \pi_\theta(\cdot \mid s_t)$$

under the standard next-token distribution of the LLM.

3. Two-Stage Training Methodology

3.1 Cold-Start Supervised Fine-Tuning (SFT)

IBISAgent's SFT utilizes 456,000 synthetic reasoning+click trajectories derived from BioMedParseData, each comprising explicit overlay observation and action pairs and initiated from an empty mask. Oracle-style prompts, delivered by a GPT-5 "teacher," generate structured `<think>...</think>` reasoning traces. The SFT loss is the stepwise negative log-likelihood

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta\big(r_t^{*}, a_t^{*} \mid s_t\big),$$

where $(r_t^{*}, a_t^{*})$ are the ground-truth stepwise reasoning traces and actions. Self-correction cases are masked out of the loss.
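As a minimal sketch, this masked trajectory loss can be written in PyTorch as below; the tensor shapes, argument names, and masking convention are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sft_trajectory_loss(logits: torch.Tensor,
                        target_ids: torch.Tensor,
                        supervise_mask: torch.Tensor) -> torch.Tensor:
    """Masked NLL over one ground-truth reasoning+click trajectory.

    logits:         (seq_len, vocab_size) next-token logits from the MLLM
    target_ids:     (seq_len,) ground-truth trajectory tokens
    supervise_mask: (seq_len,) 1.0 for supervised tokens, 0.0 for tokens in
                    self-correction steps, which are masked out of the loss
    """
    nll = F.cross_entropy(logits, target_ids, reduction="none")
    return (nll * supervise_mask).sum() / supervise_mask.sum().clamp(min=1.0)
```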
3.2 Agentic Reinforcement Learning (RL)
Subsequent RL is conducted on 886,000 VQA samples (564K with segmentation, 322K standard VQA, none with ground-truth action traces), employing a reward function with five components (a code sketch follows the list below):
- Format reward $R_{\text{fmt}}$: validates correct ordering of the `<think>`, `<action>`, and `<answer>` tags.
- Answer reward $R_{\text{ans}}$: for segmentation, returns 3, 2, 1, or 0 according to tiered IoU thresholds between the predicted and ground-truth masks, with higher overlap earning a larger reward.
- Click-placement reward $R_{\text{click}}$: $+1$/$-1$ according to whether positive/negative clicks correctly target false-negative/false-positive regions, determined against the ground-truth mask.
- Progressive segmentation reward $R_{\text{prog}}$: indicates whether a tool call improves mask quality, i.e., whether $\mathrm{IoU}(M_{t+1}) > \mathrm{IoU}(M_t)$.
- Trajectory length reward $R_{\text{len}}$: full reward when the rollout terminates within the step budget, penalized in proportion to the excess length otherwise.
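A compact sketch of how these shaping rewards could be computed from binary masks (NumPy); the tier thresholds, step budget, and penalty slope are illustrative assumptions, as the paper's exact values are not reproduced here:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 0.0

def answer_reward(pred, gt, tiers=(0.5, 0.75, 0.9)):
    # Tiered reward in {0, 1, 2, 3}; the thresholds here are placeholders.
    return sum(iou(pred, gt) >= t for t in tiers)

def click_reward(x, y, label, mask, gt):
    # +1 if a positive click lands in a false-negative region, or a negative
    # click in a false-positive region; -1 otherwise.
    fn = gt & ~mask          # missed foreground
    fp = mask & ~gt          # spurious foreground
    return 1.0 if (fn if label > 0 else fp)[y, x] else -1.0

def progressive_reward(mask_prev, mask_next, gt):
    # 1 if the tool call strictly improved overlap with the ground truth.
    return 1.0 if iou(mask_next, gt) > iou(mask_prev, gt) else 0.0

def length_reward(num_steps, budget=10):
    # Full reward within the step budget, linear penalty beyond it.
    return 1.0 if num_steps <= budget else max(0.0, 1.0 - 0.1 * (num_steps - budget))
```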
The final reward is a normalized combination of the five components. Agentic RL is optimized with a clipped policy-gradient objective (a GRPO variant),

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big)\right], \qquad \rho_i = \frac{\pi_\theta(\tau_i \mid q)}{\pi_{\theta_{\text{old}}}(\tau_i \mid q)},$$

with the standardized rollout reward advantage $A_i = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$ computed over a group of $G$ rollouts.
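A minimal sketch of the group-standardized advantage and clipped surrogate (PyTorch); the function names and the clipping constant $\epsilon = 0.2$ are illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Standardize final rewards across G rollouts of the same query."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient surrogate over a group of rollouts."""
    ratio = (logp_new - logp_old).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1.0 - eps, 1.0 + eps) * advantages)
    return -surrogate.mean()
```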
4. Iterative Visual Reasoning and Mask Refinement
IBISAgent's recurrent overlay mechanism ensures pixel-level reasoning is conditioned on the evolving mask state. At each loop iteration, the visual encoder re-extracts features from the composite image reflecting the incremental mask. This construction enables the MLLM to focus reasoning on unresolved image regions, supporting iterative reduction of false-positive and false-negative mask areas. Each reasoning-action step is thus explicitly grounded in the spatial structure of errors in $M_t$ and permits precise click targeting.
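The refinement loop can be sketched as follows; `mllm.generate`, `seg_tool.predict`, and the `<action>` click encoding are hypothetical stand-ins for the base MLLM, the MedSAM2 tool, and the paper's tag format:

```python
import re
import numpy as np

def overlay(image: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Semi-transparent blend of the raw image and the current binary mask."""
    out = image.astype(np.float32)
    highlight = np.array([255.0, 0.0, 0.0])        # highlight color is illustrative
    out[mask] = (1.0 - alpha) * out[mask] + alpha * highlight
    return out.astype(np.uint8)

def parse_action(reply: str):
    """Parse a click from '<action>x,y,label</action>' (encoding assumed)."""
    m = re.search(r"<action>\s*(\d+)\s*,\s*(\d+)\s*,\s*([+-]?1)\s*</action>", reply)
    if m is None:
        raise ValueError("malformed <action> tag")
    x, y, label = m.groups()
    return int(x), int(y), int(label)

def segment_loop(mllm, seg_tool, image, query, max_steps=10):
    mask = np.zeros(image.shape[:2], dtype=bool)   # start from an empty mask
    clicks, labels = [], []
    for _ in range(max_steps):
        reply = mllm.generate(overlay(image, mask), query)
        if "<answer>" in reply:                    # terminal answer: stop refining
            break
        x, y, label = parse_action(reply)
        clicks.append((x, y)); labels.append(label)
        mask = seg_tool.predict(image, clicks, labels, prior_mask=mask)
    return mask
```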
5. Empirical Performance and Ablations
Core Benchmark Results
IBISAgent demonstrates marked improvement over state-of-the-art baselines on diverse in-domain, cross-domain, and held-out settings:
| Model | In-domain IoU/DSC/F1 | MeCOVQA-G+ IoU/DSC/F1 | Held-out IoU/DSC/F1 |
|---|---|---|---|
| IBISAgent | 85.58 / 92.21 / 96.39 | 80.63 / 89.27 / 95.24 | 72.09 / 83.78 / 91.76 |
| Citrus-V | 30.61 / 37.63 / 53.75 | 46.54 / 52.65 / 69.84 | 32.08 / 38.63 / 50.76 |
| UniBiomed | 50.74 / 58.31 / 69.22 | 24.88 / 31.74 / 43.63 | 35.62 / 41.55 / 54.97 |
Training Stage and Reward Ablations
| Model | SFT | Reflect | RL | IoU | DSC | F1 |
|---|---|---|---|---|---|---|
| Base w/o tool | | | | 11.8 | 16.8 | 23.5 |
| +SFT only | ✓ | | | 53.4 | 62.0 | 68.6 |
| +Reflect | ✓ | ✓ | | 57.2 | 67.7 | 74.5 |
| +RL only | | | ✓ | 62.8 | 71.3 | 77.5 |
| IBISAgent (full) | ✓ | ✓ | ✓ | 72.1 | 83.8 | 91.8 |
Reward-signal ablation on MeCOVQA-G+ further corroborates the impact of the tailored RL reward:
| $R_{\text{click}}$ | $R_{\text{prog}}$ | $R_{\text{len}}$ | IoU | Avg. Steps |
|---|---|---|---|---|
| - | - | - | 73.8 | 11.3 |
| ✓ | - | - | 76.6 | 10.6 |
| - | ✓ | - | 77.6 | 8.6 |
| - | - | ✓ | 74.2 | 5.9 |
| ✓ | ✓ | ✓ | 80.6 | 3.7 |
6. Generalization and Avoidance of Catastrophic Forgetting
Through a strict decoupling of core MLLM parameters from pixel decoder training, IBISAgent maintains linguistic competency and avoids the catastrophic forgetting observed in approaches with learned segmentation tokens or fused decoders (e.g., LISA-family models). The agentic tool-call interface and iterative RL enable the model to generalize to unseen modalities and out-of-domain settings:
- On MeCOVQA-G+ (five modalities not present in SFT data), IBISAgent achieves 80.6 IoU, surpassing Citrus-V (46.5) and UniBiomed (24.9).
- On internal held-out data, IBISAgent delivers a 72.1 IoU, compared to the 20.1–35.6 range of prior medical MLLMs.
- Removing the RL training phase results in a ~9 IoU decline on held-out samples, demonstrating the necessity of agentic RL for robust generalization.
The iterative, action-conditioned regime, combined with dense, anatomically informed reward signals, fosters visual reasoning grounded in the image content rather than dataset-level statistical regularities.
7. Significance in Biomedical Visual Reasoning
IBISAgent establishes a paradigm for integrating vision-language systems with external tools in a closed-loop, agentic workflow. Its advances are fourfold:
- Preservation of base LLM integrity: By requiring no architectural change or new token types, foundational language and reasoning abilities are retained.
- Universal tool-based segmentation: Any compatible mask-generating model (e.g., MedSAM2) can serve as the external tool, supporting plug-and-play modularity.
- Iteratively refined outputs: Multi-step MDP framing enables mask improvement beyond single-pass approaches.
- Empirical superiority and transfer: SOTA results across benchmarks, robust cross-modality transfer, and resilience to catastrophic forgetting position IBISAgent as a leading approach for universal biomedical object referring and segmentation (Jiang et al., 6 Jan 2026).