IBISAgent: Multimodal Segmentation in Biomedicine
- IBISAgent is an agentic multimodal LLM framework that reformulates biomedical segmentation as an iterative, vision-centric decision process.
- It leverages an overlay mechanism and external segmentation tools to iteratively refine masks, yielding robust generalization and high accuracy.
- The design decouples base LLM capabilities from segmentation training, preserving linguistic reasoning and mitigating catastrophic forgetting.
IBISAgent is an agentic multimodal LLM (MLLM) framework designed to reinforce pixel-level visual reasoning for universal biomedical object referring and segmentation. Departing from prior approaches that integrate segmentation capabilities via implicit tokens or trainable vision-language fusion modules, IBISAgent reformulates segmentation as a vision-centric, multi-step decision process, engaging off-the-shelf segmentation tools in a loop of interleaved reasoning and action outputs. The architecture achieves robust cross-domain generalization and state-of-the-art segmentation accuracy while mitigating the catastrophic forgetting inherent in tightly coupled LLM-decoder training paradigms (Jiang et al., 6 Jan 2026).
1. Architectural Overview
IBISAgent operates as a control layer above a (frozen or lightly fine-tuned) base MLLM such as Qwen2.5-VL-7B, with no architectural modifications to the underlying model. It introduces no new fusion layers or implicit segmentation tokens, instead leveraging the original next-token generation interface:
- Interaction Loop: At each decision step $t$, the model receives an overlayed image $o_t$, a semi-transparent blend of the raw image $I$ and the current mask $M_t$. The MLLM outputs:
  - `<think> ... r_t ... </think>`: internal, text-based reasoning for visual focus.
  - `<action> ... a_t ... </action>`: a click action $a_t$, an image coordinate paired with a positive/negative label.
- Tool Integration: The click sequence and prior mask are passed to a segmentation model (e.g., MedSAM2), which produces an updated segmentation mask $M_{t+1}$.
- Iteration: The updated overlay observation $o_{t+1}$ is returned to the MLLM, repeating the reasoning-action cycle until a terminal `<answer>...</answer>` is emitted.

This vision-centric agentic approach avoids tightly coupling external decoders to LLM weights, preserving the text output space and enabling off-the-shelf tool integration.

2. Formalization as a Multi-Step Markov Decision Process

Segmentation is cast as a $T$-step Markov Decision Process (MDP), with the agentic state at step $t$ defined as

$$s_t = (q, h_t, M_t),$$

where $q$ is the input query, $h_t$ is the full action-reasoning-observation history up to step $t$, and $M_t$ is the current segmentation mask. The action space consists of either a final-answer output or a click action. Transitions are induced by the segmentation tool, which consumes the accumulated clicks and the prior mask and returns the updated mask $M_{t+1}$, from which the next overlay observation $o_{t+1}$ is rendered. The agent samples its next internal reasoning and action from its policy

$$(r_t, a_t) \sim \pi_\theta(\cdot \mid s_t)$$

under the standard next-token distribution of the LLM.

3. Two-Stage Training Methodology

3.1 Cold-Start Supervised Fine-Tuning (SFT)

IBISAgent's SFT utilizes 456,000 synthetic reasoning+click trajectories derived from BioMedParseData, each comprising explicit overlay observation and action pairs and initiated from an empty mask. Oracle-style prompts, delivered by a GPT-5 "teacher," generate structured `<think>...</think>` reasoning traces. The SFT loss is the stepwise negative log-likelihood

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta\big(r_t^{*}, a_t^{*} \mid s_t\big),$$

where $(r_t^{*}, a_t^{*})$ are the ground-truth stepwise reasoning traces and actions. Self-correction cases are masked out of the loss.
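As a minimal sketch, this masked trajectory loss can be written in PyTorch as below; the tensor shapes, argument names, and masking convention are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sft_trajectory_loss(logits: torch.Tensor,
                        target_ids: torch.Tensor,
                        supervise_mask: torch.Tensor) -> torch.Tensor:
    """Masked NLL over one ground-truth reasoning+click trajectory.

    logits:         (seq_len, vocab_size) next-token logits from the MLLM
    target_ids:     (seq_len,) ground-truth trajectory tokens
    supervise_mask: (seq_len,) 1.0 for supervised tokens, 0.0 for tokens in
                    self-correction steps, which are masked out of the loss
    """
    nll = F.cross_entropy(logits, target_ids, reduction="none")
    return (nll * supervise_mask).sum() / supervise_mask.sum().clamp(min=1.0)
```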
3.2 Agentic Reinforcement Learning (RL)
Subsequent RL is conducted on 886,000 VQA samples (564K with segmentation, 322K standard VQA, none with ground-truth action traces), employing a reward function with five components (a code sketch follows the list below):
- Format reward $R_{\text{fmt}}$: validates correct ordering of the `<think>`, `<action>`, and `<answer>` tags.
- Answer reward $R_{\text{ans}}$: for segmentation, returns 3, 2, 1, or 0 according to tiered IoU thresholds between the predicted and ground-truth masks, with higher overlap earning a larger reward.
- Click-placement reward $R_{\text{click}}$: $+1$/$-1$ according to whether positive/negative clicks correctly target false-negative/false-positive regions, determined against the ground-truth mask.
- Progressive segmentation reward $R_{\text{prog}}$: indicates whether a tool call improves mask quality, i.e., whether $\mathrm{IoU}(M_{t+1}) > \mathrm{IoU}(M_t)$.
- Trajectory length reward $R_{\text{len}}$: full reward when the rollout terminates within the step budget, penalized in proportion to the excess length otherwise.
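A compact sketch of how these shaping rewards could be computed from binary masks (NumPy); the tier thresholds, step budget, and penalty slope are illustrative assumptions, as the paper's exact values are not reproduced here:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 0.0

def answer_reward(pred, gt, tiers=(0.5, 0.75, 0.9)):
    # Tiered reward in {0, 1, 2, 3}; the thresholds here are placeholders.
    return sum(iou(pred, gt) >= t for t in tiers)

def click_reward(x, y, label, mask, gt):
    # +1 if a positive click lands in a false-negative region, or a negative
    # click in a false-positive region; -1 otherwise.
    fn = gt & ~mask          # missed foreground
    fp = mask & ~gt          # spurious foreground
    return 1.0 if (fn if label > 0 else fp)[y, x] else -1.0

def progressive_reward(mask_prev, mask_next, gt):
    # 1 if the tool call strictly improved overlap with the ground truth.
    return 1.0 if iou(mask_next, gt) > iou(mask_prev, gt) else 0.0

def length_reward(num_steps, budget=10):
    # Full reward within the step budget, linear penalty beyond it.
    return 1.0 if num_steps <= budget else max(0.0, 1.0 - 0.1 * (num_steps - budget))
```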
The final reward is a normalized combination of the five components. Agentic RL is optimized with a clipped policy-gradient objective (a GRPO variant),

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big)\right], \qquad \rho_i = \frac{\pi_\theta(\tau_i \mid q)}{\pi_{\theta_{\text{old}}}(\tau_i \mid q)},$$

with the standardized rollout reward advantage $A_i = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$ computed over a group of $G$ rollouts.
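A minimal sketch of the group-standardized advantage and clipped surrogate (PyTorch); the function names and the clipping constant $\epsilon = 0.2$ are illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Standardize final rewards across G rollouts of the same query."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient surrogate over a group of rollouts."""
    ratio = (logp_new - logp_old).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1.0 - eps, 1.0 + eps) * advantages)
    return -surrogate.mean()
```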
4. Iterative Visual Reasoning and Mask Refinement
IBISAgent's recurrent overlay mechanism ensures pixel-level reasoning is conditioned on the evolving mask state. At each loop iteration, the visual encoder re-extracts features from the composite image reflecting the incremental mask. This construction enables the MLLM to focus reasoning on unresolved image regions, supporting iterative reduction of false-positive and false-negative mask areas. Each reasoning-action step is thus explicitly grounded in the spatial structure of errors in $M_t$ and permits precise click targeting.
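The refinement loop can be sketched as follows; `mllm.generate`, `seg_tool.predict`, and the `<action>` click encoding are hypothetical stand-ins for the base MLLM, the MedSAM2 tool, and the paper's tag format:

```python
import re
import numpy as np

def overlay(image: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Semi-transparent blend of the raw image and the current binary mask."""
    out = image.astype(np.float32)
    highlight = np.array([255.0, 0.0, 0.0])        # highlight color is illustrative
    out[mask] = (1.0 - alpha) * out[mask] + alpha * highlight
    return out.astype(np.uint8)

def parse_action(reply: str):
    """Parse a click from '<action>x,y,label</action>' (encoding assumed)."""
    m = re.search(r"<action>\s*(\d+)\s*,\s*(\d+)\s*,\s*([+-]?1)\s*</action>", reply)
    if m is None:
        raise ValueError("malformed <action> tag")
    x, y, label = m.groups()
    return int(x), int(y), int(label)

def segment_loop(mllm, seg_tool, image, query, max_steps=10):
    mask = np.zeros(image.shape[:2], dtype=bool)   # start from an empty mask
    clicks, labels = [], []
    for _ in range(max_steps):
        reply = mllm.generate(overlay(image, mask), query)
        if "<answer>" in reply:                    # terminal answer: stop refining
            break
        x, y, label = parse_action(reply)
        clicks.append((x, y)); labels.append(label)
        mask = seg_tool.predict(image, clicks, labels, prior_mask=mask)
    return mask
```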
5. Empirical Performance and Ablations
Core Benchmark Results
IBISAgent demonstrates marked improvement over state-of-the-art baselines on diverse in-domain, cross-domain, and held-out settings:
| Model | In-domain IoU/DSC/F1 | MeCOVQA-G+ IoU/DSC/F1 | Held-out IoU/DSC/F1 |
|---|---|---|---|
| IBISAgent | 85.58 / 92.21 / 96.39 | 80.63 / 89.27 / 95.24 | 72.09 / 83.78 / 91.76 |
| Citrus-V | 30.61 / 37.63 / 53.75 | 46.54 / 52.65 / 69.84 | 32.08 / 38.63 / 50.76 |
| UniBiomed | 50.74 / 58.31 / 69.22 | 24.88 / 31.74 / 43.63 | 35.62 / 41.55 / 54.97 |
Training Stage and Reward Ablations
| Model | SFT | Reflect | RL | IoU | DSC | F1 |
|---|---|---|---|---|---|---|
| Base w/o tool | | | | 11.8 | 16.8 | 23.5 |
| +SFT only | ✓ | | | 53.4 | 62.0 | 68.6 |
| +Reflect | ✓ | ✓ | | 57.2 | 67.7 | 74.5 |
| +RL only | | | ✓ | 62.8 | 71.3 | 77.5 |
| IBISAgent (full) | ✓ | ✓ | ✓ | 72.1 | 83.8 | 91.8 |
Reward-signal ablation on MeCOVQA-G+ further corroborates the impact of the tailored RL reward:
| $R_{\text{click}}$ | $R_{\text{prog}}$ | $R_{\text{len}}$ | IoU | Avg. Steps |
|---|---|---|---|---|
| - | - | - | 73.8 | 11.3 |
| ✓ | - | - | 76.6 | 10.6 |
| - | ✓ | - | 77.6 | 8.6 |
| - | - | ✓ | 74.2 | 5.9 |
| ✓ | ✓ | ✓ | 80.6 | 3.7 |
6. Generalization and Avoidance of Catastrophic Forgetting
Through a strict decoupling of core MLLM parameters from pixel decoder training, IBISAgent maintains linguistic competency and avoids the catastrophic forgetting observed in approaches with learned segmentation tokens or fused decoders (e.g., LISA-family models). The agentic tool-call interface and iterative RL enable the model to generalize to unseen modalities and out-of-domain settings:
- On MeCOVQA-G+ (five modalities not present in SFT data), IBISAgent achieves 80.6 IoU, surpassing Citrus-V (46.5) and UniBiomed (24.9).
- On internal held-out data, IBISAgent delivers a 72.1 IoU, compared to the 20.1–35.6 range of prior medical MLLMs.
- Removing the RL training phase results in a ~9 IoU decline on held-out samples, demonstrating the necessity of agentic RL for robust generalization.
The iterative, action-conditioned regime, combined with dense, anatomically informed reward signals, fosters visual reasoning grounded in the image content rather than dataset-level statistical regularities.
7. Significance in Biomedical Visual Reasoning
IBISAgent establishes a paradigm for integrating vision-language systems with external tools in a closed-loop, agentic workflow. Its advances are fourfold:
- Preservation of base LLM integrity: By requiring no architectural change or new token types, foundational language and reasoning abilities are retained.
- Universal tool-based segmentation: Any compatible mask-generating model (e.g., MedSAM2) can serve as the external tool, supporting plug-and-play modularity.
- Iteratively refined outputs: Multi-step MDP framing enables mask improvement beyond single-pass approaches.
- Empirical superiority and transfer: SOTA results across benchmarks, robust cross-modality transfer, and resilience to catastrophic forgetting position IBISAgent as a leading approach for universal biomedical object referring and segmentation (Jiang et al., 6 Jan 2026).