
IBISAgent: Multimodal Segmentation in Biomedicine

Updated 13 January 2026
  • IBISAgent is an agentic multimodal LLM framework that reformulates biomedical segmentation as an iterative, vision-centric decision process.
  • It leverages an overlay mechanism and external segmentation tools to iteratively refine masks, yielding robust generalization and high accuracy.
  • The design decouples base LLM capabilities from segmentation training, preserving linguistic reasoning and mitigating catastrophic forgetting.

IBISAgent is an agentic multimodal LLM (MLLM) framework designed to reinforce pixel-level visual reasoning for universal biomedical object referring and segmentation. Departing from prior approaches that integrate segmentation capabilities via implicit tokens or trainable vision-language fusion modules, IBISAgent reformulates segmentation as a vision-centric, multi-step decision process, engaging off-the-shelf segmentation tools in a loop of interleaved reasoning and action outputs. The architecture achieves robust generalization across domains, state-of-the-art segmentation accuracy, and mitigates the risk of catastrophic forgetting inherent in tightly coupled LLM-decoder training paradigms (Jiang et al., 6 Jan 2026).

1. Architectural Overview

IBISAgent operates as a control layer above a frozen or lightly fine-tuned base MLLM such as Qwen2.5-VL-7B, with no architectural modifications to the underlying model. It introduces no new fusion layers or implicit segmentation tokens; instead, it works entirely through the model's original next-token generation interface:

  • Interaction loop: At each decision step $t$, the model receives an overlayed image $o_t = \text{Overlay}(I, M_t)$, a semi-transparent blend of the raw image $I$ and the current mask $M_t$. The MLLM outputs:
    • <think> ... $r_t$ ... </think>: internal, text-based reasoning that directs visual focus.
    • <action> ... $a_t$ ... </action>: a click action $a_t = (\text{Target},\, p_t \in \{+1, -1\},\, (x_t, y_t))$.
  • Tool integration: The click sequence $\{a_0, \ldots, a_t\}$ and the prior mask $M_t$ are passed to a segmentation model $F_{\mathrm{seg}}$ (e.g., MedSAM2), which produces an updated segmentation mask $M_{t+1}$.
  • Iteration: The updated overlay observation $o_{t+1}$ is returned to the MLLM, and the reasoning-action cycle repeats until a terminal <answer>...</answer> is emitted.

This vision-centric agentic approach avoids tightly coupling external decoders to the LLM weights, preserving the text output space and enabling off-the-shelf tool integration (the loop is sketched in code at the end of Section 3.1).

2. Formalization as a Multi-Step Markov Decision Process

Segmentation is cast as a $T$-step Markov Decision Process (MDP), with the agentic state at step $t$ defined as

$$s_t = (I, Q, P_{<t}, M_t),$$

where $Q$ is the input query, $P_{<t}$ is the full action-reasoning-observation history up to step $t-1$, and $M_t$ is the current segmentation mask. The action space $\mathcal{A}$ consists of either a final-answer output or a click action $a_t$. Transitions follow

$$o_t = \text{Overlay}(I, M_t), \qquad M_{t+1} = F_{\mathrm{seg}}(I; \{a_0, \ldots, a_t\}, M_t), \qquad s_{t+1} = (I, Q, P_{<t+1}, M_{t+1}).$$

The agent samples its next internal reasoning and action from its policy

$$(r_{t+1},\, a_{t+1}) \sim \pi_\theta\bigl(\cdot \mid I, Q, P_{<t+1}\bigr),$$

under the standard next-token distribution of the LLM.

3. Two-Stage Training Methodology

3.1 Cold-Start Supervised Fine-Tuning (SFT)

IBISAgent's SFT uses 456,000 synthetic reasoning-plus-click trajectories derived from BioMedParseData, each comprising explicit overlay-observation and action pairs and initialized from an empty mask. Oracle-style prompts, delivered by a GPT-5 "teacher," generate structured <think> reasoning traces. The SFT loss is

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{(I, Q, P^*) \in D_{\mathrm{cold}}} \sum_{t=1}^{T} \log \pi_\theta(r_t, a_t \mid I, Q, P^*_{<t}),$$

where $P^*$ are ground-truth stepwise trajectories. Self-correction cases are masked out of the loss.
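A minimal sketch of this interaction loop, which each cold-start trajectory also follows, is shown below. The `mllm_generate` and `seg_tool` wrappers and the action serialization are illustrative assumptions, not the authors' API:

```python
import re
import numpy as np

def overlay(image: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """o_t = Overlay(I, M_t): semi-transparent red tint over masked pixels of the raw image."""
    tint = np.zeros_like(image)
    tint[..., 0] = 1.0  # red channel; image assumed RGB float in [0, 1]
    blend = (1.0 - alpha) * image + alpha * tint
    return np.where(mask[..., None] > 0, blend, image)

def run_agentic_loop(image, query, mllm_generate, seg_tool, max_steps=10):
    """One reasoning-action-observation rollout of the agentic loop.

    mllm_generate(obs, query, history) -> str and
    seg_tool(image, clicks, prev_mask) -> np.ndarray are hypothetical wrappers
    around the base MLLM and an off-the-shelf segmenter such as MedSAM2.
    """
    mask = np.zeros(image.shape[:2], dtype=np.uint8)   # M_0: empty mask
    clicks, history = [], []
    for _ in range(max_steps):
        obs = overlay(image, mask)                     # o_t = Overlay(I, M_t)
        response = mllm_generate(obs, query, history)  # "<think>r_t</think><action>a_t</action>"
        history.append(response)
        if "<answer>" in response:                     # terminal action ends the MDP
            return mask, response
        # Assumed serialization of a_t = (Target, p_t, (x_t, y_t)), e.g. "lesion, +1, (120, 88)"
        p, x, y = re.search(r"([+-]1),\s*\((\d+),\s*(\d+)\)", response).groups()
        clicks.append((int(x), int(y), int(p)))
        mask = seg_tool(image, clicks, mask)           # M_{t+1} = F_seg(I; {a_0..a_t}, M_t)
    return mask, None
```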

3.2 Agentic Reinforcement Learning (RL)

Subsequent RL is conducted on 886,000 VQA samples (564K with segmentation, 322K standard VQA, no ground-truth action traces), employing a reward function with five components:

  • Format reward $S_{\mathrm{format}}$: validates correct ordering of the <think>, <action>, and <answer> tags.
  • Answer reward $S_{\mathrm{ans}}$: for segmentation, returns $3$ if $\mathrm{IoU} > 0.80$, $2$ if $0.70 < \mathrm{IoU} \le 0.80$, $1$ if $0.50 < \mathrm{IoU} \le 0.70$, and $0$ otherwise.
  • Click-placement reward $S_{\mathrm{click}}$: $+1$/$-1$ depending on whether positive/negative clicks correctly target false-negative/false-positive regions:

$$S_{\mathrm{click}}(a_t) = \begin{cases} +1, & (p_t = +1 \wedge x_t \in \Omega_{FN}) \vee (p_t = -1 \wedge x_t \in \Omega_{FP}), \\ -1, & \text{otherwise}. \end{cases}$$

  • Progressive segmentation reward $S_{\mathrm{pseg}}$: grants $+1$ if $\Delta Q_t = \mathrm{IoU}(M_t, M_{gt}) - \mathrm{IoU}(M_{t-1}, M_{gt}) > 0$.
  • Trajectory length reward $S_{\mathrm{len}}$: $+1$ if $T \le T_{\mathrm{opt}}$, penalized by $-0.2\,(T - T_{\mathrm{opt}})$ otherwise.

The normalized final reward is

$$S = \frac{1}{5}\bigl(S_{\mathrm{ans}} + S_{\mathrm{format}} + S_{\mathrm{click}} + S_{\mathrm{pseg}} + S_{\mathrm{len}}\bigr).$$

Agentic RL is optimized with a clipped policy-gradient objective (a GRPO variant):

$$\mathcal{L}_{\mathrm{RL}} = \mathbb{E}\!\left[-\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{T_i}\min\bigl(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\bigr)\right],$$

where $\hat{A}_i$ is the standardized rollout-reward advantage and $\rho_{i,t}$ is the per-token importance ratio.
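The per-component rewards translate directly into code. Below is a minimal sketch under the definitions above; `omega_fn`/`omega_fp` are boolean false-negative/false-positive masks, and all function names are illustrative:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Standard mask IoU over boolean arrays."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union else 0.0

def answer_reward(pred: np.ndarray, gt: np.ndarray) -> float:
    """S_ans: 3 / 2 / 1 / 0 for IoU above 0.80 / 0.70 / 0.50 / otherwise."""
    v = iou(pred, gt)
    return 3.0 if v > 0.80 else 2.0 if v > 0.70 else 1.0 if v > 0.50 else 0.0

def click_reward(p, x, y, omega_fn, omega_fp) -> float:
    """S_click: +1 if a positive click lands in Omega_FN or a negative click in Omega_FP, else -1."""
    return 1.0 if (p == +1 and omega_fn[y, x]) or (p == -1 and omega_fp[y, x]) else -1.0

def pseg_reward(mask_t, mask_prev, gt) -> float:
    """S_pseg: +1 whenever a step strictly improves IoU against the ground truth."""
    return 1.0 if iou(mask_t, gt) > iou(mask_prev, gt) else 0.0

def length_reward(T: int, T_opt: int) -> float:
    """S_len: +1 within the step budget, -0.2 per extra step beyond it."""
    return 1.0 if T <= T_opt else -0.2 * (T - T_opt)

def total_reward(s_ans, s_format, s_click, s_pseg, s_len) -> float:
    """Normalized final reward S = (S_ans + S_format + S_click + S_pseg + S_len) / 5."""
    return (s_ans + s_format + s_click + s_pseg + s_len) / 5.0

def grpo_advantages(group_rewards):
    """GRPO-style standardized advantages A_i = (S_i - mean) / std over a group of G rollouts."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```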

4. Iterative Visual Reasoning and Mask Refinement

IBISAgent's recurrent overlay mechanism ensures pixel-level reasoning is conditioned on the evolving mask state. At each loop iteration, the visual encoder re-extracts features $f_t = \text{VisionEncoder}(o_t)$ from the composite of the image and the incremental mask. This construction lets the MLLM focus reasoning on unresolved image regions, supporting iterative reduction of false-positive and false-negative mask areas. Each reasoning-action step is thus explicitly grounded in the spatial structure of the errors in $M_t$, enabling precise click targeting.
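During training, these error regions can be computed directly from the mask pair; a small illustrative helper, assuming boolean numpy masks, with variable names mirroring $\Omega_{FN}$ and $\Omega_{FP}$ from the click reward above:

```python
import numpy as np

def error_regions(mask: np.ndarray, gt: np.ndarray):
    """Spatial error structure of M_t: Omega_FN holds pixels a positive click should
    recover (in the ground truth, missing from the mask); Omega_FP holds pixels a
    negative click should remove (in the mask, absent from the ground truth)."""
    omega_fn = gt & ~mask   # false negatives
    omega_fp = mask & ~gt   # false positives
    return omega_fn, omega_fp
```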

5. Empirical Performance and Ablations

Core Benchmark Results

IBISAgent demonstrates marked improvement over state-of-the-art baselines on diverse in-domain, cross-domain, and held-out settings:

| Model | In-domain IoU / DSC / F1 | MeCOVQA-G+ IoU / DSC / F1 | Held-out IoU / DSC / F1 |
|---|---|---|---|
| IBISAgent | 85.58 / 92.21 / 96.39 | 80.63 / 89.27 / 95.24 | 72.09 / 83.78 / 91.76 |
| Citrus-V | 30.61 / 37.63 / 53.75 | 46.54 / 52.65 / 69.84 | 32.08 / 38.63 / 50.76 |
| UniBiomed | 50.74 / 58.31 / 69.22 | 24.88 / 31.74 / 43.63 | 35.62 / 41.55 / 54.97 |

Training Stage and Reward Ablations

| Configuration | IoU | DSC | F1 |
|---|---|---|---|
| Base w/o tool | 11.8 | 16.8 | 23.5 |
| + SFT only | 53.4 | 62.0 | 68.6 |
| + Reflect | 57.2 | 67.7 | 74.5 |
| + RL only | 62.8 | 71.3 | 77.5 |
| IBISAgent (full) | 72.1 | 83.8 | 91.8 |

Reward-signal ablation on MeCOVQA-G+ further corroborates the impact of the tailored RL reward:

| $S_{\mathrm{click}}$ | $S_{\mathrm{pseg}}$ | $S_{\mathrm{len}}$ | IoU | Avg. Steps |
|---|---|---|---|---|
| – | – | – | 73.8 | 11.3 |
| ✓ | – | – | 76.6 | 10.6 |
| – | ✓ | – | 77.6 | 8.6 |
| – | – | ✓ | 74.2 | 5.9 |
| ✓ | ✓ | ✓ | 80.6 | 3.7 |

6. Generalization and Avoidance of Catastrophic Forgetting

Through a strict decoupling of core MLLM parameters from pixel decoder training, IBISAgent maintains linguistic competency and avoids the catastrophic forgetting observed in approaches with learned segmentation tokens or fused decoders (e.g., LISA-family models). The agentic tool-call interface and iterative RL enable the model to generalize to unseen modalities and out-of-domain settings:

  • On MeCOVQA-G+ (five modalities not present in SFT data), IBISAgent achieves 80.6 IoU, surpassing Citrus-V (46.5) and UniBiomed (24.9).
  • On internal held-out data, IBISAgent delivers a 72.1 IoU, compared to the 20.1–35.6 range of prior medical MLLMs.
  • Removing the RL training phase results in a ~9 IoU decline on held-out samples, demonstrating the necessity of agentic RL for robust generalization.

The iterative, action-conditioned regime, combined with dense, anatomically informed reward signals, fosters visual reasoning grounded in the image content rather than dataset-level statistical regularities.

7. Significance in Biomedical Visual Reasoning

IBISAgent establishes a paradigm for integrating vision-language systems with external tools in a closed-loop, agentic workflow. Its advances are fourfold:

  • Preservation of base LLM integrity: By requiring no architectural change or new token types, foundational language and reasoning abilities are retained.
  • Universal tool-based segmentation: Any compatible mask-generating model (e.g., MedSAM2) can serve as the external tool, supporting plug-and-play modularity (see the interface sketch after this list).
  • Iteratively refined outputs: Multi-step MDP framing enables mask improvement beyond single-pass approaches.
  • Empirical superiority and transfer: SOTA results across benchmarks, robust cross-modality transfer, and resilience to catastrophic forgetting position IBISAgent as a leading approach for universal biomedical object referring and segmentation (Jiang et al., 6 Jan 2026).
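The plug-and-play modularity in the second point above can be made concrete as a structural interface. A hedged sketch of one possible tool contract follows; the Protocol name and signature are illustrative assumptions, not the authors' API:

```python
from typing import Protocol, Sequence, Tuple
import numpy as np

class SegmentationTool(Protocol):
    """Plug-and-play tool contract: any promptable segmenter that refines a mask
    from accumulated clicks fits the agentic loop, e.g., a wrapper around MedSAM2."""
    def __call__(self, image: np.ndarray,
                 clicks: Sequence[Tuple[int, int, int]],   # (x, y, polarity)
                 prev_mask: np.ndarray) -> np.ndarray: ...
```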