Instruction-Guided Lesion Segmentation

Updated 21 November 2025
  • ILS is a paradigm that conditions lesion segmentation on human instructions and imaging data to yield semantically and spatially precise predictions.
  • It leverages unified architectures such as U-Net and vision-language models along with organ-aware supervision to merge automated and expert-guided corrections.
  • Empirical evaluations in PET/CT and CXR modalities show improved Dice scores and reduced false positives/negatives with iterative user interaction.

Instruction-Guided Lesion Segmentation (ILS) refers to a paradigm wherein lesion segmentation models are explicitly conditioned on human-provided instructions—either in the form of user interactions (e.g., clicks) or natural-language descriptions—to yield semantically and spatially precise predictions. ILS integrates interactive, anatomical, and language-based supervision to maximize both automation and expert controllability in diverse imaging settings, including PET/CT and chest X-ray (CXR) modalities. ILS approaches unify classical U-Net architectures, vision-language models (VLMs), organ-aware auxiliary heads, and task-specific user prompting to deliver precise lesion boundaries while allowing iterative expert correction or specification (Huang et al., 2 Sep 2025, Choi et al., 19 Nov 2025).

1. Core Principles and Formal Definitions

At its core, ILS conditions the segmentation process on both the medical image and explicit instructions. The formal definition on chest X-ray imaging specifies the inputs as $(x, T)$, where $x \in \mathbb{R}^{H \times W}$ is the image and $T$ is a natural-language instruction denoting the desired lesion type (e.g., "pneumonia," "cardiomegaly") and an optional anatomical location (e.g., "right lung base") (Choi et al., 19 Nov 2025). The model outputs both a lesion mask $M \in [0,1]^{H \times W}$ and a textual description $Y$ that confirms the finding, optionally detailing certainty, location, or (for inference tasks) a predicted lesion type.

In PET/CT ILS, expert guidance is provided by sequential, spatially localized clicks categorized as foreground (lesion) or background (non-lesion). These are encoded as 3D Gaussian maps and concatenated as additional channels alongside the anatomical (CT) and functional (PET) images. This multi-modal tensor constitutes the model's input for each iteration, allowing for dynamic expert-directed correction (Huang et al., 2 Sep 2025).
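The click encoding can be illustrated with a short sketch (not the authors' code): each click is rendered as a 3D Gaussian centered at the clicked voxel and stacked with the normalized CT and PET volumes as extra input channels. Array shapes, voxel-space coordinates, and the helper names are illustrative assumptions.

```python
import numpy as np

def click_map(shape, clicks_vox, sigma_vox=3.0):
    """Render (z, y, x) click coordinates as a 3D Gaussian heatmap."""
    zz, yy, xx = np.meshgrid(*(np.arange(s) for s in shape), indexing="ij")
    heat = np.zeros(shape, dtype=np.float32)
    for cz, cy, cx in clicks_vox:
        d2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        heat = np.maximum(heat, np.exp(-d2 / (2.0 * sigma_vox ** 2)))
    return heat

def build_input(ct, pet, fg_clicks, bg_clicks, sigma_vox=3.0):
    """Stack the four per-step input channels: CT, PET, FG clicks, BG clicks."""
    return np.stack([ct, pet,
                     click_map(ct.shape, fg_clicks, sigma_vox),
                     click_map(ct.shape, bg_clicks, sigma_vox)], axis=0)
```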

2. Model Architectures and Training Regimes

PET/CT ILS via Residual Encoder U-Net

The primary architecture is a 3D Residual Encoder U-Net (ResEnc-UNet) based on nnU-Net v2. The encoder features five levels, each comprising two residual convolutional blocks (Conv $3\times 3\times 3$ → GroupNorm → LeakyReLU, ×2, with skip connections), with $2\times$ strided downsampling between levels. The decoder reverses this structure with transposed convolutions and skip concatenations (Huang et al., 2 Sep 2025).
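A rough PyTorch sketch of one such residual block (two Conv3d $3\times 3\times 3$ → GroupNorm → LeakyReLU units with an identity skip); channel counts and the number of GroupNorm groups are illustrative assumptions, not the published configuration:

```python
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Two (Conv3d 3x3x3 -> GroupNorm -> LeakyReLU) units with an identity skip."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
            nn.LeakyReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)  # residual skip connection
```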

A dual-head output is implemented: one head segments lesions (softmax for binary mask), whereas an auxiliary head provides organ supervision (softmax over 10 anatomical classes: liver, spleen, kidneys, bladder, lung, brain, heart, stomach, prostate, head/glands).

Composite loss:

  • Lesion segmentation: $L_{\text{lesion}} = \alpha \cdot L_{\text{Dice}}(p, g) + \beta \cdot L_{\text{CE}}(p, g)$.
  • Organ supervision: $L_{\text{organ}} = \frac{1}{10} \sum_{j=1}^{10} \left[ L_{\text{Dice}}(p^{j}, g^{j}) + L_{\text{CE}}(p^{j}, g^{j}) \right]$.
  • Total: $L_{\text{total}} = L_{\text{lesion}} + \lambda \cdot L_{\text{organ}}$.

Hyperparameters (typical): $\alpha = \beta = 1$ for FDG; $\alpha : \beta = 2 : 1$ for PSMA; $\lambda = 1$.
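A minimal sketch of this composite objective, assuming soft Dice over per-class probabilities and expressing the organ term directly over the 10-class softmax for brevity (the paper averages per-class Dice + CE); function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(prob, target, eps=1e-5):
    """prob, target: (B, C, D, H, W) probabilities and one-hot labels."""
    dims = (0, 2, 3, 4)
    inter = (prob * target).sum(dims)
    denom = prob.sum(dims) + target.sum(dims)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def total_loss(lesion_logits, lesion_onehot, organ_logits, organ_onehot,
               alpha=1.0, beta=1.0, lam=1.0):
    # L_lesion = alpha * Dice + beta * CE on the binary lesion mask
    l_lesion = (alpha * soft_dice_loss(lesion_logits.softmax(1), lesion_onehot)
                + beta * F.cross_entropy(lesion_logits, lesion_onehot.argmax(1)))
    # L_organ: Dice + CE over the 10 anatomical classes (simplified form)
    l_organ = (soft_dice_loss(organ_logits.softmax(1), organ_onehot)
               + F.cross_entropy(organ_logits, organ_onehot.argmax(1)))
    # L_total = L_lesion + lambda * L_organ
    return l_lesion + lam * l_organ
```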

CXR ILS via Vision-Language Models

For CXR, ILS employs a VLM backbone and a mask decoder:

  • Example: ROSALIA model, leveraging LLaVA and SAM (Choi et al., 19 Nov 2025). The VLM receives both image and text instruction and generates a hidden [SEG] token encoding segmentation intent; this token's embedding is used by the mask decoder (SAM) for spatial prediction.

Loss: $L = \lambda_{\text{txt}} L_{\text{txt}} + \left(5\,L_{\text{BCE}} + 1\,L_{\text{Dice}}\right)$, with $\lambda_{\text{txt}} = 0.5$; AdamW optimization; data-level balancing (1:1 positive:negative).
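A hedged sketch of this objective, with the VLM forward pass and [SEG]-token decoding abstracted away; tensor shapes and names are assumptions:

```python
import torch.nn.functional as F

def rosalia_style_loss(mask_logits, mask_gt, txt_logits, txt_targets,
                       lambda_txt=0.5, eps=1e-5):
    # Text head: token-level cross-entropy on the generated answer.
    l_txt = F.cross_entropy(txt_logits.flatten(0, 1), txt_targets.flatten())
    # Mask head: per-pixel BCE plus soft Dice on the predicted lesion mask.
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    prob = mask_logits.sigmoid()
    inter = (prob * mask_gt).sum()
    l_dice = 1.0 - (2 * inter + eps) / (prob.sum() + mask_gt.sum() + eps)
    return lambda_txt * l_txt + 5.0 * l_bce + 1.0 * l_dice
```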

3. Instruction Encoding and Interactive Guidance

User Interactions in PET/CT

Foreground and background clicks are simulated via 3D Gaussians ($\sigma \approx 3$ mm), facilitating both supervised training and efficient inference. At each refinement step, the four input channels are:

  1. CT (normalized)
  2. PET (normalized)
  3. Foreground click map
  4. Background click map

During inference, an iterative human-in-the-loop procedure selects new clicks on the largest connected components of the current false-negative (FN) and false-positive (FP) regions, refining the segmentation until convergence or until a click budget is exhausted. Binarization thresholds are tracer-specific (SUV 1.5 for FDG, SUV 1.0 for PSMA) (Huang et al., 2 Sep 2025).
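The refinement loop can be sketched as follows, assuming a `predict(fg_clicks, bg_clicks)` callable that wraps the network plus tracer-specific binarization, and SciPy for connected-component analysis; names and convergence handling are illustrative:

```python
import numpy as np
from scipy import ndimage

def largest_component_center(mask):
    """Centroid (voxel coords) of the largest connected component, or None."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    largest = int(np.argmax(sizes)) + 1
    return tuple(int(round(c)) for c in ndimage.center_of_mass(labels == largest))

def refine(predict, gt, max_clicks=10):
    """predict(fg_clicks, bg_clicks) -> binary prediction volume."""
    fg, bg = [], []
    pred = predict(fg, bg)
    for _ in range(max_clicks):
        fn = gt & ~pred                    # missed lesion voxels
        fp = pred & ~gt                    # spurious voxels
        c_fn = largest_component_center(fn)
        c_fp = largest_component_center(fp)
        if c_fn is None and c_fp is None:
            break                          # converged: no FN/FP regions left
        if c_fn is not None:
            fg.append(c_fn)                # foreground click on largest FN component
        if c_fp is not None:
            bg.append(c_fp)                # background click on largest FP component
        pred = predict(fg, bg)             # re-run with updated click channels
    return pred
```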

Natural Language in CXR

Instructions are processed as templated prompts ("Segment the [Target] in the [Location].", etc.) (Choi et al., 19 Nov 2025). The VLM processes $(x, T)$; cross-modal fusion is achieved by using the [SEG] token embedding as the prompt for the mask decoder. Automated pipelines generate valid (Instruction, Answer) pairs encompassing presence/absence, region-level granularity, and lesion-inference variants.
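For illustration, a trivial templating helper in the spirit of the quoted pattern (the exact MIMIC-ILS template strings may differ):

```python
def build_instruction(target: str, location: str | None = None) -> str:
    """Fill the 'Segment the [Target] in the [Location].' template."""
    if location:
        return f"Segment the {target} in the {location}."
    return f"Segment the {target}."

build_instruction("pneumonia", "right lung base")
# -> "Segment the pneumonia in the right lung base."
```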

4. Automated Dataset Construction

MIMIC-ILS for Chest X-ray

  1. Source: MIMIC-CXR (192K PA/AP frontal images)
  2. Supervision: Structured reports parsed by LLMs (Mistral-Small-3.1-Instruct, medgemma-27B) yield tuples: (entity, present, certainty, normalized location, lesion type).
  3. Candidate lesion regions: Derived via RadEdit anomaly maps, YOLO box proposals, and CXAS anatomy segmentations.
  4. Mask generation: Algorithmic gating via intersection-over-union, YOLO box confidence, anomaly scores, and size thresholds (e.g., $\tau_{\text{anatomy}} = 0.25$, $\tau_{\text{conf}} = 0.2$, $\tau_{\text{signal}} = 0.2$, $\tau_{\text{size}} = 0.10$ in the general case), as sketched below (Choi et al., 19 Nov 2025).
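A hedged sketch of the gating in step 4: a candidate region is accepted only if it clears every threshold. Argument names, the conjunction of all gates, and the direction of the size criterion are assumptions made for illustration.

```python
def accept_candidate(anatomy_iou, box_conf, anomaly_score, size_frac,
                     tau_anatomy=0.25, tau_conf=0.2,
                     tau_signal=0.2, tau_size=0.10):
    """Return True if a candidate lesion region passes all gates."""
    return (anatomy_iou >= tau_anatomy       # overlap with CXAS anatomy masks
            and box_conf >= tau_conf         # YOLO proposal confidence
            and anomaly_score >= tau_signal  # RadEdit anomaly-map signal
            and size_frac >= tau_size)       # relative-size gate (direction assumed)
```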

Approximately 1.1M instruction–answer pairs are generated, encompassing seven major lesion types. The dataset enables both mask and text evaluation, with a test set of 12K physician-verified pairs and an expert acceptance rate exceeding 96%.

PET/CT Preparation

Preprocessing includes robust normalization (CT: percentile clipping and scaling; PET, clicks: z-scoring), plus nnU-Net–standard augmentations. Tracer classification (FDG vs. PSMA) is performed using two ResNet-50s on coronal/sagittal MIPs, with 100% accuracy on the training set (Huang et al., 2 Sep 2025). Both unified and tracer-specific models are trained, but the unified model with organ supervision offers superior multi-center robustness.
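A minimal sketch of the normalization step described above; the percentile bounds and epsilon are illustrative assumptions rather than the paper's exact values:

```python
import numpy as np

def normalize_ct(ct, lo_pct=0.5, hi_pct=99.5):
    """Percentile clipping followed by rescaling to [0, 1]."""
    lo, hi = np.percentile(ct, [lo_pct, hi_pct])
    return (np.clip(ct, lo, hi) - lo) / (hi - lo + 1e-8)

def zscore(x):
    """Zero-mean, unit-variance normalization (used for PET and click channels)."""
    return (x - x.mean()) / (x.std() + 1e-8)
```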

5. Quantitative Evaluation and Comparative Performance

For PET/CT ILS (Huang et al., 2 Sep 2025):

  • Metrics: Dice coefficient, false positive volume (FPV), false negative volume (FNV).
  • PSMA: Dice improves from 0.62 (0 clicks) to 0.87 (10 clicks); FPV reduces from ~1.0 cc to ~0.33 cc; FNV shrinks from ~5.6 cc to ~1.6 cc.
  • FDG: Dice increases from 0.74 to 0.89 at 10 clicks; FNV improves from ~8.9 cc to ~1.6 cc.
  • Guidance reduces FNV by >80% at 10 clicks and adds 0.15–0.25 Dice; organ supervision alone adds ~0.05 Dice and reduces FPV by ~15%.
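The reported metrics admit a simple voxelwise sketch; note that challenge-style FPV/FNV definitions may restrict counting to connected components that do not overlap the reference, whereas the version below is a simplified voxel count converted to cc:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice coefficient between two binary (boolean) volumes."""
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

def fpv_fnv_cc(pred, gt, voxel_volume_mm3):
    """False-positive and false-negative volumes in cc (1 cc = 1000 mm^3)."""
    fpv = np.logical_and(pred, ~gt).sum() * voxel_volume_mm3 / 1000.0
    fnv = np.logical_and(~pred, gt).sum() * voxel_volume_mm3 / 1000.0
    return fpv, fnv
```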

For CXR ILS (Choi et al., 19 Nov 2025), using the MIMIC-ILS test set:

  • ROSALIA model achieves gIoU 71.2, cIoU 75.6, and N-Acc 91.8%.
  • Per-lesion example (gIoU, cIoU, N-Acc): Cardiomegaly (89.0, 89.0, 85.8%), Pneumonia (57.2, 60.4, 97.1%), Edema (64.8, 66.6, 92.2%).
  • Textual answer exact match: Basic prompts 96.8%, Global 88.8%, Lesion Inference 84.8%.

Baseline comparisons show ROSALIA substantially outperforms LISA-7B, Text4Seg, PixelLM-13B, BiomedParse, RecLMIS, and IMIS-Net in both segmentation and textual metrics.

Model          gIoU   cIoU   N-Acc
LISA-7B         8.3   12.8    0.7%
Text4Seg        6.1   10.3   20.6%
PixelLM-13B    12.8   15.4    0%
BiomedParse    23.8   18.5    0.6%
RecLMIS        22.4   19.5    0%
IMIS-Net        9.8   11.8   21.6%
ROSALIA        71.2   75.6   91.8%

6. Empirical Analysis and Model Ablations

Qualitative evaluation demonstrates ILS models segment only the requested lesion class per prompt and can produce contextually independent masks for each entity present in the image (Choi et al., 19 Nov 2025). In PET/CT, each added foreground/background click incrementally improves the alignment of the predicted contour with expert annotation, often tightening boundaries around small-volume or sub-centimeter metastases (Huang et al., 2 Sep 2025).

Ablation on PET/CT reveals:

  • Organ supervision suppresses over-segmentation (e.g., in liver/bladder, reducing FPV by 20% and maintaining zero-guidance Dice).
  • Dense click curriculum alone causes collapse at 0–2 clicks; stochastic sampling ensures smooth Dice–click response.
  • Organ head provides additive gains, especially in low-interaction regimes.

For CXR, limitations are noted in lesion-inference ambiguity (opacity-only interpretations yield ~75% accuracy), dependence on pre-trained anomaly detectors and YOLO proposals, and current restriction to seven lesion types. Textual outputs are rigidly templated; richer clinical expressivity is a stated future objective.

7. Potential Extensions and Future Directions

Future directions identified include:

  • Incorporation of additional metadata (radiographic view, patient demographics) for enhanced lesion-type inference (Choi et al., 19 Nov 2025).
  • Unification of segmentation and detection via end-to-end trained diffusion models.
  • Expansion of ILS frameworks to other imaging modalities (CT, MRI) and multi-view series.
  • Leveraging advanced multimodal LLMs for richer, context-aware explanations.
  • Prospective real-world clinical evaluation to support broad workflow integration.

Instruction-Guided Lesion Segmentation thus constitutes a modular paradigm for controllable, high-fidelity image understanding, blending deep interactive learning, anatomical grounding, and flexible multimodal prompting (Huang et al., 2 Sep 2025, Choi et al., 19 Nov 2025).
