Medical Promptable Concept Segmentation

Updated 26 November 2025

Medical PCS is a paradigm in medical AI that segments anatomical or pathological regions using flexible prompts such as text, clicks, or boxes, reducing the need for extensive annotations.
It employs diverse architectures including dual-classifier methods, explicit prompt encoding, and vision-language models to achieve precise, on-demand mask generation.
PCS enables rapid adaptation to new tasks and interactive refinement, enhancing clinical workflows by minimizing retraining while efficiently handling complex segmentation challenges.

Medical Promptable Concept Segmentation (PCS) is a paradigm in medical AI that enables segmentation of arbitrary anatomical or pathological concepts by conditioning the segmentation process on flexible prompts (textual, visual, or other modalities) at inference time. PCS frameworks are designed to generalize to new tasks and entities with minimal or no retraining, reducing expert annotation requirements and allowing precise, on-demand mask generation aligned with expert intent or downstream clinical needs.

1. Problem Formulation and Defining Objectives

The core objective of Medical PCS is to segment medical images conditioned on user, expert, or system prompts that specify a target concept. Prompts may be spatial (a point, bounding box, or doodle mark in the image) or semantic (a class label or free-form natural language description). In the formalism of (Karam et al., 23 May 2025), given an image $I$ and a prompt $P$ (visual or textual), the PCS system predicts a mask $S = f(I, P; \theta)$, where $\theta$ denotes the model parameters, which in the case of frozen foundation models include the learned prompt or adapter weights.

PCS is motivated by several clinical and operational considerations:

Minimizing dependence on fully-labeled datasets by leveraging weak or minimal annotations (e.g., only a handful of pixel-level segmentations or biopsy results).
Enabling task flexibility, such that adding new anatomical regions, pathologies, or expert-defined variant boundaries does not require network retraining.
Supporting interactive workflows where segmentation is refined according to user intent, or where downstream queries (e.g., differential diagnosis, surgical planning) require ad hoc mask generation.

2. Framework Architectures and Prompt-Conditioning Mechanisms

PCS research explores architectures and conditioning mechanisms to realize the desired flexibility and generalization. Several paradigms are prominent:

A. Dual-Classifier Prompt-Guided Search

One approach combines a weakly-supervised classifier (e.g., trained on binary presence/absence labels from histology) with a fully-supervised classifier (trained on a small number of expertly segmented cases), each parameterizing a "concept" that can be queried on local patches. A single point prompt seeds a region-of-interest (ROI) search: crops $x_c$ along a predetermined search manifold (e.g., a 3D spiral) are scored by the convex combination $S_t = \alpha f(x_c;\theta^*) + (1-\alpha)g(x_c;\phi^*)$, where $\alpha \in [0, 1]$ weights the two classifiers. Positive crops (those with $S_t > \tau$ for a decision threshold $\tau$) are fused to generate the overall mask, as in (Karam et al., 23 May 2025). This approach uses fewer than 40 curated training cases and is reported to outperform promptable baselines, matching fully-supervised U-Nets with $100\times$ fewer labeled examples.
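The scoring and fusion step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the classifier callables `f` and `g` (which here take a crop center rather than the crop itself), the cubic crop footprint, the `half_size` parameter, and the helper names are all assumptions, and the search manifold is represented simply as a precomputed list of crop centers.

```python
from typing import Callable, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]  # (z, y, x) index; hypothetical convention


def combined_score(f_score: float, g_score: float, alpha: float) -> float:
    """Convex combination S_t = alpha * f + (1 - alpha) * g."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * f_score + (1.0 - alpha) * g_score


def fuse_positive_crops(
    centers: Iterable[Voxel],
    f: Callable[[Voxel], float],
    g: Callable[[Voxel], float],
    alpha: float,
    tau: float,
    half_size: int = 1,
) -> Set[Voxel]:
    """Union the voxels of every crop whose combined score exceeds tau."""
    mask: Set[Voxel] = set()
    for cz, cy, cx in centers:
        score = combined_score(f((cz, cy, cx)), g((cz, cy, cx)), alpha)
        if score <= tau:
            continue  # negative crop: contributes nothing to the mask
        # Assumed cubic crop footprint; no clipping to volume bounds here.
        for z in range(cz - half_size, cz + half_size + 1):
            for y in range(cy - half_size, cy + half_size + 1):
                for x in range(cx - half_size, cx + half_size + 1):
                    mask.add((z, y, x))
    return mask
```

Representing the mask as a set of voxel indices keeps the fusion step a plain union; a real pipeline would more likely write into a dense array and clip crops to the volume bounds.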

B. Explicit Prompt-Encoding in CNNs/U-Nets

Traditional segmentation networks are adapted for PCS by injecting prompt representations into the network. In interactive models, positive and negative clicks are rasterized into spatial maps (binary disks, Euclidean Distance Transforms (EDT), or Gaussian kernels) and concatenated as extra channels to the input, as in (Rokuss et al., 29 Aug 2025), or encoded through dedicated prompt branches with attention-based fusion, as in PE-MED (Chang et al., 2023). Prompt attention modules and iterative refinement pipelines further improve responsiveness to user input and stability across interaction rounds.
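The three click rasterizations mentioned above can be illustrated on a small 2D grid. The sketch below is an assumption-laden toy, not code from the cited works: the function names, the default `radius` and `sigma`, the brute-force distance computation, and the choice to truncate the distance map at the image diagonal when no click exists are all illustrative.

```python
import math
from typing import List, Sequence, Tuple

Click = Tuple[int, int]  # (row, col)
Grid = List[List[float]]


def encode_clicks(
    shape: Tuple[int, int],
    clicks: Sequence[Click],
    mode: str = "edt",
    radius: float = 2.0,
    sigma: float = 2.0,
) -> Grid:
    """Rasterize clicks into one channel: 'disk', 'edt', or 'gaussian'."""
    if mode not in ("disk", "edt", "gaussian"):
        raise ValueError(f"unknown mode: {mode}")
    rows, cols = shape
    cap = math.hypot(rows, cols)  # assumed truncation value for the EDT map
    grid: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Distance to the nearest click (infinite when there are none).
            d = min(
                (math.hypot(r - cr, c - cc) for cr, cc in clicks),
                default=math.inf,
            )
            if mode == "disk":
                row.append(1.0 if d <= radius else 0.0)
            elif mode == "edt":
                row.append(min(d, cap))
            else:  # gaussian bump centered on the nearest click
                row.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
        grid.append(row)
    return grid


def build_input(
    image: Grid,
    positive: Sequence[Click],
    negative: Sequence[Click],
    mode: str = "edt",
) -> List[Grid]:
    """Stack [image, positive-click map, negative-click map] as channels."""
    shape = (len(image), len(image[0]))
    return [
        image,
        encode_clicks(shape, positive, mode),
        encode_clicks(shape, negative, mode),
    ]
```

Note the difference in polarity: disk and Gaussian maps peak at the click, whereas the distance map is zero at the click and grows away from it, so every pixel carries a non-trivial signal.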

C. Vision-Language and Multimodal Foundation Models

PCS leverages large pre-trained vision-language networks for open-vocabulary or text-prompt segmentation. MedSAM-3 (Liu et al., 24 Nov 2025) extends SAM-3 by swapping geometric prompt pathways for a text encoder and cross-attention-based decoder. Inputs include medical images and text prompts (e.g., "lung infection"), enabling pixel-accurate masks for diverse concepts without retraining for each class.

Motifs include:

Adapter-based dual-prompt networks (e.g., CAT (Huang et al., 2024), MVP (Chen et al., 2024)): anatomical exemplars (visual prompts) and textual class descriptions are fused via transformers or MLPs at all decoder stages, with hard/soft prompt assignment based on prompt modality.
Zero-shot/few-shot interaction: Specialist-generalist frameworks (SemiSAM+ (Zhang et al., 28 Feb 2025)) use promptable foundation models as frozen generalists to generate pseudo-labels for task-specific specialists, orchestrated by prompts derived from (possibly coarse) predictions.
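To make "prompts derived from (possibly coarse) predictions" concrete, the sketch below turns a coarse binary mask into a box prompt and a point prompt. It is a hypothetical illustration only: the function name, the returned dictionary layout, and the centroid-snapping rule are assumptions and do not describe the actual SemiSAM+ procedure.

```python
from typing import Dict, List, Optional, Tuple

Mask = List[List[int]]  # 2D binary mask, 1 = foreground


def prompts_from_coarse_mask(mask: Mask) -> Optional[Dict[str, Tuple]]:
    """Derive a box prompt and a point prompt from a coarse binary mask."""
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        return None  # nothing predicted, so there is no prompt to issue
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    box = (min(rows), min(cols), max(rows), max(cols))  # inclusive corners
    centroid = (sum(rows) / len(rows), sum(cols) / len(cols))
    # Snap to the nearest foreground pixel so the point lies inside the mask
    # even when the predicted region is non-convex.
    point = min(
        coords,
        key=lambda rc: (rc[0] - centroid[0]) ** 2 + (rc[1] - centroid[1]) ** 2,
    )
    return {"box": box, "point": point}
```

The resulting prompts could then be passed to a frozen promptable model, whose output serves as a pseudo-label for the specialist.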

D. Patch- or Token-Level Prompting

Some architectures learn prompt tokens that condition every stage of a frozen backbone, as in PUNet (Fischer et al., 2022). A separate prompt set is learned per class or task, injected at each shifted-window transformer block, and aggregated via cosine similarity in the final output head. This allows parameter-efficient adaptation (≈1% of parameters for new classes) and privacy-preserving personalization.
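A cosine-similarity output head of this kind can be sketched as follows. The sketch is a simplified assumption rather than the PUNet design: it uses one embedding vector per class and assigns each pixel feature to the class with the highest cosine similarity; the names `cosine` and `cosine_head` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def cosine_head(
    features: Sequence[Sequence[float]],
    class_prompts: Dict[str, Sequence[float]],
) -> List[str]:
    """Label each pixel feature with its most similar class prompt."""
    labels = []
    for feat in features:
        scores = {name: cosine(feat, p) for name, p in class_prompts.items()}
        labels.append(max(scores, key=scores.get))
    return labels
```

Under this simplification, supporting a new class only requires adding one entry to `class_prompts`, which mirrors why prompt-based adaptation touches so few parameters while the backbone stays frozen.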

E. Free-Form Natural Language and Flexible Pathology

Recent PCS models generalize the interface by parsing free-text prompts with compact language models or text encoders (e.g., TinyLlama, CLIP) or with LLM planners (MedSAM-3 Agent) that orchestrate complex, open-ended interactions (Liu et al., 24 Nov 2025, Cui et al., 2024). The design supports queries such as "Where is the tumor in the upper right quadrant?" or "Segment all nuclei outside the glomerular tuft," enabling both segmentation and spatial reasoning (Trinh et al., 17 May 2025, Cui et al., 2024).

3. Prompt Representation: Types, Encoding, and Conditioning

PCS systems differ substantially in prompt types and encodings.

Prompt Type	Common Representation	Example Models
Point/Sparse Click	Binary disk, EDT, Gaussian map	(Rokuss et al., 29 Aug 2025, Chang et al., 2023)
Doodle/Scribble	Rasterized "doodle" channel	(Zami et al., 1 Jul 2025)
Bounding Box	Binary/normalized mask, coordinate tensor	(Danielou et al., 10 Jul 2025, Liu et al., 24 Nov 2025)
Text / Semantic	CLIP/BERT-based text embeddings	(Liu et al., 24 Nov 2025, Huang et al., 2024)
Anatomical Exemplars	3D cropped volume, Swin encoder	(Huang et al., 2024)

Encoding strategies vary:

EDT encoding of clicks yields robust gradient signals and outperforms Gaussian encoding in interactive tasks (Rokuss et al., 29 Aug 2025).
In visual prompting (MVP), superpixel features and patch embeddings fused through adapters allow flexible localization of lesions with shape priors (Chen et al., 2024).
In hybrid multi-modal systems, prompt tokens are injected per transformer block, concept IDs are embedded via learnable tables, or natural language is processed with fine-tuned LLMs or BERT-style encoders (PFPs (Cui et al., 2024)).

Textual prompts are pooled over tokens and projected for cross-attention in multi-modal decoders (MedSAM-3), or further mapped to latent code vectors controlling mask style (e.g., inclusive or conservative margins, as in ProSona (Elgebaly et al., 11 Nov 2025)).
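The pool-then-project step can be sketched with plain lists. Several choices here are assumptions made for illustration and are not taken from the cited models: mean pooling as the pooling operator, a single linear layer as the projection, and the attention direction (the projected text vector acting as a query over image-token keys).

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average token embeddings into one prompt vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def project(
    vec: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Vector:
    """Linear projection y = W x + b, with W given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vec)) + b
        for row, b in zip(weight, bias)
    ]


def attention_weights(
    query: Sequence[float], keys: Sequence[Sequence[float]]
) -> Vector:
    """Scaled dot-product attention weights of one query over the keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With these pieces, image tokens whose features align with the projected text vector receive higher attention weight, which is the intuition behind text-conditioned mask decoding.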

4. Training Data Regimes and Annotation Efficiency

PCS aims to dramatically reduce the requirements for expert-annotated segmentation masks:

Weakly-supervised and minimal training setups (as in (Karam et al., 23 May 2025)) demonstrate that reliable concept classifiers for PCS can be trained with as few as 24 fully-segmented images (consensus, pixel-perfect) plus 8 MR volumes labeled only with binary histology, compared to the hundreds or thousands traditionally required.
Foundation and semi-supervised models (e.g., SemiSAM+) leverage prompt-driven pseudo-labels from generalist models, narrowing the performance gap in extremely low-shot settings and attaining 49% Dice on left atrium segmentation with only a single labeled case (Zhang et al., 28 Feb 2025).
In language-centric frameworks, text prompts (rather than new mask annotations) define new semantic concepts or task variants with near-zero labeling cost (Liu et al., 24 Nov 2025, Cui et al., 2024).

Some models accept flexible, compositional tasks (e.g., "segment union/intersection of structures," "nucleus inside/outside region at point"), with ground-truths automatically synthesized by logical operations on base masks (Cui et al., 2024).
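Synthesizing ground truth for compositional tasks reduces to elementwise logic on the base masks. The following is a minimal sketch with hypothetical helper names (`union`, `intersection`, `outside`), using nested lists of 0/1 values as an assumed mask representation.

```python
from typing import Callable, List

Mask = List[List[int]]  # 2D binary mask, 1 = foreground


def _combine(a: Mask, b: Mask, op: Callable[[bool, bool], bool]) -> Mask:
    """Apply a boolean operator elementwise to two same-shaped masks."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    return [
        [int(op(bool(x), bool(y))) for x, y in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def union(a: Mask, b: Mask) -> Mask:
    """Target for 'segment A or B'."""
    return _combine(a, b, lambda x, y: x or y)


def intersection(a: Mask, b: Mask) -> Mask:
    """Target for 'segment A inside B'."""
    return _combine(a, b, lambda x, y: x and y)


def outside(a: Mask, region: Mask) -> Mask:
    """Target for 'segment A outside region'."""
    return _combine(a, region, lambda x, y: x and not y)
```

For example, a target for "nuclei outside the region" would be `outside(nuclei_mask, region_mask)`, with no additional manual annotation required.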

5. Quantitative Performance and Comparative Analyses

PCS frameworks are benchmarked using Dice similarity, Intersection-over-Union (IoU), and task-specific metrics (e.g., false positive/negative volumes, Generalized Energy Distance for multi-rater studies). Representative outcomes include:

Model / Dataset	Dice (Mean)	Annotation Regime	Notable Points
PCS (24+8 labels) (Karam et al., 23 May 2025)	0.3085 ± 0.14	24 expert seg + 8 weak	100× fewer labels; non-inferior to U-Net (0.3275, 200 labels)
nnU-Net (CT organs)	up to 0.89	Full supervision	Gold standard; not promptable
Prompt2SegCXR (Zami et al., 1 Jul 2025)	0.816 (all)	5.5k prompt–image–mask	23 organ/disease classes; outperforms SAM by 6–10% Dice
CAT (multi-organ) (Huang et al., 2024)	0.868	10 CT datasets	Visual+textual prompt synergy, robust to rare organs/tumors
PFPs (free-text prompt) (Cui et al., 2024)	0.65–0.78	Multi-task pathology	Generalizes to unseen tasks/classes, LLM variability
MedSAM-3 T+I (Liu et al., 24 Nov 2025)	0.77–0.90	15 open datasets	Text+box best; agentic loop improves difficult cases
SemiSAM+ (MRI/CT) (Zhang et al., 28 Feb 2025)	0.48–0.73	1–5 labeled, rest unlabeled	Strongest in low-supervision regimes
PRISM (3D, plain/ultra) (Li et al., 2024)	0.67–0.94	1 pt, boxes, scribbles	Surpasses automatic and SAM-like 3D baselines

Performance is sensitive to prompt encoding (EDT preferred over Gaussian for clicks (Rokuss et al., 29 Aug 2025)), prompt type (box+text > text alone (Liu et al., 24 Nov 2025)), and the diversity of training data and prompt types supporting the conditional channels.
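For reference, the two overlap metrics used throughout this section can be computed as follows. This is a minimal sketch on nested-list binary masks; returning 1.0 when both masks are empty is a common convention adopted here as an assumption, and individual benchmarks may handle that case differently.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 2D binary mask, 1 = foreground


def _counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Return (|P and T|, |P|, |T|) for two same-sized binary masks."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for a, b in zip(p, t) if a and b)
    return inter, sum(p), sum(t)


def dice(pred: Mask, target: Mask) -> float:
    """Dice = 2 |P and T| / (|P| + |T|)."""
    inter, n_pred, n_true = _counts(pred, target)
    denom = n_pred + n_true
    return 1.0 if denom == 0 else 2.0 * inter / denom


def iou(pred: Mask, target: Mask) -> float:
    """IoU = |P and T| / |P or T|."""
    inter, n_pred, n_true = _counts(pred, target)
    union = n_pred + n_true - inter
    return 1.0 if union == 0 else inter / union
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so rankings by one carry over to the other for a single mask pair, although dataset-level averages can still differ.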

6. Generalization, Clinical Deployment, and Future Directions

PCS frameworks are inherently suited for:

Extension to new anatomical structures, pathologies, or imaging modalities when a small set of well-curated annotations is feasible (Karam et al., 23 May 2025).
Interactive, agentic, or natural-language-guided workflows in radiology or pathology, where clinicians drive segmentation through clicks, text, or composite queries (Liu et al., 24 Nov 2025, Trinh et al., 17 May 2025).
Settings with high inter-observer variability, supporting expert personalization via prompt-conditioned mask generation (e.g., ProSona's natural-language-driven style interpolation for multi-rater delineation (Elgebaly et al., 11 Nov 2025)).
Efficient deployment in resource-constrained environments, since minimal annotation requirements and parameter-efficient prompt tuning (e.g., ≈0.8% of parameters in PUNet (Fischer et al., 2022)) avoid retraining a full model for every new task.

Anticipated advances include:

Support for arbitrarily complex prompt types (multi-turn language, bounding box + text, image context).
Stronger domain adaptation and robustness to out-of-distribution (OOD) prompts and rare entities.
Incorporation of richer spatial reasoning, causal relationships, and clinician-in-the-loop refinement (PRS-Med (Trinh et al., 17 May 2025)).
Fused pipelines for sequence/video segmentation, linking object tracking and promptable segmentation (TPP (Yuan et al., 16 Feb 2025)).

7. Limitations, Challenges, and Outlook

PCS architectures, while promising, face several open challenges:

Performance for small, low-contrast, or ambiguous lesions, particularly when prompts are imprecise.
Sensitivity to prompt modality (text only, doodle, box), quality, and encoding details.
Generalization to unseen or out-of-distribution anatomy, which depends on backbone pretraining and prompt diversity.
Quantitative gains may saturate beyond a minimal prompt-token/adapter capacity (e.g., see PUNet ablations (Fischer et al., 2022)).
The explicit handling of multi-expert disagreement, conceptual ambiguity, or compound prompt requests remains an active area (Elgebaly et al., 11 Nov 2025).

PCS research is advancing toward unified, multimodal and highly adaptive segmentation solutions, bringing flexible, efficient, and personalized AI assistance closer to routine clinical workflows. The field is distinguished by techniques that blend local- and global-context encoding, parameter-efficient adaptation, and hybrid prompt conditioning, validated across a growing array of organs, pathologies, and imaging modalities. Emerging agentic and foundation model–driven pipelines are poised to further reduce annotation costs and unlock new forms of clinician–system collaboration and explainability.