One-Prompt Segmentation Overview

Updated 7 January 2026

One-prompt segmentation is a visual paradigm that converts a single prompt (text, image, or cue) into detailed pixel-wise masks without iterative user input.
It leverages advanced architectures like SAM and GroundingDINO to achieve high accuracy in both natural and specialized domains such as medical imaging and remote sensing.
Techniques such as prompt engineering, cross-attentional fusion, and cycle-consistency enable robust, efficient segmentation even in complex visual scenarios.

One-prompt segmentation refers to a family of visual segmentation paradigms and algorithms that produce meaningful pixel-wise object masks in response to a single, user-formulated prompt. The prompt may take various forms—including natural language, sparse spatial cues, image exemplars, or generic task descriptions—but crucially, the segmentation system is expected to select, delineate, and label regions in complex images from that single query, dispensing with the need for multiple manual hints, repetitive template-labelling, or class-specific fine-tuning. One-prompt segmentation has become increasingly prominent with the availability of foundation models such as Segment Anything Model (SAM), GroundingDINO, and vision-language transformers, all of which fundamentally alter the prompt–segmentation interaction paradigm in both natural and domain-specific imagery (Ahmad, 10 Sep 2025, Wu et al., 2023, Tang et al., 2023).

1. Problem Formulation and Paradigm

The core principle of one-prompt segmentation is the transformation of a single prompt (textual, visual, or other) into a set of segmentation masks for an image or batch of images, with minimal user interaction from end to end. The field encompasses several sub-paradigms:

Text-prompted segmentation: A natural-language phrase (e.g., "red car" or "segment the liver") is supplied, and the model returns the relevant binary or panoptic mask(s) (Ahmad, 10 Sep 2025, Lüddecke et al., 2021).
Image-prompted segmentation (one-shot): A single support image (optionally with mask or sparse cue) specifies the target concept; the model then segments corresponding objects in all test images (Wu et al., 2023, Tang et al., 2023).
Task-generic promptable segmentation: A single generic instruction (e.g., "camouflaged animal") is provided for a batch; the system infers individual segmentation masks across diverse images via intermediate prompt inference (Hu et al., 2024).

Formally, one can characterize the mapping as $(I, P) \rightarrow M$, where $I$ is a target image, $P$ is the prompt (text, image, or sparse cue), and $M$ is the output mask (or masks). Architectural and algorithmic instantiations vary, but the defining constraint is single-step, single-query segmentation with high data and user efficiency.

2. Architectures and Workflow Variants

Text-driven Pipelines

Systems such as the Locate–Segment–Inpaint–Describe (LSID) pipeline (Ahmad, 10 Sep 2025) bridge natural language prompts and dense masks via a multi-stage, lifecycle-aware design:

Detection: GroundingDINO ingests the text prompt $p_\mathrm{edit}$, predicting bounding boxes $\{b_i\}$ with confidence $s_i$, filtered by detection and alignment thresholds ($\tau_\mathrm{det}$, $\tau_\mathrm{txt}$).
Segmentation: Each detected box $b_i$ is refined by SAM to a binary mask $M_i$, binarized at $t_\mathrm{bin}=0.5$, then combined and morphologically refined.
Intermediate artifact retention: All stages produce and store artifacts (detections, binary masks, overlays) for transparency and reproducibility.
Operational best practices: Includes threshold tuning and reproducibility controls (seed, version pinning, CLI & UI equivalence).

For the one-prompt “locate→segment” task, the effective algorithmic loop is:

function LSID_oneprompt_segmentation(I, p_edit, config):
    # Detection with open-vocab detector
    ...
    # Segmentation with mask model
    ...
    # Merge, morphological post-processing, artifact persistence
    ...
    return refined_mask

Typical configuration:

$\tau_\mathrm{det} = 0.5$, $\tau_\mathrm{txt} = 0.35$, $t_\mathrm{bin} = 0.5$, $k = 3$ (morphological kernel size).
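
The filtering, binarization, merging, and morphological steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the LSID implementation: the `Detection` container and all function names are hypothetical, the per-pixel mask probabilities stand in for the output of a mask model such as SAM, thresholds are applied with `>=`, and morphological closing is assumed as the refinement because the source does not say which operation is used.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detection and its mask probabilities."""
    det_score: float  # detector confidence s_i
    txt_score: float  # text-alignment score
    mask_prob: list   # H x W per-pixel probabilities (stand-in for mask model output)


def filter_detections(dets, tau_det=0.5, tau_txt=0.35):
    """Keep detections that pass both the detection and text thresholds."""
    return [d for d in dets if d.det_score >= tau_det and d.txt_score >= tau_txt]


def binarize(prob, t_bin=0.5):
    """Threshold a probability map into a 0/1 mask."""
    return [[1 if p >= t_bin else 0 for p in row] for row in prob]


def union(masks):
    """Combine several binary masks of equal size with a pixel-wise OR."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[int(any(m[y][x] for m in masks)) for x in range(w)] for y in range(h)]


def _window_op(mask, k, op):
    """Apply op (max = dilation, min = erosion) over a k x k window.

    Pixels outside the image are ignored rather than padded.
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    return [[op(mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]


def close_mask(mask, k=3):
    """Morphological closing (dilation, then erosion) to fill small holes."""
    return _window_op(_window_op(mask, k, max), k, min)


def refine(dets, tau_det=0.5, tau_txt=0.35, t_bin=0.5, k=3):
    """Filter detections, binarize and merge their masks, then close the result."""
    kept = filter_detections(dets, tau_det, tau_txt)
    if not kept:
        return None  # nothing matched the prompt
    merged = union([binarize(d.mask_prob, t_bin) for d in kept])
    return close_mask(merged, k)
```

Returning `None` when no detection survives the thresholds keeps "nothing found" distinct from an empty mask, which is one reasonable design choice among several.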

Visual and Image-based Prompting

Several systems eschew text and rely on a visual template or "reference exemplar" as the single prompt:

In the One-Prompt Model for universal medical segmentation (Wu et al., 2023), both query and template images are encoded; the single template, annotated with a sparse prompt (click, box, doodle, mask), guides a cross-attentional transformer (the One-Prompt-Former) that merges query and prompt features into dense logits—no further fine-tuning or user action required.
Training-free approaches (e.g., IPSeg (Tang et al., 2023), Segment Using Just One Example (Vora et al., 2024), GBMSeg (Liu et al., 2024), OP-SAM (Mao et al., 22 Jul 2025)) exploit image-level feature matching or patch-wise cross-correlation (via DINOv2, EfficientNet, or ResNet50), initializing SAM or similar mask models by automatically generating positive and negative prompts through feature similarity, clustering, and cycle-consistency—frequently achieving high one-shot performance without any offline retraining.
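
As a rough illustration of how feature similarity can turn a single annotated reference into point prompts, the sketch below averages the reference patch features inside the mask into a prototype and ranks query patches by cosine similarity to it. It is a generic sketch, not the procedure of any cited system: patch features are assumed to be precomputed (for instance by a backbone such as DINOv2), patches are indexed as a flat list, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def masked_prototype(ref_feats, ref_mask):
    """Average the reference patch features that fall inside the reference mask."""
    inside = [f for f, m in zip(ref_feats, ref_mask) if m]
    if not inside:
        raise ValueError("reference mask selects no patches")
    dim = len(inside[0])
    return [sum(f[i] for f in inside) / len(inside) for i in range(dim)]


def point_prompts(query_feats, prototype, n_pos=1, n_neg=1):
    """Pick the most similar patches as positive prompts, the least similar as negative."""
    sims = [cosine(f, prototype) for f in query_feats]
    order = sorted(range(len(sims)), key=sims.__getitem__)  # ascending similarity
    negatives = order[:n_neg]
    positives = order[::-1][:n_pos]
    return positives, negatives, sims
```

The returned patch indices would still need to be converted to pixel coordinates before being passed to a mask model as positive and negative points.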

Task-generic and Multimodal Prompt Cycles

Advanced designs employ a single generic prompt for a collection of related images and iteratively refine instance-specific hints in a closed loop:

ProMaC (Hu et al., 2024) uses an MLLM to expand a task-generic description (e.g., “camouflaged animal”) into plausible per-image region proposals ("hallucinations"), then reduces and aligns these with mask confidence cycles—using prompt–mask–prompt bootstrapping to convergently focus the segmentation process.

3. Algorithmic Techniques for Prompt-to-Mask Mapping

Key innovations in one-prompt segmentation include:

Open-vocabulary grounding: Text encoders in open-vocabulary detectors (e.g., GroundingDINO (Ahmad, 10 Sep 2025)) align arbitrary phrases with image regions via transformer attention.
Prompt engineering: Automated derivation of point, mask, or box prompts from a single reference mask via feature affinity (cosine similarity, patch-wise correlation), spatial pruning (cycle-consistency, exclusion, sparsification), and morphological processing (Liu et al., 2024, Mao et al., 22 Jul 2025).
Cross-attentional fusion: Simultaneous processing of support and query (or prompt and target) in a shared embedding space, often via MLP, MLP+Gaussian masking, or cascaded self- and cross-attention modules (Wu et al., 2023).
Ensemble and cycle-based strategies: Test-time averaging of outputs across random prompt instantiations or iterative self-updating of prompts and masks (warping, self-training, or hallucination refinement) (Yoon et al., 2024, Hu et al., 2024).
Meta-learning architectures: Episodically trained UNet-based models that directly learn a mapping from prompt–query pairs to change (anomaly) masks without reliance on vision-language bridges (Gao, 14 May 2025).
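
One common reading of cycle-consistency pruning is a forward-backward match: a reference patch inside the mask is matched to its nearest query patch, and the match is kept only if that query patch maps back into the reference mask. The sketch below assumes precomputed patch features in flat lists and uses hypothetical function names. It illustrates the general idea rather than the exact rule used by any of the cited methods.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nearest(feat, bank):
    """Index of the feature in bank most similar to feat."""
    return max(range(len(bank)), key=lambda j: cosine(feat, bank[j]))


def cycle_consistent_matches(ref_feats, ref_mask, query_feats):
    """Return query patch indices whose backward match lands inside the reference mask."""
    keep = set()
    for i, inside in enumerate(ref_mask):
        if not inside:
            continue
        j = nearest(ref_feats[i], query_feats)     # forward: reference -> query
        back = nearest(query_feats[j], ref_feats)  # backward: query -> reference
        if ref_mask[back]:
            keep.add(j)
    return sorted(keep)
```

Matches that fail the round trip are dropped, which tends to remove prompts that would otherwise land on background regions resembling the target.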

4. Quantitative Performance and Benchmarks

One-prompt segmentation methods achieve competitive or state-of-the-art results across many domains:

General-purpose images: The LSID pipeline (Ahmad, 10 Sep 2025) yields over 90% "usable masks" (IoU≥0.80) with single-word prompts, with manual inspection confirming above 85% accuracy in a trial set.
Medical image segmentation: One-Prompt Model (Wu et al., 2023) outperforms interactive and few-shot methods on 14 unseen datasets (+10.7% Dice over best SAM approach; 64.0% average Dice vs. 77.2% supervised upper bound). OP-SAM (Mao et al., 22 Jul 2025) attains 76.93% IoU (+11.44% vs. best prior) in one-shot polyp segmentation across five datasets; Med-PerSAM (Yoon et al., 2024) achieves Dice scores up to 92.0% on chest X-ray segmentation.
Anomaly and change detection: MetaUAS (Gao, 14 May 2025) achieves pixel-wise average precision (P-PR) of 59.3% (MVTec), outperforming other one-shot and zero-shot methods despite using only a single prompt image and no language.
Satellite/remote sensing: Segment Using Just One Example (Vora et al., 2024) reports IoU of 0.6930 for buildings (exceeding supervised U-Net baseline, 0.3214) using confidence-weighted ensemble on automatically prompted SAM pipelines.
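
Since the results above mix IoU and Dice, a short sketch of both metrics on flattened binary masks may help when comparing figures. The two are related by Dice = 2·IoU / (1 + IoU), so they are not interchangeable across rows. The function names are illustrative, and scoring two empty masks as 1.0 is a convention assumed here, not something taken from the cited papers.

```python
def iou(pred, target):
    """Intersection over union of two flat 0/1 masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # assumed convention: both empty -> 1.0


def dice(pred, target):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2 * inter / total if total else 1.0


def is_usable(pred, target, threshold=0.80):
    """'Usable mask' criterion in the sense of an IoU threshold of 0.80."""
    return iou(pred, target) >= threshold
```

For example, an IoU of 0.80 corresponds to a Dice of about 0.89, so a Dice score reads higher than the IoU for the same mask pair.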

Typical ablations confirm that prompt mechanism details (prompt type, clustering thresholds, ensemble strategy) and backbone feature choice (DINOv2+SD for IPSeg (Tang et al., 2023)) are critical for robustness. Automatic prompt 'mutation' cycles and spatial/semantic prompt fusion (SSPrompt (Huang et al., 2024)) provide further mIoU gains of up to 16 points over default methods.

5. Applications, Limitations, and Operational Best Practices

Applications

One-prompt segmentation demonstrates efficacy in:

Interactive content editing (LSID (Ahmad, 10 Sep 2025)): end-to-end detection, mask extraction, inpainting, and description from a single user text prompt.
Universal, domain-agnostic medical and industrial segmentation: rapid transfer to unseen classes or modalities with no per-sample annotation (Wu et al., 2023, Liu et al., 2024, Gao, 14 May 2025).
Open-world segmentation: zero-shot adaptation to unseen semantic categories or visual tasks with high reliability (Tang et al., 2023, Lüddecke et al., 2021).
Anomaly/change segmentation: pairing of normal–abnormal exemplars to flag novel defects in manufacturing and defect inspection (Gao, 14 May 2025).

Limitations

All methods depend strongly on prompt quality: weak or ambiguous prompts (e.g., generic language, highly atypical reference images) degrade mask accuracy. Ensemble strategies mitigate but do not eliminate this issue (Wu et al., 2023).
Small objects, heavy occlusion, or strong cross-domain shifts represent persistent challenges—cycle-consistency, multi-scale priors, and spatial/semantic fusion help but outliers persist (Vora et al., 2024, Liu et al., 2024).
Resource constraints: inpainting-heavy pipelines (e.g., LSID (Ahmad, 10 Sep 2025)) spend 60–75% of total runtime in diffusion steps, so tuning the diffusion stage for efficiency is critical.
Some approaches (MetaUAS, GBMSeg) assume prompt–query alignment; large geometric variability or misaligned scales can degrade performance (Gao, 14 May 2025, Liu et al., 2024).

Best Practices

Consistent version pinning and deterministic seed control are necessary for reproducibility (Ahmad, 10 Sep 2025).
Artifact persistence at each pipeline stage (e.g., annotated overlays, detection JSONs) enhances transparency and debugging.
UI/CLI parity ensures consistency between programmatic and user-driven segmentation experiments.
In medical domains, automated warping and region proposal (e.g., Med-PerSAM (Yoon et al., 2024)) reduces human labeling demand to one mask per dataset; downstream uses still require analyst validation.
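
The reproducibility and artifact-persistence practices above can be sketched with the Python standard library alone. The manifest fields, file names, and function names below are assumptions made for illustration, not part of any cited pipeline. A real run would also seed whichever numerical frameworks it uses, which is omitted here.

```python
import hashlib
import json
import random
import sys
from pathlib import Path


def build_manifest(config, seed, versions):
    """Collect what is needed to repeat a run: config, seed, and pinned versions."""
    return {
        "config": config,
        "seed": seed,
        "python": sys.version.split()[0],
        "versions": versions,  # e.g. pinned package versions, supplied by the caller
    }


def persist_artifact(out_dir, stage, payload):
    """Write one stage's output as JSON and return its path and SHA-256 digest."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = json.dumps(payload, sort_keys=True, indent=2)  # stable key order
    path = out_dir / f"{stage}.json"
    path.write_text(text, encoding="utf-8")
    return path, hashlib.sha256(text.encode("utf-8")).hexdigest()


def start_run(out_dir, config, seed, versions):
    """Seed the stdlib RNG and store the manifest before any stage executes."""
    random.seed(seed)
    return persist_artifact(out_dir, "manifest", build_manifest(config, seed, versions))
```

Comparing digests of the same stage across two runs gives a cheap check that a CLI run and a UI run produced identical intermediate artifacts.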

6. Representative Systems, Comparative Analysis, and Future Directions

A cross-section of prominent systems and their distinguishing properties is summarized below:

Approach	Prompt Type	Architecture/Key Mechanism	Training-free	Domain	mIoU/Dice (best)
LSID (Ahmad, 10 Sep 2025)	Text	GroundingDINO→SAM→Diffusion→LLaVA	✗	Natural images	IoU≥0.8 (>90%)
One-Prompt (Wu et al., 2023)	Image+Sparse	Joint ViT/CNN encoders + transformer decoder	✗	Medical images	64–85% Dice
IPSeg (Tang et al., 2023)	Image	DINOv2+SD for features, clustering, train-free	✓	Open world	43.0 (COCO-20i)
ProMaC (Hu et al., 2024)	Text (generic)	MLLM hallucination→VCR→SAM+CLIP (iterative)	✓	OVS, COD, MIS	VOC 59.3 (mIoU)
OP-SAM (Mao et al., 22 Jul 2025)	Image+Mask	Multi-scale prior+Euclidean prompt iteration	✓	Medical (polyp)	76.9% IoU
SSPrompt (Huang et al., 2024)	Spatial+Sem	Prompt learning in embedding space (frozen SAM)	✗	Urban/scene	+16 mIoU over SEEM-T
MetaUAS (Gao, 14 May 2025)	Image (pair)	Meta-learned change detection, soft alignment	✗	Anomaly bench.	59.3 P-PR

Future research is trending toward richer multimodal prompts (combining text, images, sparse points, boxes), adaptive prompt refinement, multi-prompt or self-prompting pipelines, and extension to domains with extreme distributional or geometric variability (e.g., volumetric, temporal, or multispectral data). The integration of feature-prompting, cycle-consistency, automated warping, and task-generic reasoning mechanisms is enabling substantial progress, though the field continues to evolve rapidly in both algorithmic sophistication and practical reliability.

In sum, one-prompt segmentation represents a pivotal operational shift for foundation segmentation models—uniting the low annotation cost and accessibility of promptable AI with the reliability, reproducibility, and high accuracy demands of real-world segmentation workflows across natural, medical, industrial, and scientific domains (Ahmad, 10 Sep 2025, Wu et al., 2023, Tang et al., 2023, Gao, 14 May 2025).