One-Prompt Segmentation Overview
- One-prompt segmentation is a technique that uses a single input (text, image, or embedding) to direct image segmentation, unifying zero-shot, one-shot, and open-vocabulary tasks.
- It leverages shared embedding spaces from large pre-trained models to enable flexible, prompt-conditioned dense prediction without extensive retraining.
- Applications in domains like medical imaging and earth observation demonstrate high accuracy and reduced annotation effort, with methods achieving metrics such as 87.27% DSC and 76.93% IoU.
One-prompt segmentation refers to a set of techniques in which a single input prompt, given as text, an image, or a unified embedding, specifies the segmentation target for a model at inference time. This design contrasts with classical segmentation pipelines that require category-specific training, multi-shot support sets, or per-instance conditioning, and it is typically realized with promptable foundation models or purpose-built architectural modules. The one-prompt regime unifies tasks such as zero-shot, one-shot, and open-vocabulary segmentation by leveraging the shared embedding spaces of large pre-trained models, providing task flexibility without retraining and substantially reducing annotation and adaptation effort.
1. Principles and Unified Paradigms
One-prompt segmentation reimagines segmentation as a prompt-conditioned dense prediction problem, wherein a single prompt vector determines the object or region of interest within a query image. In CLIPSeg (Lüddecke et al., 2021), for instance, the model leverages CLIP’s joint image–text embedding space: a text prompt (e.g., “cat on the mat”) or a visual support image is mapped to a conditional vector, enabling the decoder to segment the described object in the query image. The same architecture can interpolate between modalities (hybrid prompts) by linearly mixing the conditional embeddings, i.e., s = α·s_text + (1 − α)·s_image with α ∈ [0, 1] for the text and image embeddings.
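A minimal sketch of this hybrid-prompt interpolation, assuming pre-computed CLIP-style text and image embeddings; the 512-dimensional size and the normalization step are illustrative assumptions, not CLIPSeg's exact recipe:

```python
import torch

def hybrid_prompt(text_emb: torch.Tensor,
                  image_emb: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Linearly interpolate normalized text and image prompt embeddings.

    alpha = 1.0 reproduces pure text prompting, alpha = 0.0 pure visual prompting.
    """
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return alpha * text_emb + (1.0 - alpha) * image_emb

# Dummy 512-d embeddings stand in for real CLIP encoder outputs.
t = torch.randn(1, 512)                  # e.g., embedding of "cat on the mat"
v = torch.randn(1, 512)                  # e.g., embedding of a visual support image
cond = hybrid_prompt(t, v, alpha=0.3)    # conditional vector fed to the decoder
```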
One-prompt models extend to various problem settings, such as:
- Referring expression segmentation: The system segments objects specified by natural language.
- Zero-shot segmentation: The system segments unseen classes prompted purely via descriptors or support images.
- One-shot segmentation: A single support image and its mask, either engineered or automatically derived, specify the target, with adaptation typically realized via attention or explicit cross-similarity (a minimal sketch follows this list).
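The cross-similarity route can be illustrated with a generic masked-average-pooling prototype match between support and query features; this is a common realization of the idea, not the exact mechanism of any one cited model, and the feature shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def cross_similarity_map(support_feat: torch.Tensor,   # (C, H, W) backbone features
                         support_mask: torch.Tensor,   # (H, W) binary target mask
                         query_feat: torch.Tensor      # (C, H, W) backbone features
                         ) -> torch.Tensor:
    """Cosine similarity between the mask-pooled support prototype and every
    query location; thresholding or decoding this map yields the query mask."""
    # Masked average pooling: one prototype vector for the prompted object.
    proto = (support_feat * support_mask).sum(dim=(1, 2)) / support_mask.sum().clamp(min=1)
    proto = F.normalize(proto, dim=0)
    query = F.normalize(query_feat, dim=0)
    return torch.einsum("c,chw->hw", proto, query)

# Dummy features and mask for illustration only.
sim = cross_similarity_map(torch.randn(256, 32, 32),
                           (torch.rand(32, 32) > 0.7).float(),
                           torch.randn(256, 32, 32))
```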
2. Architectures and Prompt Conditioning Strategies
Several architectural patterns have emerged for one-prompt segmentation:
| Model Class | Core Prompt Modality | Decoding/Conditioning Mechanism |
|---|---|---|
| CLIPSeg | Text/image, hybrid | FiLM-modulated transformer decoder |
| One-Prompt Model | Visual + spatial prompt | Prompt-former blocks with cross-attention/fusion |
| Med-PerSAM | Warped mask + points + box | Prompted SAM; iterative prompt refinement |
| PromptMatcher | Text and visual | Parallel branches, union output with mask verification |
| MetaUAS | Visual prompt (normal image) | Feature alignment + meta-learned segmentation |
A critical thread is the embedding of both the query and prompt into a shared space, followed by a conditioning mechanism (e.g., modulation layers, cross-attention, or direct concatenation). The output is a dense segmentation map, typically binary, but easily extended to multi-label cases.
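As a concrete instance of such conditioning, a FiLM-style modulation layer scales and shifts decoder feature maps per channel using the prompt embedding; the sketch below is minimal and its dimensions are illustrative:

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Modulate decoder feature maps with a prompt embedding (FiLM-style),
    the conditioning mechanism listed for CLIPSeg in the table above."""

    def __init__(self, prompt_dim: int, feat_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(prompt_dim, feat_channels)
        self.to_shift = nn.Linear(prompt_dim, feat_channels)

    def forward(self, feats: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) decoder features, prompt: (B, prompt_dim)
        gamma = self.to_scale(prompt)[:, :, None, None]   # per-channel scale
        beta = self.to_shift(prompt)[:, :, None, None]    # per-channel shift
        return gamma * feats + beta

film = FiLMConditioning(prompt_dim=512, feat_channels=64)
conditioned = film(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```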
3. Prompt Engineering and Automatic Prompt Generation
Prompt quality directly affects segmentation accuracy. Visual prompts can be constructed or refined in several ways:
- Manual prompt engineering: Cropping, background suppression, or blurring (as in CLIPSeg (Lüddecke et al., 2021)) highlights an object.
- Automatic prompt generation: Approaches like GBMSeg (Liu et al., 24 Jun 2024) and OP-SAM (Mao et al., 22 Jul 2025) use reference-mask-driven correlation (cross-image patch similarity) and adaptive sampling (e.g., forward/backward matching, exclusive/sparse/hard sampling, or distance transform centers) to extract positive and negative prompt points without user intervention (a simplified sketch follows this list).
- Hybrid/iterative prompting: Models such as OP-SAM further refine prompt selection by iteratively evaluating segmentation outcomes, adding prompts only where mask coverage or boundary accuracy falls below thresholds.
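A simplified sketch of reference-mask-driven prompt-point extraction: foreground and background prototypes pooled from the reference are correlated with target-image patch features, and the most foreground- and background-like locations become positive and negative point prompts for a SAM-like model. This illustrates the general idea only, not the exact sampling schedule of GBMSeg or OP-SAM:

```python
import torch
import torch.nn.functional as F

def prompt_points_from_reference(ref_feat: torch.Tensor,   # (C, H, W) reference features
                                 ref_mask: torch.Tensor,   # (H, W) reference binary mask
                                 tgt_feat: torch.Tensor,   # (C, H, W) target features
                                 n_pos: int = 3, n_neg: int = 3):
    """Select positive/negative point prompts in the target image by correlating
    its patch features with mask-pooled foreground/background prototypes."""
    def prototype(mask: torch.Tensor) -> torch.Tensor:
        pooled = (ref_feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)
        return F.normalize(pooled, dim=0)

    fg, bg = prototype(ref_mask), prototype(1 - ref_mask)
    tgt = F.normalize(tgt_feat, dim=0)
    score = torch.einsum("c,chw->hw", fg, tgt) - torch.einsum("c,chw->hw", bg, tgt)

    h, w = score.shape
    flat = score.flatten()
    pos_idx = flat.topk(n_pos).indices            # most foreground-like patches
    neg_idx = (-flat).topk(n_neg).indices         # most background-like patches
    to_xy = lambda idx: torch.stack([idx % w, idx // w], dim=1)
    return to_xy(pos_idx), to_xy(neg_idx)         # (x, y) point prompts for SAM
```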
For text-based prompts, prompt selection may draw on domain-specific language priors or exploit pre-trained language-vision encoders, but hybrid modes (PromptMatcher (Avogaro et al., 25 Mar 2025)) demonstrate that text and visual cues are complementary—combining both with a learning-free mask verification can outperform single-modality prompting.
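The complementarity of text and visual prompting can be exploited with a deliberately simple, learning-free fusion rule such as the one below; this is only an illustrative scheme, not PromptMatcher's actual verification procedure, and the per-branch scores are assumed to come from some external confidence measure:

```python
import numpy as np

def verify_and_fuse(mask_text: np.ndarray, mask_visual: np.ndarray,
                    score_text: float, score_visual: float,
                    agree_thresh: float = 0.5) -> np.ndarray:
    """Fuse masks from a text-prompt branch and a visual-prompt branch:
    take the union when the branches agree, otherwise keep the mask whose
    verification score is higher."""
    inter = np.logical_and(mask_text, mask_visual).sum()
    union = np.logical_or(mask_text, mask_visual).sum()
    iou = inter / union if union else 0.0
    if iou >= agree_thresh:                       # branches agree: merge them
        return np.logical_or(mask_text, mask_visual)
    return mask_text if score_text >= score_visual else mask_visual
```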
4. Applications Across Domains
One-prompt segmentation has demonstrated utility and been adapted across a range of domains:
- Medical Imaging: “One-Prompt to Segment All Medical Images” (Wu et al., 2023) trains a single model on multi-site, multi-modality medical datasets, enabling clinicians to segment new structures with only one template image and associated prompt. Med-PerSAM (Yoon et al., 25 Nov 2024) uses a warping-based strategy to align and refine visual prompts across X-ray, ultrasound, and CT images, eliminating the need for physician-crafted prompts or retraining.
- Earth Observation: “Segment Using Just One Example” (Vora et al., 14 Aug 2024) proposes a training-free approach that stitches a single key example and an unknown image, automatically sampling and aggregating prompts to transfer the concept of “building” or “car” with no supervision.
- Surgical/industrial scenes: CycleSAM (Murali et al., 9 Jul 2024) uses spatial cycle-consistency with a single annotated instance to generate reliable prompts in unseen domains, significantly reducing annotation effort for high-value, domain-specific tasks.
- Open-vocabulary and anomaly segmentation: PromptMatcher (Avogaro et al., 25 Mar 2025) and MetaUAS (Gao, 14 May 2025) extend pretrained VLMs or pure-vision models to cope with class-agnostic or change-based segmentation, requiring only a single prompt (text or visual) for adaptation, with soft feature alignment mitigating geometric mismatch.
5. Performance, Trade-offs, and Empirical Insights
Empirical studies highlight both the strengths and limitations of one-prompt segmentation.
- CLIPSeg (Lüddecke et al., 2021) demonstrates competitive performance in referring, zero-shot, and affordance segmentation tasks, enabled by dense, prompt-modulated decoding from robust joint embedding spaces.
- Automatic prompt engineering, as in GBMSeg and OP-SAM, achieves high accuracy (e.g., 87.27% DSC for GBMSeg (Liu et al., 24 Jun 2024); 76.93% IoU for OP-SAM (Mao et al., 22 Jul 2025) on the Kvasir dataset; both metrics are defined in the sketch after this list) without retraining or manual prompt creation.
- Hybrid prompt orchestration (PromptMatcher (Avogaro et al., 25 Mar 2025)) yields an 11% performance improvement by combining the strengths of text and visual modalities and employing lightweight mask verification.
- Training-free paradigms such as IPSeg (Tang et al., 2023) and MetaUAS (Gao, 14 May 2025) achieve high generalization as well as computational efficiency, though specialist models can still outperform VLMs by 30% IoU on domain-specific tasks.
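For reference, the DSC and IoU figures quoted above are standard overlap measures between a predicted binary mask and the ground truth; a minimal computation follows (standard definitions, not specific to any cited method):

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Dice similarity coefficient (DSC) and intersection-over-union (IoU)
    for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    iou = inter / (np.logical_or(pred, gt).sum() + 1e-8)
    return dsc, iou
```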
Trade-offs include:
- Prompt ambiguity and clustering (addressed through iterative or confidence-weighted prompt selection).
- Generalization versus domain adaptation (domain-specific feature backbones can improve reliability, as in CycleSAM).
- Resource consumption and scaling (the inpainting or editing stage, not the segmentation, is often the computational bottleneck in full LSID pipelines (Ahmad, 10 Sep 2025)).
6. Practical Integration and Debugging
One-prompt segmentation, as illustrated with LSID pipelines (Ahmad, 10 Sep 2025), benefits from robust engineering practices:
- Threshold calibration: Adjusting detection and text-alignment thresholds balances mask recall against precision in object-rich or ambiguous scenes.
- Morphological refinement: Post-processing steps like dilation or closing refine mask boundaries and address discontinuities caused by prompt inaccuracy or leakage (a sketch follows this list).
- Transparent artifacts: Persisting and reviewing intermediate artifacts at each stage (detection, mask overlay, mask selection) is critical for debugging and reproducibility.
- UI and CLI parity: Unified interfaces ensure consistent behavior and support batch processing or interactive analysis with the same configuration and output artifacts.
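A minimal post-processing sketch covering the threshold-calibration and morphological-refinement steps above, using SciPy; the threshold, iteration count, and minimum-area values are illustrative defaults, not the settings of any cited pipeline:

```python
import numpy as np
from scipy import ndimage

def refine_mask(prob_map: np.ndarray, threshold: float = 0.5,
                close_iters: int = 2, min_area: int = 64) -> np.ndarray:
    """Threshold a soft segmentation map, close small gaps morphologically,
    and drop tiny spurious components."""
    mask = prob_map >= threshold                        # calibration knob
    mask = ndimage.binary_closing(mask, iterations=close_iters)
    labels, n = ndimage.label(mask)                     # connected components
    sizes = ndimage.sum(mask, labels, index=list(range(1, n + 1)))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_area]
    return np.isin(labels, keep)
```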
7. Future Directions and Limitations
Current limitations and future research opportunities include:
- Bias and robustness: Reliance on foundation model pretraining and language-image data (e.g., CLIP) can embed dataset biases, necessitating bias mitigation and out-of-distribution evaluation.
- Temporal/video extension: While most methods address still images, extending one-prompt segmentation to temporal sequences requires handling appearance changes and temporal consistency (as discussed in CLIPSeg (Lüddecke et al., 2021)).
- Automatic multi-prompt or sequence selection: Methods such as TPS in Sequence Prompt Transformer (SPT) (Cheng et al., 13 Dec 2024) and the cycle-prompting in PE-MED (Chang et al., 2023) and ProMaC (Hu et al., 27 Aug 2024) suggest the benefit of leveraging cross-image context and iterative refinement for more robust outcomes.
- Generalization versus specialist performance: Specialist models trained for a specific segmentation domain still outperform one-prompt and VLM-based baselines, especially in challenging, out-of-distribution contexts (Avogaro et al., 25 Mar 2025).
One-prompt segmentation thus represents a convergence of large-scale pretraining, promptable architectures, and advanced engineering—in which a single, well-chosen prompt can flexibly drive robust, domain-adaptive segmentation. The field is poised for further advances in automated prompt construction, hybrid modality fusion, and context-aware adaptation across modalities and tasks.