Open-Vocabulary Segmentation Overview

Updated 31 May 2026

Open-vocabulary segmentation is a paradigm that assigns pixel or region-level semantic labels from an unbounded label set, enabling recognition of previously unseen categories.
Research leverages both training-free methods using multi-modal foundation models and end-to-end architectures to overcome the lack of dense mask-level annotations.
Evaluation metrics like Open mIoU and Open PQ are evolving to better capture the performance of models in real-world, open-world segmentation tasks.

Open-vocabulary segmentation tasks require assigning pixel-level or region-level semantic labels from an unbounded label set, including arbitrary or previously unseen categories specified in natural language. Unlike standard closed-set segmentation, where models operate on a fixed class taxonomy, open-vocabulary approaches aim to generalize beyond the training categories. This paradigm is central to advancing recognition in real-world, open-world environments, and it leverages large-scale vision-language models, multi-modal pretraining, and scalable, task-independent foundation models. Research in this domain encompasses semantic, instance, panoptic, part, and video segmentation, unified by a focus on zero-shot generalization and compositionality.

1. Core Concepts and Problem Formulation

An open-vocabulary segmentation system receives an image (or video) together with a set of class labels specified at inference time (user-provided prompts), or it discovers salient classes autonomously. It outputs a pixel-level semantic map or a set of region masks, each matched to a label from a large vocabulary $\mathcal{V}$. The function can be formalized as

$f: I \times \mathcal{C}^* \longrightarrow [0,1]^{H\times W\times |\mathcal{C}^*|}$

where $I$ is the input image of spatial size $H\times W$ and $\mathcal{C}^*\subset\mathcal{V}$ is the effective label set for that input. The class set may be user-selected, data-dependent, or even fully auto-generated (Ülger et al., 2023, Kombol et al., 28 May 2025). The system must robustly handle any $\mathcal{C}^*$, including out-of-distribution concepts and fine-grained or compositional entities.
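If a single label per pixel is needed, one common reading of this output (an assumption here; the formulation above does not fix a decoding rule) is to take the highest-scoring label at each location:

$\hat{y}_{h,w} = \arg\max_{c \in \mathcal{C}^*} f(I, \mathcal{C}^*)_{h,w,c}, \qquad 1 \le h \le H,\ 1 \le w \le W$

Here $\hat{y}$ is a hypothetical symbol for the predicted label map. Its spatial size stays $H\times W$ regardless of how many labels $\mathcal{C}^*$ contains; only the range of values it can take changes with the label set.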

A major challenge is the lack of dense mask-level annotation for most visual concepts. Recent models therefore rely heavily on pre-trained vision-language models (VLMs) or visual foundation models (VFMs), cross-modal feature alignment, and synthetic data curation pipelines to bridge the gap between seen and unseen categories.

2. Methodological Taxonomy

Open-vocabulary segmentation encompasses diverse approaches, classifiable along two main dimensions: (A) the use of supervision/fine-tuning vs. training-free adaptation, and (B) architectural integration of vision-language, proposal, or generative modules.

2.1 Training-Free Methods and Foundation Model Adaptation

A dominant research line leverages zero-shot, training-free segmentation using multi-modal foundation models such as CLIP, DINO, or Stable Diffusion (Kombol et al., 28 May 2025, Li et al., 2024, Karazija et al., 2023). Approaches fall into several archetypes:

Patch-wise CLIP Matching: Extract CLIP image patch embeddings, compute cosine similarity to text embeddings of class names, and assign class per pixel [MaskCLIP, (Kombol et al., 28 May 2025)]. Variants incorporate windowed inference, self-correlative attention [SCLIP], or correlation pooling [DBA-CLIP].
CLIP+Auxiliary Mask Proposals: Use a segmentation proposal generator such as SAM, DINO, or CutLER to obtain region proposals. Aggregate CLIP features within each mask, and classify regions by similarity to text prompts [USE, DBA-CLIP, ProxyCLIP, (Wang et al., 2024, Kombol et al., 28 May 2025)].
Generative Model-Based Prototyping: Synthesize per-class support sets using text-to-image diffusion models. Extract pixel features from generated images, cluster into foreground/background prototypes, and classify real-image pixels by nearest-prototype assignment (OVDiff) (Karazija et al., 2023).
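The first two archetypes reduce to the same operation: matching an image feature to the text embedding of each class name. The following is a minimal sketch in plain Python, assuming toy 3-dimensional vectors in place of real CLIP embeddings and hand-written binary masks in place of SAM or CutLER proposals. All function names are hypothetical and are not taken from any of the cited methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(feature, text_embeds):
    """Return the class name whose text embedding is most similar to `feature`."""
    return max(text_embeds, key=lambda name: cosine(feature, text_embeds[name]))

def patchwise_labels(patch_feats, text_embeds):
    """Patch-wise matching: label every patch independently."""
    return [[classify(f, text_embeds) for f in row] for row in patch_feats]

def mask_pool(patch_feats, mask):
    """Average the patch features that fall inside a binary mask."""
    inside = [f for row_f, row_m in zip(patch_feats, mask)
              for f, m in zip(row_f, row_m) if m]
    dim = len(inside[0])
    return [sum(f[d] for f in inside) / len(inside) for d in range(dim)]

def region_labels(patch_feats, masks, text_embeds):
    """Mask-proposal variant: pool features per mask, then classify each region."""
    return [classify(mask_pool(patch_feats, m), text_embeds) for m in masks]

# Toy stand-ins (assumptions): a 2x2 grid of patch features and two class prompts.
text_embeds = {"cat": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0]}
patch_feats = [
    [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1]],
]
masks = [
    [[1, 1], [0, 0]],  # proposal covering the top row
    [[0, 0], [1, 1]],  # proposal covering the bottom row
]

patch_map = patchwise_labels(patch_feats, text_embeds)
region_map = region_labels(patch_feats, masks, text_embeds)
print(patch_map)
print(region_map)
```

Because `text_embeds` is an ordinary dictionary built at call time, the label set can change between images without retraining, which is the property that makes these methods open-vocabulary. A nearest-prototype classifier as in the third archetype has the same overall shape, with prototype vectors in place of text embeddings; the similarity measure used there is not specified in this overview.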

2.2 End-to-End and Jointly-Finetuned Architectures

Fine-tuned models extend Mask2Former, MaskDINO, or DETR-style frameworks with open-vocabulary capabilities. Salient directions include:

Cross-Modal Encoder-Decoder Frameworks: Integrate frozen or partially trainable text encoders for class embeddings; employ transformers for joint vision-text decoding. Example: OpenSeeD jointly trains on detection and segmentation data with decoupled mask heads, CLIP contrastive alignment, and conditioned mask generation (Zhang et al., 2023).
Attribute and Hierarchical Reasoning: AttrSeg decomposes user-supplied or LLM-generated class names into attribute fragments (color, shape, part descriptors), then aggregates attribute embeddings via clustering and attention, notably improving discrimination for ambiguous names and neologisms (Ma et al., 2023).
Hierarchical Multi-level Models: HIPIE introduces a dual-decoder (thing/stuff) framework with hierarchical supervision, supporting flexible semantic/instance/part segmentation depending on the text prompt. Decoupled text-image fusion is applied for things and not for stuff due to class embedding similarities (Wang et al., 2023).
Adapter-Augmented Segmentation: GBA introduces frequency-domain style diversification and cross-modal correlation adapters into the CLIP backbone, mitigating overfitting and reinforcing pixel-wise linguistic alignment (Xu et al., 2024).
Auto-vocabulary and Promptless Segmentation: AutoSeg clusters BLIP patch embeddings, captions regions, extracts an auto-vocabulary, and segments with an OVS backbone. An LLM-based mapping tool (LAVE) aligns predictions to traditional benchmarks (Ülger et al., 2023).

2.3 Task Specialization

Recent work expands open-vocabulary segmentation to specialized settings:

Part Segmentation: OV-PARTS benchmarks part-level segmentation with strong analogical and cross-dataset reasoning requirements, adapting both proposal+CLIP-classification and end-to-end models (Wei et al., 2023).
Audio-Visual and Video Segmentation: OV-AVSS introduces open-vocabulary semantic segmentation into audio-visual domains, employing sound localization for proposal generation and CLIP-based open-set classification (Guo et al., 2024). OV2VSS extends to video semantic segmentation with spatial-temporal and video-text fusion modules (Li et al., 2024). Open-Vocabulary Video Instance Segmentation (OVVIS) incorporates memory-induced tracking for per-instance segmentation and CLIP-based classification (Wang et al., 2023).
Remote Sensing: SegEarth-OV adapts CLIP for high-resolution, training-free segmentation in remote sensing imagery, emphasizing upsampling and global bias mitigation (Li et al., 2024).
Difficult Scenes: OVCOS tackles camouflaged object segmentation by augmenting CLIP with iterative semantic and structural guidance and validates on a new open-vocabulary camo benchmark (Pang et al., 2023).

3. Training Objectives, Optimization, and Data Pipelines

A hallmark of recent models is modularity and the strategic use of frozen or lightly tuned foundation models.

Contrastive and Mask Segmentation Losses: Many models utilize a combination of cross-entropy, dice, and contrastive losses between image, mask, and text embeddings (Zhang et al., 2023, Wang et al., 2024, Xu et al., 2023). Joint training enables alignment for both box-level and pixel-level annotations (OpenSeeD), or alignment between group/slot tokens and text entities (OVSegmentor).
Attribute, Caption, and Hierarchy Mining: Attribute-driven methods (AttrSeg) derive training data from LLM-generated attribute fragments per class. Pipeline-based models like USE construct massive segment–text training pairs via grounded captions, referring expression localization, and mask generation (Grounding DINO + SAM) (Wang et al., 2024). AutoSeg combines BLIP captioning with region clustering to build per-image auto-vocabularies (Ülger et al., 2023).
Decoupled/Conditioned Decoding: Segmentation-detection models leverage conditional mask decoders to create pseudo-labels from detection datasets with only box-level supervision, enabling bootstrapping on large-scale detection corpora and improving zero-shot transfer (Zhang et al., 2023, Li et al., 2023).
Adapter Training and Architectural Modifications: GBA incorporates adapters (convolutions + Dirichlet-style frequency mixing + cross-attention) into frozen CLIP for feature diversification (Xu et al., 2024). SegEarth-OV modifies the CLIP transformer’s final attention and applies joint bilateral upsampling as a training-free upsampler (Li et al., 2024).
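The combination of cross-entropy, dice, and contrastive terms mentioned above can be illustrated for a single predicted mask. This is a sketch under stated assumptions, not the objective of any cited model: the loss weights, the temperature of 0.07, the smoothing constant, and all function names are hypothetical, and the contrastive term is written as a softmax over mask-to-text dot products.

```python
import math

def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy between per-pixel probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, target, smooth=1.0):
    """One minus the smoothed dice coefficient (overlap-based term)."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def contrastive_loss(mask_embed, text_embeds, positive, temperature=0.07):
    """Softmax cross-entropy over mask-to-text dot products; `positive` is the true index."""
    logits = [sum(a * b for a, b in zip(mask_embed, t)) / temperature
              for t in text_embeds]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[positive]

def combined_loss(probs, target, mask_embed, text_embeds, positive,
                  w_ce=1.0, w_dice=1.0, w_con=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_ce * bce_loss(probs, target)
            + w_dice * dice_loss(probs, target)
            + w_con * contrastive_loss(mask_embed, text_embeds, positive))

# Toy example: a 4-pixel mask and two candidate class embeddings.
target = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]   # close to the target mask
bad = [0.2, 0.1, 0.9, 0.8]    # roughly inverted
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
mask_embed = [0.9, 0.1]       # aligned with class 0

loss_good = combined_loss(good, target, mask_embed, text_embeds, positive=0)
loss_bad = combined_loss(bad, target, mask_embed, text_embeds, positive=1)
print(loss_good < loss_bad)
```

The mask terms (cross-entropy and dice) supervise where a region is, while the contrastive term supervises which text embedding the region should align with, so the two can be trained from different annotation sources.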

4. Evaluation Metrics and Benchmarking

Standard closed-set metrics (mIoU, AP, PQ) are ill-suited for open-vocabulary segmentation, as they penalize all mismatches equally and ignore semantic similarity.

Open Metrics: New proposals include Open mIoU, Open AP, and Open PQ, which assign partial credit to predictions based on word/label similarity, assessed via WordNet path similarity or linguistic embeddings (Zhou et al., 2023). This approach addresses over-penalization of semantically coherent predictions and provides a more faithful measure of open-world capabilities.
Benchmark Datasets and Tasks:
- Semantic segmentation: ADE20K, COCO, Pascal VOC/Context, Cityscapes, PartImageNet, OpenEarthMap.
- Panoptic and instance: ADE20K, COCO, ODinW, SeginW.
- Video: VSPW, YouTube-VIS, LV-VIS.
- Audio-visual: AVSBench-OV.
- Camouflage: OVCamo.
- Remote sensing: 17 datasets used by SegEarth-OV (e.g., OpenEarthMap, LoveDA, iSAID).

Summary Table: Representative Results

Model/Method	Domain	Reported Metric (mIoU / PQ / hIoU)	Zero-shot/OV Capable	Unique Features
OpenSeeD (Zhang et al., 2023)	Panoptic/Inst	PQ: 20.3–21.8 (ADE20K)	Yes	Joint seg/det, decoupled heads
AutoSeg (Ülger et al., 2023)	Semantic	mIoU: 87.1 (VOC), 29.2 (ADE20K)	Yes	Auto-vocabulary, LLM evaluator
OVDiff (Karazija et al., 2023)	Semantic	mIoU: 67.1 (VOC), 34.8 (COCO)	Yes	Diffusion prototypes, no training
GBA (Xu et al., 2024)	Semantic/Pan	PQ: 57.5 (COCO), 29.6 (ADE20K)	Yes	SDA+CCA adapters in CLIP
USE (Wang et al., 2024)	Region/Part	mIoU: 37.0 (ADE-150)	Yes	CLIP+DINO fusion, massive text-seg pairs
AttrSeg (Ma et al., 2023)	Semantic	mIoU: 56.4 (VOC-5i)	Yes	Attribute decomposition, aggregation
OVSegmentor (Xu et al., 2023)	Semantic	mIoU: 53.8 (VOC), 25.1 (COCO)	Yes	No mask ann., slot attention
SegEarth-OV (Li et al., 2024)	Remote Sensing	mIoU: 39.2 (OpenEarthMap)	Yes	Joint bilateral upsampling, CLIP bias correction
OV-PARTS (Wei et al., 2023)	Part	hIoU: 27.4–35.0	Yes	Fine-coarse granularity, analogical transfer
OV2VSS (Li et al., 2024)	Video	mIoU: 17.99 (VSPW), 27.65 (Cityscapes)	Yes	Space-time fusion, video-text encoding
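The partial-credit idea behind the Open metrics described in this section can be illustrated with a similarity-weighted IoU. This is a simplified sketch and not the definition used by Zhou et al.: the weighting scheme, the function names, and the similarity values in the table are all hypothetical, and a real evaluation would obtain similarities from WordNet paths or text embeddings.

```python
def soft_miou(pred, gt, sim):
    """Similarity-weighted mean IoU over the classes present in `gt`.

    pred, gt: flat lists of label strings, one per pixel.
    sim(a, b): similarity in [0, 1], with sim(a, a) == 1.
    """
    ious = []
    for c in sorted(set(gt)):
        tp = fp = fn = 0.0
        for p, g in zip(pred, gt):
            if g == c:
                s = sim(p, c)
                tp += s          # partial credit for a semantically close label
                fn += 1.0 - s    # the remainder still counts as a miss
            elif p == c:
                fp += 1.0 - sim(c, g)  # false positives are discounted the same way
        ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)

def exact(a, b):
    """Closed-set behaviour: credit only for identical labels."""
    return 1.0 if a == b else 0.0

# Hypothetical similarity table (values invented for illustration).
TABLE = {("couch", "sofa"): 0.9, ("cat", "dog"): 0.2}

def table_sim(a, b):
    if a == b:
        return 1.0
    return TABLE.get((a, b), TABLE.get((b, a), 0.0))

gt = ["sofa", "sofa", "cat", "cat"]
pred = ["couch", "sofa", "cat", "dog"]

strict = soft_miou(pred, gt, exact)
relaxed = soft_miou(pred, gt, table_sim)
print(strict, relaxed)
```

With `exact` as the similarity function the score reduces to ordinary mIoU, so the closed-set metric is a special case. The relaxed score is higher here mainly because predicting "couch" for "sofa" is no longer treated as a complete error.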

5. Current Limitations and Open Challenges

Despite rapid progress, several limitations persist:

Pixel-level granularity and background modeling: CLIP’s native spatial features are coarse and strongly biased toward the global [CLS] token (Li et al., 2024). Designs such as SegEarth-OV counter this by subtracting global embeddings and by applying a separate, training-free upsampler.
Attribute and part-level compositionality: Recognition and segmentation of part and attribute entities remain weak due to limited attribute-part grounding (OVSegmentor, HIPIE). Hierarchical and attribute-based schemes improve generalization (Ma et al., 2023, Wang et al., 2023).
Ambiguity in class prompts and evaluation: Evaluating models that return rare or auto-generated class labels requires LLM-based or path-similarity scoring, but these introduce new ambiguity and comparability issues (Ülger et al., 2023, Zhou et al., 2023).
Adaptation to video and audio-visual domains: While models like OV2VSS and OV-AVSS extend to spatio-temporal and multi-modal segmentation, challenges include tracking, temporal consistency, and the fusion of dynamic features with large-scale linguistic spaces (Li et al., 2024, Guo et al., 2024).
Generalization to specialized or low-resource domains: Domain transfer (remote sensing, camouflage, long-tail objects, open-set classes) requires careful architecture choices, bias correction, or independent training-free solutions (Li et al., 2024, Pang et al., 2023).
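The global-bias issue in the first item above can be made concrete with a toy example. The sketch below assumes the correction is a scaled subtraction of one global embedding from each patch feature; the scaling factor `lam`, the 2-dimensional vectors, and the function names are hypothetical and are not taken from SegEarth-OV.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_label(feature, text_embeds):
    """Class name with the highest cosine similarity to `feature`."""
    return max(text_embeds, key=lambda name: cosine(feature, text_embeds[name]))

def debias(patch, global_embed, lam):
    """Subtract a scaled copy of the global embedding from a patch feature."""
    return [p - lam * g for p, g in zip(patch, global_embed)]

# Toy setup (assumption): the global embedding points toward "building",
# and it leaks into every patch feature, including one that shows water.
text_embeds = {"building": [1.0, 0.0], "water": [0.0, 1.0]}
global_embed = [1.0, 0.0]
patches = [[1.0, 0.2], [0.8, 0.7]]
lam = 0.3  # hypothetical strength of the correction

before = [best_label(p, text_embeds) for p in patches]
after = [best_label(debias(p, global_embed, lam), text_embeds) for p in patches]
print(before)
print(after)
```

In this toy case the second patch flips from "building" to "water" once the shared global component is reduced, while the first patch keeps its label. How strongly to subtract is a design choice that would need tuning on real data.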

6. Perspectives and Future Research Directions

Important future directions are identified across recent work:

Training-Free, Domain-Adaptive Systems: The trend toward zero-shot, no-training solutions for new domains is evidenced by SegEarth-OV and OVDiff, suggesting further research on plug-and-play architectures and upsamplers for various data regimes (Kombol et al., 28 May 2025, Li et al., 2024, Karazija et al., 2023).
Hierarchical and Attribute-Enriched Representations: Hierarchy-aware models (HIPIE) and attribute aggregation frameworks (AttrSeg) demonstrate gains especially for fine-grained and compositional segmentation, pointing to explicit multi-level and attribute-driven supervision (Wang et al., 2023, Ma et al., 2023).
LLM Integration and Prompt Engineering: Vision-language models in the loop for auto-generation of prompts, vocabulary, attribute sets, and evaluation (e.g., LAVE, dataset curation in USE) are likely to form the backbone of future OVSS pipelines (Wang et al., 2024, Ülger et al., 2023).
Unified Multi-modal, Multi-granularity Benchmarks: Datasets and standardized open-vocabulary evaluation metrics (Open mIoU/PQ/AP) will be needed to assess broad, compositional, and open-world recognition in a consistent manner (Zhou et al., 2023, Wei et al., 2023).
Efficient, Lightweight Adapters and Self-supervised Fusion: Plug-in (fine-tuning-free) modules such as GBA, correlation/masking adapters, and fusion transformers over frozen backbones show strong data-efficiency with low overfitting risk (Xu et al., 2024, Rahman et al., 28 Jan 2025).
Scalable, Foundation-Model-Powered Data Pipelines: Automated, large-scale text–segment pair mining can drive robust segment classifiers, as demonstrated in USE and similar pipelines (Wang et al., 2024).

7. Impact and Significance

Open-vocabulary segmentation research is shifting the paradigm of semantic understanding from rigid, closed sets to flexible, compositional, and extensible recognition. The convergence of scalable foundation models, automated data curation, and robust, training-free or lightly-adapted algorithms has driven rapid improvements across metrics, applications, and generalization capabilities. However, open challenges remain around evaluation methodology, compositionality, domain adaptation, and the faithful handling of the long-tail distribution of real-world visual concepts. Continued progress at the intersection of vision, language, and self-supervised learning will be central to realizing universal, truly "open-world" segmentation systems (Kombol et al., 28 May 2025, Wang et al., 2024, Zhang et al., 2023).