Open-Vocabulary Camouflage Segmentation

Updated 19 May 2026

OVCOS is a computer vision task that generates pixel-wise masks for camouflaged objects while identifying them from an open and diverse vocabulary.
It leverages advanced vision-language models, diffusion-based features, and prompt-adaptive segmentation architectures to overcome visual ambiguity.
Recent approaches integrate lightweight adapters, multi-scale feature alignment, and cascaded architectures to enhance segmentation precision and classification accuracy.

Open-Vocabulary Camouflage Segmentation (OVCOS) is a computer vision task that unifies the challenges of camouflaged object segmentation (COS) with open-vocabulary recognition. OVCOS requires models to generate pixel-wise segmentation masks for camouflaged objects while simultaneously recognizing the semantic class from an open, potentially unbounded vocabulary, including categories unseen during training. The intrinsic difficulty arises from extreme visual ambiguity (objects blend closely into the background) and the demand for strong generalization beyond the closed-set regime. Research in this domain has been catalyzed by dedicated benchmarks, model architectures emphasizing vision-language alignment, and a critical re-examination of the classification bottleneck. OVCOS methods have advanced rapidly since 2023, incorporating frozen and fine-tuned vision-language models (VLMs), prompt-adaptive segmentation architectures, diffusion-based features, and parameter-efficient adapters.

1. Problem Definition and Benchmark Datasets

OVCOS is formally defined as learning a function $f_\theta(X; c) \to \hat{M}(c) \in [0,1]^{H \times W}$, where $X \in \mathbb{R}^{H \times W \times 3}$ is an input image and $c$ is a class from an open vocabulary $\mathcal{C}$. For each $c$, the aim is to predict a mask $\hat{M}(c)$ that approximates the ground-truth mask $M(c)$ for camouflaged objects of class $c$, including classes not encountered during training (Pang et al., 2023).

The OVCamo dataset was introduced to address limitations of legacy camouflaged image datasets, such as ambiguous class definitions and limited annotation diversity. OVCamo contains 11,483 finely annotated images across 75 carefully unified object classes, with a strict separation of 14 classes for training (seen) and 61 for testing (unseen). Unlike standard open-vocabulary segmentation datasets, OVCamo is tailored for fine-grained camouflage scenarios: prevalent small object sizes (median area ratio $\approx$ 0.05), minimal foreground–background color contrast (median color ratio $\approx$ 1.1), and long-tailed class distributions. Evaluation metrics include class-aware IoU (cIoU), class-aware S-measure (cSm), mean absolute error (cMAE), and variants adapted from dense prediction tasks (Pang et al., 2023).
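The class-aware metrics couple mask quality with classification correctness. The following is a minimal sketch of that idea, not the benchmark's official evaluation code: it assumes that a sample whose predicted class is wrong contributes an IoU of zero, and that masks are binary. The function names `iou` and `class_aware_iou` and the toy data are illustrative.

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask: 1 marks an object pixel


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    pairs = [(p, g) for prow, grow in zip(pred, gt) for p, g in zip(prow, grow)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Two empty masks are treated as a perfect match (a convention of this sketch).
    return inter / union if union else 1.0


def class_aware_iou(
    pred_masks: Sequence[Mask],
    pred_classes: Sequence[str],
    gt_masks: Sequence[Mask],
    gt_classes: Sequence[str],
) -> float:
    """Mean IoU in which a wrong class label zeroes the sample (assumed rule)."""
    scores = [
        iou(pm, gm) if pc == gc else 0.0
        for pm, pc, gm, gc in zip(pred_masks, pred_classes, gt_masks, gt_classes)
    ]
    return sum(scores) / len(scores)


# Toy example: identical masks, but the second sample is misclassified.
gt = [[1, 1], [0, 0]]
pred = [[1, 0], [0, 0]]
score = class_aware_iou([pred, pred], ["moth", "frog"], [gt, gt], ["moth", "moth"])
```

Under this assumed rule, a perfect mask with a wrong label scores no better than an empty prediction, which is why classification errors weigh so heavily on class-aware scores.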

2. Single-stage and Cascaded Architectures

OVCOS research has produced both single-stage and multi-stage model architectures. The OVCoser baseline is a single-stage segmentation transformer operating atop a frozen CLIP backbone, integrating two key modules: Semantic Guidance (SG) and Structure Enhancement (SE) (Pang et al., 2023). SG injects normalized text embeddings into each decoder layer via spatial weights, aligning visual features with the target class. SE branches extract edge and depth cues to enhance boundary localization—particularly critical in camouflage contexts. The OVCoser decoder iteratively refines masks, updating a spatial prior to improve object delineation over successive passes, and uses masked average-pooling for class identification.
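The masked average-pooling step used for class identification can be sketched as follows. This is an illustrative pure-Python version under simplifying assumptions (nested lists instead of tensors, a soft mask with values in [0, 1]); the name `masked_average_pool` and the toy feature map are hypothetical and not taken from the OVCoser implementation.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x D visual features
SoftMask = List[List[float]]          # H x W mask weights in [0, 1]


def masked_average_pool(features: FeatureMap, mask: SoftMask) -> List[float]:
    """Average the D-dimensional features, weighting each pixel by its mask value."""
    dim = len(features[0][0])
    total = [0.0] * dim
    weight = 0.0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            weight += m
            for k in range(dim):
                total[k] += m * vec[k]
    # Guard against an all-zero mask.
    return [t / max(weight, 1e-8) for t in total]


features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
hard_mask = [[1.0, 0.0], [0.0, 0.0]]
pooled = masked_average_pool(features, hard_mask)
```

The pooled vector summarizes only the predicted object region, so it can be compared against text embeddings of candidate classes without background features dominating the average.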

Recent advances have favored cascaded (two-stage) frameworks. In COCUS (Zhao et al., 24 Jun 2025), segmentation is first performed by a Segment Anything Model (SAM) variant prompted explicitly by VLM-derived semantic embeddings. The predicted mask is then re-integrated into a vision-language classifier by treating the mask as an additional alpha channel, preserving scene context and providing spatial guidance. This design alleviates the domain gap caused by hard-cropping (i.e., presenting only cropped windows of the object for classification), which fundamentally misaligns with the holistic training protocol of most VLMs.
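The contrast between hard-cropping and the alpha-channel design can be illustrated with a small sketch. It only shows how the two classifier inputs differ; how a vision-language classifier consumes the fourth channel is not covered, and the helper names `add_alpha_channel` and `hard_crop` are hypothetical.

```python
from typing import List, Sequence

Image = List[List[Sequence[int]]]  # H x W pixels, each an (R, G, B) triple
Mask = List[List[int]]             # H x W binary mask


def add_alpha_channel(image: Image, mask: Mask) -> List[List[List[int]]]:
    """Keep the whole scene and append the mask as a fourth channel."""
    return [
        [list(px) + [m] for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def hard_crop(image: Image, mask: Mask) -> Image:
    """Baseline: keep only the mask's bounding box, discarding scene context."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    if not rows:
        return image  # nothing to crop around
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


image = [
    [(10, 10, 10), (200, 0, 0)],
    [(10, 10, 10), (10, 10, 10)],
]
mask = [[0, 1], [0, 0]]
rgba = add_alpha_channel(image, mask)  # still 2 x 2, now with 4 channels
crop = hard_crop(image, mask)          # 1 x 1, background removed
```

The alpha-channel input preserves the full spatial extent of the image, whereas the crop changes both the resolution and the context that the classifier sees.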

3. Integration of Vision-Language Models and Prompt Engineering

A hallmark of modern OVCOS methods is the use of pre-trained vision-language models, especially CLIP, both for semantic query construction and as the feature backbone. Text prompts are carefully engineered: templates such as "A photo of the camouflaged ⟨class⟩" or "A photo of a ⟨class⟩ camouflaged in the background" are tokenized and fed into the frozen or lightly adapted text encoder (Zhang et al., 29 Sep 2025, Zhao et al., 24 Jun 2025, Pang et al., 2023).
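Filling the two quoted templates for a list of class names can be sketched as below. Tokenization and text encoding are left out because they depend on the specific VLM; the `{cls}` placeholder and the function name `build_prompts` are conventions of this sketch.

```python
from typing import Dict, List, Sequence

# The two templates quoted in the text, with {cls} standing in for the class slot.
TEMPLATES = (
    "A photo of the camouflaged {cls}",
    "A photo of a {cls} camouflaged in the background",
)


def build_prompts(
    class_names: Sequence[str], templates: Sequence[str] = TEMPLATES
) -> Dict[str, List[str]]:
    """Return every template filled in for every candidate class."""
    return {name: [t.format(cls=name) for t in templates] for name in class_names}


prompts = build_prompts(["moth", "leaf insect"])
```

Keeping several prompts per class makes it straightforward to average or ensemble their text embeddings later.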

In classifier-centric frameworks, lightweight adapters are inserted into the CLIP text encoder’s final layer, enhancing the representation of class queries without full re-training (Zhang et al., 29 Sep 2025). Adapter outputs are residually added to the base encoding, and similarity to visual embeddings (obtained from the image, sometimes concatenated with a coarse prediction mask) is computed via cosine similarity and softmax over candidate classes. Iterative feedback—feeding segmentation masks back into the vision encoder to spatially narrow down the region for classification—has been demonstrated to improve both mask refinement and class prediction.
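The residual adaptation and similarity scoring described above can be sketched in a few lines. The adapter outputs are supplied as fixed toy vectors rather than computed by a trained network, and the `temperature` value is an assumption of this sketch, not a figure from the cited work.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-8)


def residual_adapt(base: List[float], delta: List[float]) -> List[float]:
    """Add the adapter output to the frozen text embedding."""
    return [b + d for b, d in zip(base, delta)]


def class_probabilities(
    visual: List[float],
    text_embeddings: Dict[str, List[float]],
    temperature: float = 0.07,  # assumed value
) -> Dict[str, float]:
    """Softmax over temperature-scaled cosine similarities to each class query."""
    logits = {c: cosine(visual, e) / temperature for c, e in text_embeddings.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


base = {"moth": [1.0, 0.0], "frog": [0.0, 1.0]}
delta = {"moth": [0.0, 0.2], "frog": [0.1, 0.0]}  # hypothetical adapter outputs
adapted = {c: residual_adapt(base[c], delta[c]) for c in base}
probs = class_probabilities([0.9, 0.3], adapted)
```

Because the adapter output is added residually, a zero adapter output reproduces the frozen encoder's behaviour exactly, which keeps early training close to the pre-trained model.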

Prompt-tuning is further optimized with test-time augmentation and strategies such as confidence-weighted ensembling and prompt diversity, with ablations confirming the substantial gains derived from careful prompt design and classifier adaptation (Zhang et al., 29 Sep 2025, Zhao et al., 24 Jun 2025).
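One simple form of confidence-weighted ensembling over augmented views is sketched below. It assumes that a view's confidence is its maximum class probability, which is a choice made for this illustration rather than the definition used in the cited papers.

```python
from typing import Dict, List


def confidence_weighted_ensemble(
    view_probs: List[Dict[str, float]]
) -> Dict[str, float]:
    """Average per-view class probabilities, weighting each view by its peak probability."""
    weights = [max(p.values()) for p in view_probs]
    total = sum(weights)
    classes = list(view_probs[0])
    return {
        c: sum(w * p[c] for w, p in zip(weights, view_probs)) / total
        for c in classes
    }


# Two augmented views of the same image: one confident, one uncertain.
views = [
    {"moth": 0.9, "frog": 0.1},
    {"moth": 0.4, "frog": 0.6},
]
fused = confidence_weighted_ensemble(views)
```

In this toy case the confident view dominates, so the fused prediction favours "moth" more strongly than a plain average of the two views would.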

4. Advances in Adapter-based and Parameter-Efficient Learning

A critical development is the identification of the classification component as the chief bottleneck in OVCOS pipelines, superseding segmentation quality as the main limiting factor (Zhang et al., 29 Sep 2025). Efforts to mitigate this have focused on parameter-efficient fine-tuning mechanisms, notably via lightweight adapters. In the classifier-centric adaptive framework, a three-layer bottleneck adapter is applied at the final layer of the CLIP text encoder, with a novel Layered Asymmetric Initialization (LAI): each adapter matrix is initialized from a zero-mean Gaussian with strictly decreasing variance from input to output (i.e., $\sigma_1^2 > \sigma_2^2 > \sigma_3^2$, where $\sigma_i^2$ denotes the variance of the $i$-th layer). This initialization empirically encourages richer adaptation capacity for the compressed space while preserving downstream stability.
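A minimal sketch of an initialization with strictly decreasing variance across three adapter layers follows. The standard deviations, the square middle layer, and the names `init_lai_adapter` and `empirical_std` are assumptions made for illustration; the cited work may use different shapes and values.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def gaussian_matrix(rows: int, cols: int, std: float, rng: random.Random) -> Matrix:
    """Matrix with i.i.d. zero-mean Gaussian entries of the given standard deviation."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def init_lai_adapter(
    dim: int,
    bottleneck: int,
    stds: Sequence[float] = (0.02, 0.01, 0.005),  # assumed values
    seed: int = 0,
) -> List[Matrix]:
    """Down, middle, and up projections with strictly decreasing init variance."""
    if len(stds) != 3 or not (stds[0] > stds[1] > stds[2] > 0):
        raise ValueError("stds must be three strictly decreasing positive values")
    rng = random.Random(seed)
    shapes = [(dim, bottleneck), (bottleneck, bottleneck), (bottleneck, dim)]
    return [gaussian_matrix(r, c, s, rng) for (r, c), s in zip(shapes, stds)]


def empirical_std(matrix: Matrix) -> float:
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


layers = init_lai_adapter(dim=64, bottleneck=16)
spread = [empirical_std(m) for m in layers]  # decreasing from input to output
```

Giving the output projection the smallest variance keeps the adapter's initial contribution to the residual sum small, which matches the stated goal of preserving downstream stability.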

Quantitatively, adapter-enhanced frameworks raise cIoU from OVCoser’s 0.443 (CLIP baseline) to 0.493 (adapter + LAI + TTA), confirming the centrality of classification adaptation. Perfect “oracle” classification would further raise cIoU to 0.570, indicating substantial remaining headroom for future improvements (Zhang et al., 29 Sep 2025).
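The reported cIoU figures can be placed side by side with a short worked calculation. The $\Delta$ symbols are introduced here for convenience, and the comparison assumes that all three numbers were measured under the same protocol.

```latex
\begin{align*}
\Delta_{\text{adapter}} &= 0.493 - 0.443 = 0.050 \\
\Delta_{\text{oracle}}  &= 0.570 - 0.493 = 0.077 \\
\frac{\Delta_{\text{adapter}}}{0.570 - 0.443} &= \frac{0.050}{0.127} \approx 0.39
\end{align*}
```

Under that assumption, the adapter pipeline recovers roughly 39% of the gap between the baseline and the oracle classifier, leaving about 0.08 cIoU attributable to classification errors alone.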

Parameter-Efficient Fine-Tuning (PEFT) has also been adopted in purely vision-based zero-shot COS frameworks (Lei et al., 2024). These insert small adapters parallel to the MLPs of the vision transformer encoder (e.g., AdaptFormer), substantially reducing the number of trainable parameters relative to full fine-tuning, and achieving state-of-the-art zero-shot F-measure scores despite training solely on salient object segmentation data.
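The parameter savings of a parallel bottleneck adapter can be estimated with simple counting. The layer sizes below are ViT-B-like values chosen for illustration, not figures from the cited work, and the adapter is assumed to consist of one down-projection and one up-projection with biases.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_params(dim: int, hidden: int) -> int:
    """Transformer feed-forward block: dim -> hidden -> dim."""
    return linear_params(dim, hidden) + linear_params(hidden, dim)


def adapter_params(dim: int, bottleneck: int) -> int:
    """Parallel bottleneck adapter: dim -> bottleneck -> dim."""
    return linear_params(dim, bottleneck) + linear_params(bottleneck, dim)


# Assumed sizes for one encoder block (illustrative, not from the source).
DIM, HIDDEN, BOTTLENECK = 768, 3072, 64
frozen = mlp_params(DIM, HIDDEN)             # stays fixed during fine-tuning
trainable = adapter_params(DIM, BOTTLENECK)  # the only part that is updated
ratio = trainable / frozen
```

With these assumed sizes, the adapter adds about 2% of the parameters of the MLP it runs alongside, which illustrates why such designs are far cheaper to train than full fine-tuning.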

5. Multi-Scale Feature Alignment and Cross-Domain Fusion

The importance of dense, multi-scale feature aggregation is highlighted in both diffusion-based and transformer models. In open-vocabulary diffusion models, a frozen Stable Diffusion U-Net serves as an image feature backbone, from which multi-scale feature maps are aggregated and fused with CLIP-derived text embeddings (Vu et al., 2023). The Multi-Scale Feature Fusion (MSFF) module concatenates multiple encoder features and applies convolutions and elementwise multiplication, generating a unified mask-relevant representation.

Textual-Visual Aggregation (TVA) modules leverage predicted instance masks to crop and pool features for class prediction, with normalization and attention reweighting to align visual and text-driven cues. Camouflaged Instance Normalization (CIN), an AdaIN-inspired affine normalization module, further refines mask predictions per instance.
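The AdaIN-style operation that CIN builds on can be sketched for a single feature channel. This shows only the generic normalize-then-affine step; how CIN derives the per-instance scale and shift is not specified above, so `gamma` and `beta` are passed in as plain numbers and the function name is hypothetical.

```python
import math
from typing import List


def instance_normalize(
    channel: List[float], gamma: float, beta: float, eps: float = 1e-5
) -> List[float]:
    """Standardize one channel, then apply an instance-specific scale and shift."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in channel]


# gamma and beta would come from an instance embedding in a real model (assumption).
out = instance_normalize([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
```

After the transform, the channel's mean equals `beta` and its standard deviation is close to `gamma`, so the instance-specific parameters fully control the feature statistics.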

Parameter-efficient approaches for zero-shot segmentation utilize masked image modeling (MIM) pre-trained encoders, coupled with fine-grained alignment modules (Multi-scale Fine-grained Alignment, MFA) to bridge caption and vision features at multiple spatial levels (Lei et al., 2024).

6. Zero-shot and Annotation-Free OVCOS

A research emphasis has emerged on zero-shot OVCOS: segmentation and recognition of camouflaged objects without exposure to any camouflaged instances or annotations during training. By leveraging the local pattern bias learned from salient object segmentation (SOS) datasets and transferring broad semantic structure via masked autoencoding and MLLM caption alignment, performant zero-shot COS is now achievable (Lei et al., 2024). These methods employ MIM-pretrained vision transformers, BLIP-2 MLLM caption encoders, and a learnable codebook that replaces the MLLM at inference for fast, annotation-free OVCOS (e.g., 18.1 FPS on an RTX 4060Ti, model size 332M). Notably, this pipeline outperforms weakly supervised competitors by 5–10 F-measure points on standard COS datasets despite never seeing camouflage data in training.

A plausible implication is that open-vocabulary segmentation, when properly equipped with transferable visual priors and multi-modal supervision (even if purely SOS-based), can approach or match the performance of more heavily supervised baselines for highly ambiguous domains such as camouflage.

7. Limitations, Ablation Insights, and Ongoing Challenges

Current OVCOS pipelines reveal several persistent obstacles. First, classification remains the rate-limiting bottleneck, with even state-of-the-art frameworks exhibiting a significant performance gap to oracle classifiers (cIoU difference of approximately 0.08) (Zhang et al., 29 Sep 2025). External mask proposals, simplistic spatial priors, or reliance solely on frozen CLIP features may limit ultimate segmentation precision. Test-time augmentation, while effective, increases inference cost.

Dataset bias and generalization remain open issues: principal benchmarks like OVCamo are validated on naturalistic camouflage scenes; transferability to specialized domains (e.g., medical, aerial) is largely untested. Decoder and alignment modules (e.g., MLP-based mask refinement) may restrict fine mask resolution or boundary sharpness, especially when camouflaged objects are extremely large or fragmented (Lei et al., 2024, Pang et al., 2023).

Table: Summary of OVCOS model performance on OVCamo (unseen categories). Higher is better for all metrics except cMAE.

Method	cSm	$cF^{\omega}_{\beta}$	cMAE	$cF_{\beta}$	cEm	cIoU
OVCoser’24	0.579	0.490	0.336	0.520	0.616	0.443
Adapter+LAI+TTA	0.658	0.547	0.239	0.582	0.696	0.493
COCUS (cascaded)	0.668	0.615	0.265	0.631	0.697	0.568

Ongoing research explores coupling adapter strategies with segmentation-head co-training, theoretical analysis of asymmetric initializations, and prompt engineering refinement for further performance gains (Zhang et al., 29 Sep 2025). Dynamic codebook learning and cross-modal alignment remain key areas for extending OVCOS capacity to broader, more diverse vocabularies and novel semantic domains (Lei et al., 2024).

References

(Pang et al., 2023) Open-Vocabulary Camouflaged Object Segmentation
(Zhang et al., 29 Sep 2025) Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation
(Zhao et al., 24 Jun 2025) Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models
(Vu et al., 2023) Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation
(Lei et al., 2024) Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations