CLIP-Conditioned Objectness Adjustment (COAT)
- The paper introduces COAT, a framework that uses CLIP-derived embeddings to adjust objectness scores, improving detection of novel and rare classes.
- It recalibrates region proposals in both mask-based and detection pipelines by integrating test-time CLIP conditioning and hierarchical confidence calibration.
- Experimental results show consistent gains in panoptic quality and object detection metrics with minimal computational overhead.
CLIP-Conditioned Objectness Adjustment (COAT) is a framework designed to mitigate objectness bias and improve region-to-text alignment in open-vocabulary vision tasks, notably panoptic segmentation and detection. COAT leverages large-scale vision–language pretraining from CLIP to recalibrate objectness estimation in both mask-based (Mask2Former-style) and detection (Faster R-CNN/Mask R-CNN) pipelines, compensating for the well-documented tendency of region proposal or mask heads trained on closed vocabularies to underestimate or suppress novel-class regions during inference. The principal technical innovation is the use of CLIP-derived region-level embeddings to condition or directly predict objectness, either through test-time adjustment (as in panoptic segmentation) or via a lightweight adaptation of CLIP itself (LoCLIP) for proposal scoring. COAT also incorporates hierarchical calibration of region–text matching, further refining pseudo-label quality for open-vocabulary detection models. Implementations in both panoptic segmentation and object detection yield consistent gains, especially for unseen or out-of-vocabulary (OOV) classes, with negligible computational or memory overhead (Kormushev et al., 22 Mar 2026, Lee et al., 25 Apr 2026).
1. Motivation and Problem Statement
Standard region scoring heads, such as objectness predictors in RPNs or mask transformers, are typically trained on closed-set datasets. They exhibit a pronounced bias against objects from categories absent during training, yielding low objectness for OOV proposals. This leads to systematic pruning of high-quality masks/boxes depicting rare, fine-grained, or previously unseen objects. Concurrently, vision–language models like CLIP, while robust at global image classification, are suboptimal at localized, region-level classification and objectness prediction. COAT addresses both failures by using CLIP’s open-vocabulary language correspondence as a conditioning signal on objectness, thereby “lifting” low objectness scores for OOV regions when language cues provide high certainty.
2. CLIP-Conditioned Objectness Adjustment in Panoptic Segmentation
In Mask2Former-style open-vocabulary panoptic segmentation, each mask proposal $i$ is scored by:
- $o_i$: mask-transformer objectness (foreground/background gating)
- $p_i$: per-vocabulary-class probability vector, optionally ensembled
The standard classification vector for mask $i$ gates $p_i$ by $o_i$, so a low $o_i$ routes the proposal to background, causing OOV objects to be mis-suppressed.
The COAT adjustment mechanism proceeds as follows:
- Mask-Pooled CLIP Embedding: For the CLIP image-encoder feature map $F$ and the binary mask $m_i$, compute the mask-pooled embedding:
$$e_i = \frac{\sum_{x} m_i(x)\, F(x)}{\sum_{x} m_i(x)}$$
- Text Embedding Matrix: The CLIP text encoder yields $T = [t_1, \dots, t_K]$ for the $K$ target classes.
- CLIP-Based Classification Distribution:
$$q_i = \operatorname{softmax}_k\big(\cos(e_i, t_k)/\tau\big)$$
The “certainty” score $c_i$ is derived from $q_i$.
- Objectness Adjustment: $o_i$ is replaced by an adjusted objectness $\tilde{o}_i$ computed from $o_i$ and $c_i$, with $\alpha$ controlling trust in CLIP ($\alpha = 0.5$ empirically effective).
- Corrected Classification Score: the classification vector is recomputed with $\tilde{o}_i$ in place of $o_i$.
This routes high-CLIP-certainty proposals to foreground, even when $o_i$ alone would suppress them. COAT adds no learned parameters and operates exclusively at test time.
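The test-time adjustment can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the certainty is taken to be the maximum CLIP class probability, the adjusted objectness is a convex blend of the original objectness and that certainty (never lowering the original), and the corrected score gates the class probabilities by the adjusted objectness with the remainder assigned to background. The function names, the `temperature` value, and the flat list layout of the feature map are all hypothetical.

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors at locations where mask == 1 (mask must be non-empty)."""
    picked = [f for f, m in zip(features, mask) if m]
    dim = len(features[0])
    return [sum(v[d] for v in picked) / len(picked) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def coat_adjust(objectness, class_probs, region_emb, text_embs,
                alpha=0.5, temperature=0.01):
    """Return (adjusted objectness, scores over classes plus a background slot)."""
    clip_probs = softmax([cosine(region_emb, t) / temperature for t in text_embs])
    # ASSUMPTION: certainty is the maximum CLIP class probability.
    certainty = max(clip_probs)
    # ASSUMPTION: convex blend weighted by alpha, never below the original score.
    adjusted = max(objectness, (1 - alpha) * objectness + alpha * certainty)
    # ASSUMPTION: class scores are gated by objectness; the rest goes to background.
    scores = [adjusted * p for p in class_probs] + [1.0 - adjusted]
    return adjusted, scores

# Toy example: 3 feature locations, 2 of them inside the mask, 2 target classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
region = mask_pool(features, [1, 1, 0])
adjusted, scores = coat_adjust(0.1, [0.7, 0.3], region, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the mask-transformer objectness is 0.1 but CLIP matches the first class almost with certainty, so the blend lifts the objectness to roughly 0.55 at `alpha=0.5`. A proposal that already has high objectness and an ambiguous CLIP match is left unchanged by the `max`.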
3. LoCLIP: Region-Aware CLIP and Objectness Estimation in Detection
For open-vocabulary object detection (OVD), COAT introduces LoCLIP, a minimally adapted CLIP variant for unbiased objectness scoring in region proposals:
- The frozen CLIP Vision Transformer is extended with a learnable [OBJ] token added to the input.
- For each box/proposal $r$, masked attention zeroes out patch embeddings outside $r$, focusing [OBJ] purely on the region.
- After the final ViT layer, the [OBJ] token activation $z_r$ is passed through a single fully connected layer:
$$\hat{o}_r = \sigma(W z_r + b)$$
with weights $W$, bias $b$, and sigmoid $\sigma$; only $W$, $b$, and [OBJ] are trained.
- Objectness targets $y_r$ are defined by IoU with ground-truth boxes, and LoCLIP is trained with a standard BCE loss.
This approach yields reliable, bias-mitigated objectness scores for both base and OOV classes, requiring only about 3K trainable parameters and converging rapidly (Lee et al., 25 Apr 2026).
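The pieces around the frozen ViT can be sketched without a deep-learning framework. The following is an illustrative sketch only: the helper names are hypothetical, the region mask keeps every patch that overlaps the box, the head emits a single logit, and the training target is assumed to be the maximum IoU with any ground-truth box (the source states only that targets are defined by IoU).

```python
import math

def patch_mask(box, grid, patch):
    """Return a flat grid*grid list: 1 for ViT patches overlapping box, else 0.

    box is (x0, y0, x1, y1) in pixels; patch is the patch side length in pixels.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid):
        for col in range(grid):
            px0, py0 = col * patch, row * patch
            overlaps = (px0 < x1 and px0 + patch > x0 and
                        py0 < y1 and py0 + patch > y0)
            mask.append(1 if overlaps else 0)
    return mask

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def objectness_target(proposal, gt_boxes):
    # ASSUMPTION: soft target equal to the best IoU with any ground-truth box.
    return max((iou(proposal, g) for g in gt_boxes), default=0.0)

def objectness_head(z_obj, weight, bias):
    """Single FC layer plus sigmoid applied to the [OBJ] token activation."""
    logit = sum(w * z for w, z in zip(weight, z_obj)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy; also valid for soft targets in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
```

In a full implementation the mask from `patch_mask` would be turned into an attention mask so that the [OBJ] token attends only to in-region patches, and only the head weights and the [OBJ] embedding would receive gradients.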
4. Hierarchical Confidence Calibration for Region-to-Text Matching
COAT incorporates Hierarchical Confidence Calibration (HCC) to improve pseudo label reliability:
- For each region proposal, region and candidate class embeddings are compared via cosine similarity, followed by a softmax.
- Sub-category and super-category calibrations: Leveraging class hierarchies (obtained, e.g., from an LLM), softmax distributions over all sub- and supercategories are computed, and the resulting maximum probabilities per class are used to reweight initial class scores.
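- Final region-level class confidence is the mean of sub- and super-category calibrations:
$$c = \tfrac{1}{2}\left(c^{\text{sub}} + c^{\text{super}}\right)$$
- Pseudo labels are assigned only when the calibrated confidence $c$ exceeds a threshold $\theta_c$ and the objectness exceeds a threshold $\theta_o$.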
HCC mitigates erroneous background matches and enhances detection quality for both base and novel classes.
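One way to read the calibration steps above is sketched below. This is a hedged interpretation rather than the published procedure: "reweighting" is assumed to mean multiplying the initial class probability by the best sub- or super-category probability for that class, and the function names, input layout, and `temperature` default are hypothetical. Each class is assumed to have at least one sub-category and one super-category.

```python
import math

def softmax(values, temperature):
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def group_max(sims_per_class, temperature):
    """Softmax over all entries of all classes, then take the max within each class."""
    flat = [s for group in sims_per_class for s in group]
    probs = softmax(flat, temperature)
    maxima, start = [], 0
    for group in sims_per_class:
        maxima.append(max(probs[start:start + len(group)]))
        start += len(group)
    return maxima

def hcc(class_sims, sub_sims, super_sims, temperature=0.05):
    """Return one calibrated confidence per class for a single region.

    class_sims[k]  : cosine similarity between the region and class k
    sub_sims[k]    : similarities to the sub-categories of class k
    super_sims[k]  : similarities to the super-categories of class k
    """
    base = softmax(class_sims, temperature)
    sub_w = group_max(sub_sims, temperature)
    super_w = group_max(super_sims, temperature)
    # ASSUMPTION: reweighting is a product with the initial class probability.
    sub_cal = [p * w for p, w in zip(base, sub_w)]
    super_cal = [p * w for p, w in zip(base, super_w)]
    # Mean of the two calibrations, as stated in the text.
    return [(a + b) / 2.0 for a, b in zip(sub_cal, super_cal)]

confidence = hcc([0.30, 0.25], [[0.31, 0.20], [0.22]], [[0.28], [0.27]])
```

Because both weights are probabilities, the calibrated confidence can only shrink relative to the initial class probability, which is consistent with the stated goal of suppressing spurious matches.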
5. Training and Inference Workflow
COAT is integrated differently for panoptic segmentation and detection:
Panoptic Segmentation (Kormushev et al., 22 Mar 2026):
- Pretrain Mask2Former with standard losses.
- At test time, run Mask2Former for mask proposals and per-mask scores.
- Forward image through CLIP encoders to obtain pooled region and text embeddings.
- Apply COAT reweighting to objectness and classification as a post-processing step.
- No additional loss or retraining; only the CLIP trust factor $\alpha$ needs to be set.
Object Detection with LoCLIP and HCC (Lee et al., 25 Apr 2026):
- Train LoCLIP’s [OBJ] token and final FC layer for objectness on a small subset (about 1% of the data); all other CLIP weights frozen.
- For pseudo-label generation:
  - The RPN yields proposals.
  - LoCLIP and HCC assign pseudo-class and objectness confidences.
  - Filtered pseudo labels (those above the confidence and objectness thresholds) supplement GT data for Faster/Mask R-CNN training.
- OV detector is trained with loss terms down-weighted by HCC confidence and objectness for pseudo-labeled novel class regions.
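The filtering and loss down-weighting in the detection workflow can be sketched as follows. This is an illustration under assumptions: the dictionary keys and function names are hypothetical, the per-region weight is assumed to be the product of HCC confidence and objectness (the source says only that losses are down-weighted by both), and the thresholds are left as required arguments because their values are dataset-specific.

```python
def select_pseudo_labels(proposals, conf_thresh, obj_thresh):
    """Keep proposals above both thresholds and attach a loss weight to each."""
    kept = []
    for p in proposals:
        if p["confidence"] > conf_thresh and p["objectness"] > obj_thresh:
            # ASSUMPTION: weight = HCC confidence * LoCLIP objectness.
            weight = p["confidence"] * p["objectness"]
            kept.append({**p, "loss_weight": weight})
    return kept

def detector_loss(gt_losses, pseudo_losses, pseudo_weights):
    """Ground-truth terms count fully; pseudo-label terms are down-weighted."""
    weighted = sum(w * l for w, l in zip(pseudo_weights, pseudo_losses))
    return sum(gt_losses) + weighted

# Toy proposals; the class names and threshold values are placeholders.
proposals = [
    {"label": "class_a", "confidence": 0.9, "objectness": 0.8},
    {"label": "class_b", "confidence": 0.4, "objectness": 0.9},
    {"label": "class_c", "confidence": 0.9, "objectness": 0.2},
]
kept = select_pseudo_labels(proposals, conf_thresh=0.5, obj_thresh=0.5)
total = detector_loss([1.0, 2.0], [1.0], [k["loss_weight"] for k in kept])
```

Scaling the loss rather than hard-selecting lets borderline pseudo labels still contribute, with a smaller gradient than ground-truth boxes.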
6. Experimental Results and Impact
COAT improves every panoptic and detection benchmark reported below, especially on OOV/novel classes.
| Dataset / Setting | Baseline | +COAT | Uplift |
|---|---|---|---|
| ADE20K (panoptic PQ) | 26.8% | 27.6% | +0.8 pp |
| Mapillary (PQ) | 18.3% | 18.8% | +0.5 pp |
| Cityscapes (PQ) | – | +3% (OVRCOAT) | +3 pp |
| OV-COCO AP50 (novel) | 32.2 | 38.9 | +6.7 pts |
| OV-LVIS mAPN | 19.8 | 21.7 | +1.9 pts |
- Per-class gains on ADE20K revealed +25% relative PQ for unseen categories (“paintings”), with negligible impact on seen classes.
- Objectness quality for novel classes (Spearman $\rho$ between IoU and objectness): LoCLIP 0.473, standard RPN 0.038.
- Runtime overhead of COAT compared to prior pseudo-labeling baselines is minimal (+2.3% per image).
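The objectness-quality figure above is a Spearman rank correlation between each proposal's IoU with ground truth and its predicted objectness. A minimal sketch of how such a value could be computed is given below; the function names are illustrative, and ties receive average ranks, which is the usual convention but is not confirmed by the source.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # convention chosen here: undefined correlation reported as 0
    return cov / (sx * sy)

ious = [0.10, 0.40, 0.90, 0.60]
objectness = [0.20, 0.30, 0.95, 0.50]
rho = spearman(ious, objectness)
```

A value near 1 means objectness orders proposals the same way IoU does, while a value near 0 means the score carries little information about localization quality.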
A plausible implication is that as CLIP and related vision–language encoders grow in capacity and text alignment capability, lightweight conditioning techniques such as COAT will increasingly supplant traditional retraining- or adaptation-heavy debiasing approaches in open-vocabulary settings.
7. Implementation Considerations and Hyperparameters
- Panoptic COAT is parameter-free and test-time only; only the CLIP trust factor $\alpha$ is tuned (typically 0.5).
- LoCLIP training involves only the [OBJ] embedding and a 1-layer FC, optimized with AdamW on a small fraction of the training set, converging in minutes.
- For pseudo-labeling, the recommended HCC confidence and objectness thresholds are dataset-specific, with separate values for OV-COCO and OV-LVIS.
- Hierarchical calibration requires super- and sub-category lists per class (supplied, e.g., by LLMs).
- Final detector training is standard, except that loss weights for pseudo labels are scaled by HCC confidence and LoCLIP objectness.
For implementation fidelity, practitioners must prepare CLIP and RPN/backbone models, integrate masked-attention LoCLIP, acquire appropriate class hierarchies, and implement the region-to-text calibration step (Kormushev et al., 22 Mar 2026, Lee et al., 25 Apr 2026).