
CLIP-Conditioned Objectness Adjustment (COAT)

Updated 31 May 2026
  • The paper introduces COAT, a framework that uses CLIP-derived embeddings to adjust objectness scores, improving detection of novel and rare classes.
  • It recalibrates region proposals in both mask-based and detection pipelines by integrating test-time CLIP conditioning and hierarchical confidence calibration.
  • Experimental results show consistent gains in panoptic quality and object detection metrics with minimal computational overhead.

CLIP-Conditioned Objectness Adjustment (COAT) is a framework designed to mitigate objectness bias and improve region-to-text alignment in open-vocabulary vision tasks, notably panoptic segmentation and detection. COAT leverages large-scale vision–language pretraining from CLIP to recalibrate objectness estimation in both mask-based (Mask2Former-style) and detection (Faster R-CNN/Mask R-CNN) pipelines, compensating for the well-documented tendency of region proposal or mask heads trained on closed vocabularies to underestimate or suppress novel-class regions during inference. The principal technical innovation is the use of CLIP-derived region-level embeddings to condition or directly predict objectness, either through test-time adjustment (as in panoptic segmentation) or via a lightweight adaptation of CLIP itself (LoCLIP) for proposal scoring. COAT also incorporates hierarchical calibration of region–text matching, further refining pseudo-label quality for open-vocabulary detection models. Implementations in both panoptic segmentation and object detection yield consistent gains, especially for unseen or out-of-vocabulary (OOV) classes, with negligible computational or memory overhead (Kormushev et al., 22 Mar 2026, Lee et al., 25 Apr 2026).

1. Motivation and Problem Statement

Standard region scoring heads, such as objectness predictors in RPNs or mask transformers, are typically trained on closed-set datasets. They exhibit a pronounced bias against objects from categories absent during training, yielding low objectness for OOV proposals. This leads to systematic pruning of high-quality masks/boxes depicting rare, fine-grained, or previously unseen objects. At the same time, vision–language models like CLIP, while robust at global image classification, are suboptimal at localized, region-level classification and objectness prediction. COAT addresses both failures by using CLIP’s open-vocabulary language correspondence as a conditioning signal on objectness, thereby “lifting” low objectness scores for OOV regions when language cues provide high certainty.

2. CLIP-Conditioned Objectness Adjustment in Panoptic Segmentation

In Mask2Former-style open-vocabulary panoptic segmentation, each mask proposal $M_i$ is scored by:

  • $p_{\text{obj}}$: mask-transformer objectness (foreground/background gating)
  • $p_{\text{ens}}$: per-vocabulary-class probability vector, optionally ensembled

The standard classification vector for mask $i$ is:

$$p_{\text{cls}} = [\,p_{\text{ens}} \cdot p_{\text{obj}};\; 1 - p_{\text{obj}}\,]$$

where low $p_{\text{obj}}$ routes the proposal to background, causing OOV objects to be mis-suppressed.
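The suppression effect can be seen with a few lines of arithmetic. The following is a minimal sketch of the classification vector defined above; the function name and the toy probabilities are illustrative assumptions, not values from the cited papers.

```python
def classification_vector(p_ens, p_obj):
    """Return [p_ens * p_obj ; 1 - p_obj] for one mask proposal.

    The last entry is the background score.
    """
    return [p * p_obj for p in p_ens] + [1.0 - p_obj]


# Toy per-class probabilities (hypothetical values that sum to 1).
p_ens = [0.7, 0.2, 0.1]

confident = classification_vector(p_ens, p_obj=0.9)   # foreground class wins
suppressed = classification_vector(p_ens, p_obj=0.1)  # background entry dominates
```

With identical class probabilities, only the objectness changes the outcome: at low objectness the background entry dominates regardless of how confident the class prediction is.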

The COAT adjustment mechanism proceeds as follows:

  1. Mask-Pooled CLIP Embedding: For the CLIP image-encoder feature map $F_{\text{img}}(u,v)\in\mathbb{R}^E$, compute the mask-pooled embedding:

$$F_{\text{seg},i} = \frac{\sum_{u,v} M_i(u,v)\, F_{\text{img}}(u,v)}{\sum_{u,v} M_i(u,v)} \in \mathbb{R}^E$$

  2. Text Embedding Matrix: The CLIP text encoder yields $F_{\text{txt}}\in\mathbb{R}^{N_{\text{cls}}\times E}$ for $N_{\text{cls}}$ target classes.
  3. CLIP-Based Classification Distribution: $F_{\text{seg},i}$ is matched against the rows of $F_{\text{txt}}$ to give a CLIP distribution over the $N_{\text{cls}}$ classes. A “certainty” score is derived from this distribution.
  4. Objectness Adjustment: $p_{\text{obj}}$ is adjusted using the certainty score, with a trust factor controlling how much weight CLIP receives (0.5 is empirically effective).
  5. Corrected Classification Score: The classification vector is recomputed with the adjusted objectness in place of $p_{\text{obj}}$.

This routes high-CLIP-certainty proposals to foreground, even when $p_{\text{obj}}$ alone would ignore them. COAT is parameter-free and operates exclusively at test time.
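The first three steps can be sketched in plain Python on toy data. Mask pooling follows the formula above directly. The rest is assumed for illustration: cosine similarity followed by a softmax, the `temperature` value, and taking the maximum probability as the certainty score are not confirmed by the source, and all function names are hypothetical.

```python
import math


def mask_pool(feature_map, mask):
    """Mask-weighted average of per-pixel features: sum(M * F) / sum(M).

    Assumes the mask is non-empty (its entries do not sum to zero).
    """
    dim = len(feature_map[0][0])
    total = [0.0] * dim
    weight = 0.0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for k in range(dim):
                total[k] += m * feat[k]
    return [t / weight for t in total]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_distribution(region_emb, text_embs, temperature=0.01):
    """Softmax over region-to-text cosine similarities (assumed form)."""
    logits = [cosine(region_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy 2x2 feature map with E = 2; the mask selects the left column only.
F_img = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
M_i = [[1.0, 0.0],
       [1.0, 0.0]]
F_txt = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical class embeddings

F_seg = mask_pool(F_img, M_i)
p_clip = clip_distribution(F_seg, F_txt)
certainty = max(p_clip)  # assumption: certainty = highest class probability
```

In a real pipeline the feature map and text embeddings would come from the CLIP encoders; the nested lists here only stand in for those tensors.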

3. LoCLIP: Region-Aware CLIP and Objectness Estimation in Detection

For open-vocabulary object detection (OVD), COAT introduces LoCLIP, a minimally adapted CLIP variant for unbiased objectness scoring in region proposals:

  • The frozen CLIP Vision Transformer is extended with a learnable [OBJ] token added to the input.
  • For each box/proposal, masked attention zeroes out patch embeddings outside the box, focusing [OBJ] purely on the region.
  • After the final ViT layer, the [OBJ] token activation is passed through a single fully connected layer to produce the objectness score. Only that layer's parameters and the [OBJ] token are trained.
  • Objectness targets are defined by IoU with ground-truth boxes, and LoCLIP is trained with a standard BCE loss.

This approach yields reliable, bias-mitigated objectness scores for both base and OOV classes, requiring on the order of 3K trainable parameters and converging rapidly (Lee et al., 25 Apr 2026).
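The training signal described above (IoU-derived targets, a single fully connected layer, BCE loss) can be sketched as follows. This is an illustration under stated assumptions and not LoCLIP itself: the binary target with an IoU threshold of 0.5, the sigmoid output, and all names and toy numbers are hypothetical.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def objectness_target(proposal, gt_boxes, iou_threshold=0.5):
    """Assumed binary target: 1 if the best IoU with any GT box reaches the threshold."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return 1.0 if best >= iou_threshold else 0.0


def objectness_score(obj_activation, weight, bias):
    """Single fully connected layer on the [OBJ] activation, then a sigmoid (assumed)."""
    logit = sum(w * h for w, h in zip(weight, obj_activation)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def bce_loss(score, target, eps=1e-7):
    """Binary cross-entropy, clamped to avoid log(0)."""
    score = min(max(score, eps), 1.0 - eps)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
near = (0.0, 0.0, 10.0, 8.0)     # overlaps the GT box heavily
far = (20.0, 20.0, 30.0, 30.0)   # no overlap

target_near = objectness_target(near, gt_boxes)
target_far = objectness_target(far, gt_boxes)

# Hypothetical 2-dimensional [OBJ] activation and layer parameters.
score = objectness_score([0.5, -0.2], weight=[1.0, 2.0], bias=0.1)
loss = bce_loss(score, target_near)
```

Only `weight`, `bias`, and the [OBJ] token would receive gradients in the setup the text describes; the CLIP backbone that produces the activation stays frozen.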

4. Hierarchical Confidence Calibration for Region-to-Text Matching

COAT incorporates Hierarchical Confidence Calibration (HCC) to improve pseudo label reliability:

  • For each region proposal, region and candidate class embeddings are compared via cosine similarity, followed by a softmax.
  • Sub-category and super-category calibrations: Leveraging class hierarchies (obtained, e.g., from an LLM), softmax distributions over all sub- and supercategories are computed, and the resulting maximum probabilities per class are used to reweight initial class scores.
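  • Final region-level class confidence is the mean of the sub- and super-category calibrated scores, written here as $p_{\text{sub}}$ and $p_{\text{super}}$:

$$p_{\text{HCC}} = \tfrac{1}{2}\left(p_{\text{sub}} + p_{\text{super}}\right)$$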

  • Pseudo labels are assigned only when both the calibrated confidence and the objectness exceed their respective thresholds.

HCC mitigates erroneous background matches and enhances detection quality for both base and novel classes.
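A rough sketch of the calibration flow is given below, working directly on precomputed region-to-text cosine similarities. Several details are assumptions made for illustration: the multiplicative reweighting, the `temperature` value, the toy similarities, and all function names. Only the overall structure (softmax, per-class maximum over sub- and super-categories, final mean) follows the description above.

```python
import math


def softmax(scores, temperature=0.01):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def calibrate(class_probs, related_sims, temperature=0.01):
    """Reweight class probabilities using related (sub- or super-) category names.

    related_sims[c] holds the region-to-text similarities for the related
    names of class c; every class is assumed to have at least one.
    """
    flat = [s for sims in related_sims for s in sims]
    flat_probs = softmax(flat, temperature)  # one softmax over all related names
    calibrated, start = [], 0
    for p, sims in zip(class_probs, related_sims):
        best = max(flat_probs[start:start + len(sims)])  # max probability per class
        calibrated.append(p * best)  # assumption: multiplicative reweighting
        start += len(sims)
    return calibrated


def hcc_confidence(class_sims, sub_sims, super_sims, temperature=0.01):
    base = softmax(class_sims, temperature)
    sub = calibrate(base, sub_sims, temperature)
    sup = calibrate(base, super_sims, temperature)
    return [(a + b) / 2.0 for a, b in zip(sub, sup)]  # mean of both calibrations


# Toy similarities for two hypothetical classes.
class_sims = [0.30, 0.25]
sub_sims = [[0.32, 0.20], [0.22, 0.21]]
super_sims = [[0.31], [0.24]]

confidence = hcc_confidence(class_sims, sub_sims, super_sims)
```

A region that matches a class name but none of its sub- or super-categories ends up with a reduced confidence, which is the intuition behind filtering out spurious background matches.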

5. Training and Inference Workflow

COAT is integrated differently for panoptic segmentation and detection:

Panoptic Segmentation (Kormushev et al., 22 Mar 2026):

  • Pretrain Mask2Former with standard losses.
  • At test time, run Mask2Former for mask proposals and per-mask scores.
  • Forward image through CLIP encoders to obtain pooled region and text embeddings.
  • Apply COAT reweighting to objectness and classification as a post-processing step.
  • No additional loss or retraining is needed; the CLIP trust factor is the only setting.
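The post-processing step can be sketched as a loop over proposals. The adjustment rule used here, a convex blend of objectness and CLIP certainty, is a stand-in chosen for illustration; the source text does not give the exact formula. The names `adjust_objectness`, `postprocess`, and `trust`, and the toy values, are hypothetical.

```python
def adjust_objectness(p_obj, certainty, trust=0.5):
    """Blend mask objectness with CLIP certainty (illustrative stand-in rule)."""
    return (1.0 - trust) * p_obj + trust * certainty


def postprocess(proposals, trust=0.5):
    """Recompute [p_ens * p_obj ; 1 - p_obj] per proposal with adjusted objectness."""
    results = []
    for prop in proposals:
        p_adj = adjust_objectness(prop["p_obj"], prop["certainty"], trust)
        results.append([p * p_adj for p in prop["p_ens"]] + [1.0 - p_adj])
    return results


# One toy proposal: low mask objectness, high CLIP certainty.
proposals = [{"p_ens": [0.9, 0.1], "p_obj": 0.2, "certainty": 0.99}]

before = [p * 0.2 for p in [0.9, 0.1]] + [1.0 - 0.2]  # unadjusted vector
after = postprocess(proposals)[0]
```

In this toy case the unadjusted vector assigns the proposal to background (last entry), while the adjusted vector assigns it to the first class. No model weights are touched at any point.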

Object Detection with LoCLIP and HCC (Lee et al., 25 Apr 2026):

  • Train LoCLIP’s [OBJ] token and final FC layer for objectness on a small subset (on the order of 1% of the data); all other CLIP weights are frozen.
  • For pseudo label generation:
    • RPN yields proposals.
    • LoCLIP and HCC assign pseudo class and objectness confidences.
    • Filtered pseudo labels (those above the confidence and objectness thresholds) supplement GT data for Faster/Mask R-CNN training.
  • OV detector is trained with loss terms down-weighted by HCC confidence and objectness for pseudo-labeled novel class regions.
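The filtering and loss-weighting steps might look like the sketch below. The threshold values, the product form of the weight, the `PseudoLabel` container, and the toy candidates are all assumptions for illustration; the source specifies only that both thresholds must be passed and that losses are scaled by confidence and objectness.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    box: tuple
    label: str
    confidence: float  # HCC-calibrated class confidence
    objectness: float  # LoCLIP objectness score


def filter_pseudo_labels(candidates, conf_threshold, obj_threshold):
    """Keep candidates that pass both thresholds."""
    return [c for c in candidates
            if c.confidence > conf_threshold and c.objectness > obj_threshold]


def loss_weight(label):
    """Assumed weighting: product of confidence and objectness."""
    return label.confidence * label.objectness


def weighted_loss(per_region_losses, labels):
    """Sum of per-region losses, each scaled by its pseudo-label weight."""
    return sum(loss_weight(l) * loss for loss, l in zip(per_region_losses, labels))


candidates = [
    PseudoLabel((0, 0, 10, 10), "umbrella", confidence=0.9, objectness=0.8),
    PseudoLabel((5, 5, 20, 20), "kite", confidence=0.9, objectness=0.2),
    PseudoLabel((1, 1, 4, 4), "tie", confidence=0.3, objectness=0.9),
]

# Hypothetical thresholds; the benchmark-specific values are not reproduced here.
kept = filter_pseudo_labels(candidates, conf_threshold=0.5, obj_threshold=0.5)
total = weighted_loss([2.0], kept)
```

Ground-truth boxes for base classes would keep a weight of 1, so only the pseudo-labeled novel-class regions are down-weighted.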

6. Experimental Results and Impact

COAT uniformly improves panoptic and detection benchmarks, especially on OOV/novel classes.

| Dataset / Setting | Baseline | +COAT | Uplift |
| ADE20K (panoptic PQ) | 26.8% | 27.6% | +0.8 pp |
| Mapillary (PQ) | 18.3% | 18.8% | +0.5 pp |
| Cityscapes (PQ) | | +3% (OVRCOAT) | +3 pp |
| OV-COCO AP50 (novel) | 32.2 | 38.9 | +6.7 pts |
| OV-LVIS mAPN | 19.8 | 21.7 | +1.9 pts |

  • Per-class gains on ADE20K revealed +25% relative PQ for unseen categories (“paintings”), with negligible impact on seen classes.
  • Objectness quality for novel classes (Spearman correlation $\rho$ between IoU and objectness): LoCLIP 0.473, standard RPN 0.038.
  • Runtime overhead of COAT compared to prior pseudo-labeling baselines is minimal (+2.3% per image).
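The objectness-quality metric above is a rank correlation between each proposal's IoU with ground truth and its predicted objectness. A minimal standard-library sketch of Spearman correlation follows (ties receive average ranks); the toy lists are made up and unrelated to the reported numbers.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


ious = [0.1, 0.4, 0.6, 0.9]
aligned = [0.2, 0.3, 0.7, 0.8]   # objectness rises with IoU
inverted = [0.8, 0.7, 0.3, 0.2]  # objectness falls with IoU

rho_aligned = spearman(ious, aligned)
rho_inverted = spearman(ious, inverted)
```

A value near 1 means the objectness score orders proposals the same way IoU does, while a value near 0 means the score carries almost no ranking information about localization quality.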

A plausible implication is that as CLIP and related vision–language encoders grow in capacity and text alignment capability, lightweight conditioning techniques such as COAT will increasingly supplant traditional retraining- or adaptation-heavy debiasing approaches in open-vocabulary settings.

7. Implementation Considerations and Hyperparameters

  • Panoptic COAT adds no learned parameters and runs at test time only; the CLIP trust factor is the sole hyperparameter to tune (typically 0.5).
  • LoCLIP training involves only the [OBJ] embedding and a 1-layer FC, optimized with AdamW on a small fraction of the training set, converging in minutes.
  • For pseudo-labeling, separate thresholds on HCC confidence and on objectness are set for each benchmark (OV-COCO and OV-LVIS).
  • Hierarchical calibration requires super- and sub-category lists per class (supplied, e.g., by LLMs).
  • Final detector training is standard, except that loss weights for pseudo labels are scaled by HCC confidence and LoCLIP objectness.

For implementation fidelity, practitioners must prepare CLIP and RPN/backbone models, integrate masked-attention LoCLIP, acquire appropriate class hierarchies, and implement the region-to-text calibration step (Kormushev et al., 22 Mar 2026, Lee et al., 25 Apr 2026).
