CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No (2308.12213v2)

Published 23 Aug 2023 in cs.CV and cs.AI

Abstract: Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which only require class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers the logic of saying no within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples using positive-semantic prompts and negation-semantic prompts. Specifically, we design a novel learnable no prompt and a no text encoder to capture negation semantics within images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with no prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection by utilizing negation semantics from no prompts and the text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN, based on ViT-B-16, outperforms 7 well-used algorithms by at least 2.34% and 11.64% in terms of AUROC and FPR95 for zero-shot OOD detection on ImageNet-1K. Our CLIPN can serve as a solid foundation for effectively leveraging CLIP in downstream OOD tasks. The code is available on https://github.com/xmed-lab/CLIPN.

Authors (4)
  1. Hualiang Wang (20 papers)
  2. Yi Li (482 papers)
  3. Huifeng Yao (9 papers)
  4. Xiaomeng Li (109 papers)
Citations (68)

Summary

Overview of CLIPN for Zero-Shot OOD Detection

The paper "CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No" introduces a novel approach to enhancing the performance of Contrastive Language-Image Pre-Training (CLIP) models in zero-shot out-of-distribution (OOD) detection tasks. The primary aim of this work is to adapt CLIP to effectively identify and differentiate between in-distribution (ID) and OOD samples without extensive retraining on new datasets.

Methodology

The proposed method, CLIPN (CLIP with 'no' logic), modifies the standard CLIP model to capture negation semantics. The authors introduce learnable "no" prompts and a dedicated "no" text encoder alongside the original CLIP encoders. The "no" prompts negate the semantics associated with ID class names, enabling the model to learn when an image does not match a class and thereby to reject OOD samples.
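
To make the setup concrete, the following minimal PyTorch sketch (not the authors' implementation; the function and variable names are illustrative) shows how paired "yes"/"no" similarities would be formed once the frozen CLIP image encoder, the standard text encoder, and the "no" text encoder have produced their embeddings.

```python
import torch
import torch.nn.functional as F

def clipn_logits(image_feat, text_feat_yes, text_feat_no, temperature=0.07):
    """Paired similarities between images and the two prompt banks.

    image_feat:    (B, D) embeddings from the frozen CLIP image encoder
    text_feat_yes: (K, D) embeddings of the standard prompts, one per ID class
    text_feat_no:  (K, D) embeddings of the learnable "no" prompts from the "no" text encoder
    """
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat_yes = F.normalize(text_feat_yes, dim=-1)
    text_feat_no = F.normalize(text_feat_no, dim=-1)
    logits_yes = image_feat @ text_feat_yes.t() / temperature  # (B, K): "this is class k"
    logits_no = image_feat @ text_feat_no.t() / temperature    # (B, K): "this is not class k"
    return logits_yes, logits_no
```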

To optimize this framework, the authors design two novel loss functions:

  1. Image-Text Binary-Opposite Loss (ITBO): This loss teaches the model when an image should, and should not, be associated with a "no" prompt, aligning image features with the appropriate negation-semantic text features.
  2. Text Semantic-Opposite Loss (TSO): This ensures that features from standard prompts (positive logic) and "no" prompts (negative logic) are distant in the feature space, enhancing the semantic understanding of negation.

These loss functions are integral to training the model to differentiate between ID and OOD samples by exploiting negation semantics.
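
A hedged sketch of what these two objectives could look like is given below, continuing the snippet above; the exact formulation in the paper may differ (for instance in how the paired match probability is defined), and `itbo_loss` and `tso_loss` are illustrative names rather than the authors' code.

```python
def itbo_loss(logits_yes, logits_no, labels):
    """Illustrative image-text binary-opposite (ITBO) loss.

    For each image, the "no" prompt of its ground-truth class should lose the
    paired yes/no competition, pushing the image away from the negation
    semantics of its own class.
    """
    # Probability that the image matches the "no" prompt rather than the
    # standard prompt of each class (two-way paired comparison).
    p_no = torch.sigmoid(logits_no - logits_yes)               # (B, K)
    p_no_gt = p_no.gather(1, labels.unsqueeze(1)).squeeze(1)   # (B,)
    return -torch.log(1.0 - p_no_gt + 1e-8).mean()

def tso_loss(text_feat_yes, text_feat_no):
    """Illustrative text semantic-opposite (TSO) loss.

    Pushes the normalized "yes" and "no" embeddings of the same class apart;
    since both are unit vectors, their L2 distance is at most 2.
    """
    yes = F.normalize(text_feat_yes, dim=-1)
    no = F.normalize(text_feat_no, dim=-1)
    return (2.0 - (yes - no).norm(dim=-1)).mean()
```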

Inference and Evaluation

The paper proposes two threshold-free inference algorithms: Competing-to-Win (CTW) and Agreeing-to-Differ (ATD). CTW first predicts the ID class using the standard text encoder and then flags the sample as OOD if the corresponding "no" prompt wins the paired confidence comparison. ATD instead aggregates the per-class "no" probabilities into an explicit OOD class probability, classifying a sample as OOD when that probability exceeds every ID class probability.
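
Continuing the illustrative snippet above, the two decision rules might be sketched as follows; this is an assumption-laden approximation rather than the released code, and it reuses the paired "no" probability from the loss sketch.

```python
def ctw_is_ood(logits_yes, logits_no):
    """Competing-to-Win (sketch): pick the most likely ID class from the
    standard prompts, then flag the sample as OOD if the "no" prompt of that
    class wins the paired yes/no competition.
    """
    p_yes = logits_yes.softmax(dim=-1)                          # (B, K)
    p_no = torch.sigmoid(logits_no - logits_yes)                # (B, K)
    pred = p_yes.argmax(dim=-1)                                 # predicted ID class
    p_no_pred = p_no.gather(1, pred.unsqueeze(1)).squeeze(1)
    return p_no_pred > 0.5                                      # True -> treat as OOD

def atd_is_ood(logits_yes, logits_no):
    """Agreeing-to-Differ (sketch): the probability mass that the "no" prompts
    take away from each class is pooled into an extra OOD "class"; the sample
    is OOD when that pooled probability exceeds every ID class probability.
    """
    p_yes = logits_yes.softmax(dim=-1)
    p_no = torch.sigmoid(logits_no - logits_yes)
    p_id = p_yes * (1.0 - p_no)                                 # (B, K)
    p_ood = 1.0 - p_id.sum(dim=-1)                              # (B,)
    return p_ood > p_id.max(dim=-1).values
```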

Experimental evaluations demonstrate that CLIPN surpasses prior OOD detection methods in terms of AUROC and FPR95 across several benchmark datasets. Notably, with a ViT-B-16 backbone, CLIPN improves AUROC by at least 2.34% and reduces FPR95 by at least 11.64% on ImageNet-1K compared to state-of-the-art methods such as MCM.

Implications and Future Work

The implications of this research are substantial for the field of AI, particularly for applications requiring robust, adaptive models that can operate in open-world conditions. CLIPN sets a solid foundation for leveraging CLIP in OOD detection tasks, supporting wider application across diverse datasets without extensive fine-tuning.

Future developments may focus on extending the "no" logic to other vision-language models and improving robustness in real-world scenarios. The adaptability of CLIPN to specialized domains such as medical imaging could also be explored, potentially enhancing autonomous decision-making systems in safety-critical applications.

In summary, the introduction of negation semantics into CLIP for zero-shot OOD detection represents a substantive contribution to enhancing model robustness and flexibility in open-world AI applications.
