An Insightful Overview of "AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection"
The paper, "AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection", proposes a novel approach to enhance zero-shot anomaly detection (ZSAD) capabilities in vision-LLMs (VLMs) like CLIP. Anomaly detection is critical for applications where training data in the target domain are unavailable due to privacy reasons or domain novelty, and this paper addresses this problem by leveraging large pre-trained VLMs, which traditionally exhibit weak performance in zero-shot scenarios.
Core Contributions and Methodology
AnomalyCLIP's central idea is object-agnostic prompt learning: rather than prompts that name specific object classes, it learns generic text prompts that capture normality and abnormality in an image independent of the foreground object. By discarding irrelevant object semantics, the learned prompts let the model focus on the abnormal regions themselves, which is what allows a single model to transfer across very different domains.
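To make the template idea concrete, here is a minimal PyTorch sketch of object-agnostic prompts in the spirit of the paper: learnable context tokens concatenated with embeddings of the generic words "object" and "damaged" rather than a class name. Dimensions, initialization, and token handling are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    """Sketch of object-agnostic prompt learning: two learnable prompt
    templates (normality / abnormality) whose context tokens are optimized,
    with the class slot fixed to the generic word "object" instead of a
    dataset-specific class name."""

    def __init__(self, embed_dim=512, n_ctx=12):
        super().__init__()
        # Learnable context vectors, conceptually [V_1]...[V_E][object]
        # for normality and [W_1]...[W_E][damaged][object] for abnormality.
        self.ctx_normal = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.ctx_abnormal = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, object_tok, damaged_tok):
        # object_tok / damaged_tok: frozen token embeddings of the generic
        # words "object" and "damaged", each of shape [1, embed_dim].
        normal = torch.cat([self.ctx_normal, object_tok], dim=0)
        abnormal = torch.cat([self.ctx_abnormal, damaged_tok, object_tok], dim=0)
        # Both sequences would then pass through the frozen CLIP text encoder.
        return normal, abnormal
```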
To operationalize this, the paper uses learnable prompt templates whose context tokens are refined so that the resulting text embeddings align with both image-level and pixel-level visual features. Training combines global and local context optimization (the paper's "glocal" objective), so the prompts capture both overarching, image-wide anomaly cues and fine-grained, pixel-level ones. In addition, a Diagonally Prominent Attention Map (DPAM) replaces the standard attention in the visual encoder, refining the local visual semantics that the original Q-K attention tends to wash out, which is crucial for accurate anomaly segmentation. Both mechanisms are sketched below.
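As a hedged illustration of the glocal objective, the sketch below pairs an image-level cross-entropy term with pixel-level focal and Dice terms, which matches the kinds of losses the paper describes; the exact weighting, tensor shapes, and loss implementations here are assumptions.

```python
import torch
import torch.nn.functional as F

def glocal_loss(img_logits, img_labels, pix_logits, pix_masks, gamma=2.0):
    """Sketch of 'glocal' optimization: a global (image-level) term plus a
    local (pixel-level) term. Loss choices and weighting are illustrative.

    img_logits: [B, 2] similarities to (normal, abnormal) text embeddings.
    img_labels: [B] long tensor, 0 = normal image, 1 = anomalous image.
    pix_logits: [B, 2, H, W] per-pixel similarities.
    pix_masks:  [B, H, W] float ground-truth anomaly masks (0/1).
    """
    # Global term: standard cross-entropy on the image-level similarity.
    loss_global = F.cross_entropy(img_logits, img_labels)

    # Local term 1: focal loss, to cope with the extreme imbalance between
    # normal and abnormal pixels.
    probs = pix_logits.softmax(dim=1)               # [B, 2, H, W]
    p_abn = probs[:, 1]                             # probability of "abnormal"
    pt = torch.where(pix_masks.bool(), p_abn, 1.0 - p_abn)
    loss_focal = (-(1.0 - pt) ** gamma * torch.log(pt.clamp_min(1e-6))).mean()

    # Local term 2: Dice loss, rewarding overlap with the anomaly mask.
    inter = (p_abn * pix_masks).sum(dim=(1, 2))
    union = p_abn.sum(dim=(1, 2)) + pix_masks.sum(dim=(1, 2))
    loss_dice = (1.0 - (2.0 * inter + 1.0) / (union + 1.0)).mean()

    return loss_global + loss_focal + loss_dice
```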
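For DPAM, one way to obtain a diagonally prominent map, and one of the variants the paper discusses, is to replace the Q-K product with V-V self-attention, so each patch token attends mostly to itself and retains its local semantics. The following sketch shows that substitution for a single attention layer; shapes and scaling are chosen for illustration.

```python
import torch

def dpam_vv_attention(v, scale=None):
    """Sketch of DPAM via V-V self-attention: replacing Q @ K^T with
    V @ V^T yields an attention map with a prominent diagonal, preserving
    each patch token's local visual semantics for segmentation.

    v: value projections of the patch tokens, shape [B, N, D].
    """
    d = v.size(-1)
    scale = scale or d ** -0.5
    attn = (v @ v.transpose(-2, -1)) * scale   # [B, N, N], diagonally prominent
    attn = attn.softmax(dim=-1)
    return attn @ v                            # refined local patch features
```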
Experimental Evaluation and Results
The paper supports its claims with large-scale experiments on 17 real-world anomaly detection datasets spanning domains such as industrial defect inspection and medical imaging. AnomalyCLIP achieves superior performance in both anomaly classification and segmentation across datasets with highly diverse object content and textures. Notably, the object-agnostic prompts let the model generalize across vastly different domains without any training data from the target domain.
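At inference time, this zero-shot generalization reduces to comparing frozen visual embeddings against the two learned text embeddings. The sketch below shows one plausible scoring scheme using a softmax over cosine similarities; the temperature, tensor shapes, and the omitted upsampling of the pixel map are assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(global_feat, patch_feats, text_feats, temperature=0.07):
    """Sketch of zero-shot scoring: cosine similarity of frozen CLIP visual
    features against the learned (normal, abnormal) text embeddings.

    global_feat: [B, D] image embedding.
    patch_feats: [B, N, D] patch embeddings (N = number of patch tokens).
    text_feats:  [2, D] (normal, abnormal) text embeddings.
    """
    g = F.normalize(global_feat, dim=-1)
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Image-level anomaly score: probability mass on the "abnormal" prompt.
    img_score = ((g @ t.T) / temperature).softmax(dim=-1)[:, 1]    # [B]

    # Pixel-level score per patch token; reshaping and upsampling to the
    # full image resolution would happen downstream.
    pix_score = ((p @ t.T) / temperature).softmax(dim=-1)[..., 1]  # [B, N]
    return img_score, pix_score
```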
Implications and Future Directions
The implications of AnomalyCLIP are twofold. Practically, it sets a new benchmark for zero-shot anomaly detection by achieving strong performance without any domain-specific fine-tuning. Theoretically, it challenges the conventional paradigm of anomaly detection by decoupling object-specific knowledge from the detection process, opening avenues for further exploration in unsupervised and self-supervised learning.
For future work, the authors suggest broadening the auxiliary data used for prompt learning to further improve the model's generalization and robustness. Additionally, extending AnomalyCLIP to modalities beyond image and text, such as audio, could foster cross-modal anomaly detection systems, paving the way toward more holistic and robust AI applications.
In conclusion, "AnomalyCLIP" makes significant strides in zero-shot anomaly detection by rethinking how prompts are used to identify anomalies across diverse datasets, while remaining broadly applicable and efficient. The work charts a promising course for the intersection of anomaly detection and large pre-trained models, and it is likely to catalyze further advances in the field.