AdaCLIP: A Novel Framework for Zero-Shot Anomaly Detection Utilizing Hybrid Learnable Prompts
The paper presents AdaCLIP, an innovation in zero-shot anomaly detection (ZSAD) that enhances the CLIP vision-language model (VLM). The method introduces hybrid learnable prompts, combining static and dynamic variants, to adapt CLIP for detecting anomalies in unseen image categories. This design enables ZSAD by exploiting annotated auxiliary anomaly detection data, with no examples from the target domain required during training.
Technical Summary and Methodological Advances
AdaCLIP extends CLIP's capabilities through prompt adaptation, augmenting the pre-trained VLM with static and dynamic learnable prompts to tailor it for anomaly detection across industrial and medical domains. Static prompts are universal tokens shared across all images and optimized during training to capture general anomaly characteristics, while dynamic prompts are generated per test image, conditioning the model's response on image-specific features. This combination, referred to as hybrid prompts, is the source of AdaCLIP's superior ZSAD performance.
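A minimal sketch of the hybrid-prompt idea follows; the module name, prompt counts, and embedding sizes are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of hybrid (static + dynamic) learnable prompts.
# All names and dimensions are illustrative assumptions, not AdaCLIP's
# actual implementation.
import torch
import torch.nn as nn

class HybridPrompts(nn.Module):
    def __init__(self, embed_dim=768, n_static=4, n_dynamic=4):
        super().__init__()
        # Static prompts: learnable tokens shared across all images,
        # optimized on annotated auxiliary data during training.
        self.static_prompts = nn.Parameter(torch.randn(n_static, embed_dim) * 0.02)
        # Dynamic prompt generator: maps a test image's global embedding
        # to image-conditioned prompt tokens.
        self.dynamic_proj = nn.Linear(embed_dim, n_dynamic * embed_dim)
        self.n_dynamic = n_dynamic
        self.embed_dim = embed_dim

    def forward(self, image_embedding):
        # image_embedding: (batch, embed_dim), e.g. CLIP's global image feature.
        b = image_embedding.shape[0]
        dynamic = self.dynamic_proj(image_embedding).view(b, self.n_dynamic, self.embed_dim)
        static = self.static_prompts.unsqueeze(0).expand(b, -1, -1)
        # Concatenate training-shared static prompts with per-image dynamic ones.
        return torch.cat([static, dynamic], dim=1)  # (batch, n_static + n_dynamic, embed_dim)
```

In this reading, the static tokens carry anomaly knowledge distilled from the auxiliary training data, while the dynamic tokens let the prompt adapt to each unseen test image at inference time.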
Key Contributions:
- Hybrid Prompting Mechanism: AdaCLIP integrates static and dynamic learnable prompts within the CLIP framework, enhancing anomaly detection by adapting to both the data observed during training and novel test data.
- Use of Auxiliary Data: The model leverages diverse auxiliary datasets, demonstrating the importance of varied training data to boost the model's ability to generalize across different application domains.
- Projection and Semantic Fusion Enhancements: AdaCLIP adds a projection layer that aligns patch embeddings with text embeddings and proposes a Hybrid Semantic Fusion (HSF) module, which aggregates region-level anomaly information into an image-level anomaly score (see the sketch after this list).
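To illustrate how projected patch embeddings can be scored against normal/abnormal text embeddings and fused into an image-level score, here is a simplified sketch. It is an assumption-laden approximation: the actual HSF module clusters region-level anomaly information rather than taking a simple top-k mean, and all function and parameter names here are hypothetical.

```python
# Simplified sketch of patch-text alignment plus region-to-image score fusion.
# This approximates the projection + HSF idea; AdaCLIP's real HSF clusters
# regions rather than averaging the top-k patch scores.
import torch
import torch.nn.functional as F

def anomaly_scores(patch_embeds, text_embeds, proj, top_k=16):
    """
    patch_embeds: (batch, n_patches, vis_dim) -- CLIP patch features
    text_embeds:  (2, txt_dim)                -- [normal, abnormal] text features
    proj:         nn.Linear(vis_dim, txt_dim) -- aligns patch and text spaces
    """
    patches = F.normalize(proj(patch_embeds), dim=-1)
    texts = F.normalize(text_embeds, dim=-1)
    # Per-patch probabilities over {normal, abnormal} from cosine similarity.
    logits = patches @ texts.t()                   # (batch, n_patches, 2)
    pixel_scores = logits.softmax(dim=-1)[..., 1]  # abnormal probability per patch
    # Image-level score: fuse the most anomalous regions.
    image_score = pixel_scores.topk(top_k, dim=-1).values.mean(dim=-1)
    return pixel_scores, image_score
```

The key design point this sketch captures is that image-level scoring is driven by aggregated region-level evidence, rather than by the global image embedding alone.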
Results
The robustness of AdaCLIP is substantiated through extensive experiments on 14 datasets spanning industrial and medical domains, where it achieves state-of-the-art (SOTA) results. By optimizing its prompts on annotated auxiliary data, it consistently outperforms existing ZSAD methods, demonstrating superior generalization to unseen categories. AdaCLIP improves on comparable methods, such as WinCLIP and APRIL-GAN, by notable margins in both image- and pixel-level anomaly detection metrics.
Implications and Future Work
AdaCLIP's methodology carries several theoretical and practical implications. The incorporation of hybrid prompts into CLIP demonstrates the efficacy of prompt learning for adapting VLMs to specific tasks, motivating further exploration of multimodal prompt learning. Practically, AdaCLIP's ability to detect anomalies without exemplars from the target domain positions it as a pivotal innovation in fields like industrial inspection and medical diagnostics, where rapid deployment across varying contexts is crucial.
For future work, the authors point to refining the dynamic prompt generation process and to deeper contextual and functional integration of auxiliary data. Additionally, designing text prompts that capture finer-grained normal-versus-abnormal semantics in specific domains may further enhance AdaCLIP's efficacy.
AdaCLIP marks a significant advancement in ZSAD, showcasing how learnable prompts can effectively adapt strong VLM backbones, like CLIP, for specialized anomaly detection. The framework's promising results across diverse domains open avenues for further development of scalable, adaptable anomaly detection solutions in practical settings.