Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization (2308.15939v2)

Published 30 Aug 2023 in cs.CV

Abstract: Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision. Recent studies attempt to use CLIP for zero-shot anomaly detection by matching images against normal- and abnormal-state prompts. However, since CLIP focuses on building correspondence between paired text prompts and global image-level representations, the lack of fine-grained patch-level vision-to-text alignment limits its capability for precise visual anomaly localization. In this work, we propose AnoCLIP for zero-shot anomaly localization. In the visual encoder, we introduce a training-free value-wise attention mechanism to extract intrinsic local tokens of CLIP for patch-level local description. From the perspective of text supervision, we design a unified, domain-aware contrastive state-prompting template for fine-grained vision-language matching. On top of the proposed AnoCLIP, we further introduce a test-time adaptation (TTA) mechanism to refine visual anomaly localization results, where we optimize a lightweight adapter in the visual encoder using AnoCLIP's pseudo-labels and noise-corrupted tokens. With both AnoCLIP and TTA, we substantially exploit the potential of CLIP for zero-shot anomaly localization and demonstrate the effectiveness of AnoCLIP on various datasets.
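The abstract describes two core mechanisms only at a high level: training-free value-wise attention for local patch tokens, and matching those tokens against contrastive normal/abnormal state prompts. The PyTorch sketch below illustrates one plausible reading of these two steps; it is not the authors' implementation, and the tensor shapes, temperature `tau`, and helper names are assumptions for illustration.

```python
# Minimal sketch (assumed shapes/names, not the authors' code) of two AnoCLIP ideas:
# (1) value-wise ("V-V") attention over a frozen CLIP ViT layer's value tokens,
# (2) scoring each patch token against normal/abnormal state-prompt embeddings.

import torch
import torch.nn.functional as F

def value_wise_attention(v: torch.Tensor, scale: float) -> torch.Tensor:
    """Replace the usual Q-K attention with V-V attention: each patch token
    aggregates values from visually similar patches, keeping local semantics.
    Training-free because it reuses the frozen value projection's outputs."""
    attn = (v @ v.transpose(-2, -1)) * scale   # (N, N) value-to-value similarity
    attn = attn.softmax(dim=-1)
    return attn @ v                            # (N, D) locally re-weighted tokens

def anomaly_map(patch_tokens, text_normal, text_abnormal, h, w, tau=0.07):
    """Score each patch against a paired normal/abnormal prompt embedding;
    the softmax over the two states gives a per-patch anomaly probability."""
    patch = F.normalize(patch_tokens, dim=-1)                                 # (N, D)
    states = F.normalize(torch.stack([text_normal, text_abnormal]), dim=-1)   # (2, D)
    logits = patch @ states.t() / tau                                         # (N, 2)
    prob_abnormal = logits.softmax(dim=-1)[:, 1]                              # (N,)
    return prob_abnormal.reshape(h, w)

# Toy usage with random tensors standing in for frozen CLIP features.
N, D, h, w = 14 * 14, 512, 14, 14
v = torch.randn(N, D)                             # value tokens of one ViT layer
local_tokens = value_wise_attention(v, scale=D ** -0.5)
t_norm, t_abn = torch.randn(D), torch.randn(D)    # embeddings of the state prompts
amap = anomaly_map(local_tokens, t_norm, t_abn, h, w)
print(amap.shape)  # torch.Size([14, 14]); upsample to image size for localization
```

The paper's TTA step would then treat thresholded values of such a map as pseudo-labels and optimize a lightweight adapter in the visual encoder at test time; that refinement stage is omitted from this sketch.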

Authors (4)
  1. Hanqiu Deng (9 papers)
  2. Zhaoxiang Zhang (161 papers)
  3. Jinan Bao (3 papers)
  4. Xingyu Li (104 papers)
Citations (1)