Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection (2404.09654v2)

Abstract: Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and they prioritize global image-level representations over the local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts by leveraging the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec and 8.9% on VisA compared to state-of-the-art approaches.
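
The prompt-pairing idea the abstract describes (comparing an image against text for normal and abnormal conditions) can be illustrated with a minimal sketch. This is not ALFA itself: it shows only the generic zero-shot scoring step, assuming image, patch, and prompt embeddings have already been produced by some vision-language encoder. The function names `anomaly_score` and `anomaly_map`, the prompt averaging, and the temperature value 0.07 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def anomaly_score(image_emb, normal_prompt_embs, abnormal_prompt_embs,
                  temperature=0.07):
    """Return the softmax probability that an embedding matches the abnormal prompts.

    image_emb:            (d,) embedding of an image or of one patch
    normal_prompt_embs:   (n, d) embeddings of prompts describing normal states
    abnormal_prompt_embs: (m, d) embeddings of prompts describing anomalies
    temperature:          assumed softmax temperature (hypothetical default)
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    # Average the unit-length prompt embeddings of each class, then renormalize.
    normal = l2_normalize(
        l2_normalize(np.asarray(normal_prompt_embs, dtype=float)).mean(axis=0))
    abnormal = l2_normalize(
        l2_normalize(np.asarray(abnormal_prompt_embs, dtype=float)).mean(axis=0))
    logits = np.array([img @ normal, img @ abnormal]) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1])


def anomaly_map(patch_embs, normal_prompt_embs, abnormal_prompt_embs,
                temperature=0.07):
    """Apply the same scoring to every patch embedding to get a coarse map."""
    return np.array([
        anomaly_score(p, normal_prompt_embs, abnormal_prompt_embs, temperature)
        for p in np.asarray(patch_embs, dtype=float)
    ])
```

Scoring each patch embedding separately, as `anomaly_map` does, is only a crude stand-in for local alignment; the abstract's fine-grained aligner and per-image contextual scoring of prompts are separate components that this sketch does not attempt to reproduce.
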

Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu, Beng Chin Ooi