FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization (2501.10067v1)

Published 17 Jan 2025 in cs.CV

Abstract: Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes LLMs to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.

Summary

The paper introduces FiLo++, a zero-/few-shot anomaly detection method combining LLM-generated fine-grained descriptions and a deformable localization module.
FiLo++ uses Fused Fine-Grained Descriptions (FusDes), built from LLM-generated anomaly descriptions, and Deformable Localization (DefLoc) for precise spatial localization.
FiLo++ reaches 84.5% image-level AUC and 96.2% pixel-level AUC in the zero-shot setting on VisA, and its text-driven design aims to improve interpretability and applicability in data-scarce fields.

Zero-/Few-Shot Anomaly Detection: Enhancements through FiLo++

The paper "FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization" proposes an anomaly detection (AD) method that reduces reliance on large sets of normal samples from the target class. It targets a limitation of traditional anomaly detection in cold-start scenarios, where data from the target class is sparse or unavailable at the outset.

Core Concepts

FiLo++ uses LLMs to generate detailed textual anomaly descriptions tailored to specific object categories, for both zero- and few-shot anomaly detection. Its two key components are:

Fused Fine-Grained Descriptions (FusDes): This component uses LLMs to generate detailed anomaly descriptions for each object category. It combines fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions.
Deformable Localization (DefLoc): This component integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module to localize anomalies accurately, irrespective of their shape or size.
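To make the prompt side of these components concrete, the following is a minimal sketch of how fine-grained descriptions might be fused with a fixed template, extended with position phrases, and then filtered. Everything here is an assumption for illustration: the template wording, the function names `build_prompts` and `filter_prompts`, the position phrases, and the top-k filtering rule are hypothetical, since this summary does not give the paper's templates or its filtering criterion. Learnable templates are omitted because they live in embedding space rather than as strings.

```python
from itertools import product

# Hypothetical fixed template; the paper's actual templates are not given here.
FIXED_TEMPLATE = "A photo of a {category} with {description}{position}."


def build_prompts(category, descriptions, positions=("",)):
    """Combine the fixed template with every (description, position) pair."""
    prompts = []
    for description, position in product(descriptions, positions):
        # An empty position yields a plain prompt without a location phrase.
        suffix = f" at the {position}" if position else ""
        prompts.append(
            FIXED_TEMPLATE.format(
                category=category, description=description, position=suffix
            )
        )
    return prompts


def filter_prompts(prompts, scores, keep=2):
    """Keep the `keep` highest-scoring prompts (assumed filtering rule)."""
    ranked = sorted(zip(scores, prompts), key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in ranked[:keep]]


# Stand-in for LLM output: descriptions of plausible anomalies for one category.
descriptions = ["a scratch", "a missing pin", "discoloration"]

plain_prompts = build_prompts("capacitor", descriptions)
positioned_prompts = build_prompts(
    "capacitor", descriptions, positions=("", "top left")
)

# Hypothetical relevance scores, e.g. image-text similarities for one test image.
kept = filter_prompts(plain_prompts, [0.31, 0.12, 0.27], keep=2)
for prompt in kept:
    print(prompt)
```

In a real pipeline the strings would be encoded by a text encoder and the scores would come from the model at inference time; the sketch only shows the bookkeeping around them.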

Results and Implications

FiLo++ improves anomaly detection performance on multiple datasets. For instance, it achieves an image-level AUC of 84.5% and a pixel-level AUC of 96.2% in the zero-shot setting on the VisA dataset. These metrics reflect its effectiveness both in classifying images as normal or anomalous and in delineating the extent of anomalous regions.
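For readers unfamiliar with the two metrics, here is a small sketch of how image-level and pixel-level AUC can be computed, assuming "AUC" means area under the ROC curve, as is common in anomaly detection benchmarks. The function name `auroc` and the toy scores are illustrative and are not taken from the paper. The pixel-level variant is the same computation applied to flattened anomaly maps and ground-truth masks.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney form).

    labels: 1 = anomalous, 0 = normal. The result is the probability that a
    random anomalous sample scores higher than a random normal one, with
    ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Image-level: one anomaly score and one label per image (toy values).
image_auc = auroc([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0])

# Pixel-level: flatten the anomaly map and the ground-truth mask first.
anomaly_map = [[0.1, 0.8], [0.2, 0.9]]
mask = [[0, 1], [0, 1]]
flat_scores = [s for row in anomaly_map for s in row]
flat_labels = [y for row in mask for y in row]
pixel_auc = auroc(flat_scores, flat_labels)

print(image_auc, pixel_auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; for full-resolution pixel-level evaluation a sort-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.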

The approach improves the interpretability and flexibility of anomaly detection systems. By using LLMs, FiLo++ produces textual descriptions that align more closely with the visual traits of anomalies, supporting zero-shot detection without task-specific training data from the target class. Similarly, the deformable localization components let the method move beyond simple patch-level matching, improving localization of anomalies with varying shapes and sizes.
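As a point of reference, the simple patch-level image-text matching that DefLoc is meant to improve on can be sketched as follows. This is the generic baseline, not the MDCI module: each patch embedding is compared with a "normal" and an "abnormal" text embedding, and a two-way softmax gives a per-patch anomaly probability. The two-dimensional embeddings, the function names, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def abnormal_probability(patch, normal_text, abnormal_text, temperature=0.07):
    """Two-way softmax over similarities to the normal/abnormal text embeddings."""
    logits = [
        cosine(patch, normal_text) / temperature,
        cosine(patch, abnormal_text) / temperature,
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    return exps[1] / sum(exps)


def patch_anomaly_map(patches, normal_text, abnormal_text):
    """patches is a grid (list of rows) of patch embeddings."""
    return [
        [abnormal_probability(p, normal_text, abnormal_text) for p in row]
        for row in patches
    ]


def image_score(amap):
    """Image-level score taken as the maximum patch probability."""
    return max(max(row) for row in amap)


# Toy 2x2 grid of 2-D patch embeddings; real embeddings come from a vision encoder.
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[0.1, 1.0], [1.0, 1.0]],
]

amap = patch_anomaly_map(patches, normal_text, abnormal_text)
print(amap, image_score(amap))
```

Because every patch is scored independently on a fixed grid, this baseline has no way to adapt to an anomaly's shape or scale, which is the limitation the text above attributes to simple patch-level matching.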

Theoretical and Practical Implications

FiLo++ demonstrates the synergy between LLMs and multimodal vision models for anomaly detection. The practical impact could span domains from manufacturing defect inspection to medical image analysis, where anomalies may lack predefined training sets. Future work could further exploit the flexibility of LLMs and vision foundation models, leading to systems that need even less input data to achieve high-performance anomaly detection.

The method's potential to generalize across datasets is promising for industries that face rapidly changing conditions or need anomaly detection immediately upon deployment.

The paper presents empirical findings that contribute to anomaly detection research and shows how combining LLMs with vision systems can yield practical advances in machine learning applications.