ADSeeker: Knowledge-Infused Anomaly Detection
- ADSeeker is a knowledge-infused multimodal framework combining large vision-language models, a structured knowledge base, and advanced prompting to improve industrial anomaly detection.
- It employs the Q2K RAG pathway to retrieve and fuse context-aware defect information, scoring knowledge documents by cosine similarity and selecting them through KDE-based sampling and Bayesian Gaussian mixture clustering for precise anomaly reasoning.
- The framework utilizes a Hierarchical Sparse Prompt mechanism to extract subtle, type-specific image features, achieving state-of-the-art zero-shot detection performance across diverse benchmarks.
ADSeeker is a knowledge-infused multimodal framework designed to advance industrial anomaly detection (IAD) and anomaly reasoning by combining large vision-language models, structured domain knowledge, and robust prompting methods. The system addresses two major challenges in industrial inspection: the lack of domain-specific anomaly detection (AD) knowledge in multimodal large language model (MLLM) pretraining, and the lack of precise, context-aware anomaly reasoning. ADSeeker delivers a plug-and-play, retrieval-augmented assistant that achieves state-of-the-art zero-shot performance across multiple benchmark datasets and supports comprehensive defect understanding.
1. System Architecture and Objectives
ADSeeker integrates multimodal LLMs (MLLMs) with a curated visual document knowledge base and advanced prompting strategies to enable both anomaly detection and fine-grained, context-aware defect reasoning. The primary architectural pathways are as follows:
- Query Image–Knowledge Retrieval-Augmented Generation (Q2K RAG) pathway: Enables retrieval-augmented, knowledge-grounded reasoning.
- Anomaly Detection Expert module: Utilizes a Hierarchical Sparse Prompt (HSP) mechanism to extract, from visual query inputs, sparse region-level and type-level features critical for distinguishing subtle anomalies.
The system is constructed to be compatible with industrial and medical inspection tasks, focusing on providing accurate detection, localization, and detailed defect explanation in settings with limited data or complex, multi-type defects.
2. Multimodal Knowledge Base: SEEK-MVTec&VisA (SEEK-M&V)
A foundational element of ADSeeker is SEEK-M&V, a multimodal knowledge base tailored for the anomaly detection domain:
- Unlike prior resources that rely only on unstructured text, SEEK-M&V pairs semantically rich, human-authored descriptions with aligned visual exemplars for each defect type.
- Each knowledge entry includes defect typology, detailed anomaly analysis, production context, and representative images, capturing both class-level and instance-level variation.
- SEEK-M&V supports fine-grained, context-aware retrieval and is directly integrated into model reasoning via Q2K RAG.
This structured resource enables ADSeeker to overcome the domain knowledge gap typical of generalist vision-language models and improves its capacity for technically precise explanations of industrial anomalies.
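As a rough illustration of what one knowledge entry might look like in code, the sketch below models the four components listed above as a small record type. The class name `KnowledgeEntry`, its field names, the `as_prompt_text` helper, and the sample values are all hypothetical placeholders chosen for this example; the actual SEEK-M&V storage format is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KnowledgeEntry:
    """One hypothetical SEEK-M&V-style record; field names are illustrative."""
    defect_type: str
    anomaly_analysis: str
    production_context: str
    image_paths: List[str] = field(default_factory=list)

    def as_prompt_text(self) -> str:
        """Flatten the textual fields into a block that could be added to a prompt."""
        return (
            f"Defect type: {self.defect_type}\n"
            f"Analysis: {self.anomaly_analysis}\n"
            f"Production context: {self.production_context}"
        )


# Placeholder content, not taken from the real knowledge base.
entry = KnowledgeEntry(
    defect_type="scratch",
    anomaly_analysis="Thin linear mark that breaks the surface finish.",
    production_context="Often introduced during handling or transport.",
    image_paths=["examples/scratch_01.png"],
)
print(entry.as_prompt_text())
```

Keeping text and image references in one record is one simple way to return both modalities from a single retrieval hit.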
3. Query Image–Knowledge Retrieval-Augmented Generation (Q2K RAG)
Q2K RAG constitutes the knowledge retrieval and grounding backbone:
- Retrieval: For a query image, an image encoder (the CLIP encoder) computes a key feature vector, while each document in SEEK-M&V yields a lock feature. The cosine similarity between the key feature and each lock feature is then calculated.
- Selection: Kernel density estimation (KDE)-based sampling and Bayesian Gaussian Mixture Modeling are used to cluster the similarity scores and dynamically select the most relevant knowledge document(s).
- Augmentation: The retrieved knowledge, which often comprises both modality-aligned imagery and semantically rich defect descriptions, is concatenated to the Q2K RAG prompt for the MLLM, providing explicit contextual information for anomaly reasoning.
This design enables ADSeeker to condition reasoning not just on query visual content but on a dynamically selected, context-relevant slice of expert knowledge.
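The retrieval and selection steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not ADSeeker's implementation: the feature vectors are tiny hand-written lists rather than CLIP embeddings, the document names are made up, and the selection rule is a simple "cut at the largest score gap" heuristic that stands in for the KDE-based sampling and Bayesian Gaussian mixture clustering described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_documents(query_key, doc_locks):
    """Rank documents by similarity to the query and keep those above the largest gap.

    The gap rule is an assumed stand-in for the clustering-based selection.
    """
    scored = sorted(
        ((cosine_similarity(query_key, lock), name) for name, lock in doc_locks.items()),
        reverse=True,
    )
    if len(scored) < 2:
        return [name for _, name in scored]
    gaps = [scored[i][0] - scored[i + 1][0] for i in range(len(scored) - 1)]
    cut = gaps.index(max(gaps)) + 1  # keep everything before the biggest drop
    return [name for _, name in scored[:cut]]


# Toy "key" and "lock" features; real ones would come from an image encoder.
query_key = [1.0, 0.0]
doc_locks = {
    "scratch": [0.9, 0.1],
    "dent": [0.8, 0.3],
    "stain": [0.0, 1.0],
}
print(select_documents(query_key, doc_locks))  # ['scratch', 'dent']
```

The number of retrieved documents is not fixed in advance here: it depends on where the similarity scores separate, which is the property the clustering-based selection is meant to provide.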
4. Hierarchical Sparse Prompt (HSP) Mechanism and Type-Level Feature Extraction
To extract discriminative, region- and type-level information from images with diverse and subtle anomalies, ADSeeker deploys an HSP mechanism:
- Prompt Formulation: Instead of fixed object-level phrases (e.g., “photo of [obj]”), the prompt takes the form "An image with [cls] defect type", where [cls] specifies a defect type (e.g., “scratch”).
- Iterative Sparse Optimization: The prompt embedding is updated in multiple rounds:
- Given the query image's feature representation, the text prompt embedding at the current iteration, and a scaling factor, the residual is computed and the update gradient is derived from it.
- Following the Iterative Soft Thresholding Algorithm (ISTA), each prompt update takes a gradient step and then applies soft thresholding; the update is parameterized by the largest singular value and the sparsity regularization weight.
- The total loss encourages the model to attend sparsely to the most critical defect regions.
- Type-Level Guidance: The mechanism leverages type-level textual descriptors, promoting generalization even to unseen or sparsely annotated defect types.
The HSP module enables ADSeeker to robustly extract and localize subtle, heterogeneous anomalies by focusing model attention along defect-relevant dimensions.
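For readers unfamiliar with iterative soft thresholding, the sketch below runs generic ISTA on a toy problem: minimize 0.5 * ||A t - f||^2 + lam * ||t||_1 over a vector t. It is not ADSeeker's prompt objective. The matrix `A`, the target `f`, the weight `lam`, and all function names are assumptions introduced for illustration; only the overall pattern (residual, gradient, step scaled by the largest singular value, soft threshold) mirrors the description above.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def soft_threshold(values, tau):
    """Shrink each entry toward zero by tau; entries within [-tau, tau] become 0."""
    return [v - tau if v > tau else v + tau if v < -tau else 0.0 for v in values]


def largest_singular_value(matrix, iters=100):
    """Estimate sigma_max(A) by power iteration on A^T A, starting from a uniform vector."""
    at = transpose(matrix)
    n = len(matrix[0])
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iters):
        w = matvec(at, matvec(matrix, v))
        eigenvalue = math.sqrt(sum(x * x for x in w))  # ||A^T A v|| with unit v
        if eigenvalue == 0.0:
            return 0.0
        v = [x / eigenvalue for x in w]
    return math.sqrt(eigenvalue)


def ista(matrix, target, lam, steps=50):
    """Run ISTA for 0.5 * ||A t - target||^2 + lam * ||t||_1, starting from t = 0."""
    lipschitz = largest_singular_value(matrix) ** 2  # Lipschitz constant of the gradient
    t = [0.0] * len(matrix[0])
    if lipschitz == 0.0:
        return t
    at = transpose(matrix)
    for _ in range(steps):
        residual = [p - y for p, y in zip(matvec(matrix, t), target)]
        grad = matvec(at, residual)
        stepped = [ti - g / lipschitz for ti, g in zip(t, grad)]
        t = soft_threshold(stepped, lam / lipschitz)
    return t


A = [[1.0, 0.0], [0.0, 1.0]]
f = [3.0, 0.5]
print(ista(A, f, lam=1.0))
```

With the identity matrix the result is approximately [2.0, 0.0]: the large component is shrunk by `lam`, and the small one is set exactly to zero, which is the sparsifying behavior the mechanism relies on.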
5. Multi-type Anomaly Dataset (MulA)
To address the scarcity of diverse, large-scale, and well-annotated industrial anomaly data, the authors introduce MulA:
- MulA contains 11,226 RGB, grayscale, and X-ray images from 26 categories with 72 multi-scale, multi-type defect annotations.
- Annotations include high-quality defect-region masks, and extensive data augmentation (geometric transformations, noise injection, etc.) ensures robustness and diversity.
- The dataset spans industry, workshop, and general contexts, supporting both image-level anomaly detection and pixel-level localization.
MulA is presented as the most comprehensive IAD dataset to date, directly supporting the benchmarking and development of generalizable anomaly detection models.
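One practical detail when augmenting mask-annotated data is that geometric transformations must be applied to the image and its defect mask together, while intensity changes such as noise apply to the image only. The sketch below shows this with a horizontal flip and Gaussian noise on tiny nested-list "images". The functions, parameters, and sample values are illustrative assumptions and do not reproduce the MulA augmentation pipeline.

```python
import random


def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clamp to the 0-255 range."""
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


def augment(image, mask, sigma=5.0, seed=0):
    """Flip image and mask together, then perturb only the image intensities."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    flipped_image = hflip(image)
    flipped_mask = hflip(mask)  # the mask must follow every geometric change
    noisy_image = add_gaussian_noise(flipped_image, sigma, rng)
    return noisy_image, flipped_mask


# Toy 2x3 grayscale image with a bright defect in the right-hand column.
image = [[10, 20, 200], [15, 25, 210]]
mask = [[0, 0, 1], [0, 0, 1]]
aug_image, aug_mask = augment(image, mask)
print(aug_mask)  # [[1, 0, 0], [1, 0, 0]]
```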
6. Experimental Results and Zero-Shot Performance
ADSeeker achieves state-of-the-art results on a range of industrial and medical anomaly detection benchmarks:
- On industrial datasets (MVTec AD, VisA, BTAD, MPDD), ADSeeker achieves AUROC scores in the mid-90% range for image-level detection, outperforming methods such as CLIP, AnomalyCLIP, WinCLIP, and AdaCLIP.
- In anomaly reasoning (MMAD benchmark), ADSeeker (specifically, Qwen2.5-VL with SEEK-Setting) reaches ~69.9% overall accuracy, with notable improvements in both defect classification and localization, attributed to its fused anomaly priors and knowledge-grounded reasoning.
- Efficiency metrics demonstrate that ADSeeker’s augmentation modules introduce ≤27% memory overhead and an average increase of ~2 seconds in inference latency, rendering it suitable for industrial pipelines without significant performance compromise.
These experimental observations validate the hypothesis that knowledge-grounded multimodal retrieval and sparse prompt learning substantially improve both detection and reasoning capabilities in IAD, particularly in zero-shot and few-shot regimes.
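Since image-level results are reported as AUROC, a short reminder of how that metric is computed may help. The sketch below uses the pairwise (Mann-Whitney) formulation: AUROC is the fraction of (anomalous, normal) pairs in which the anomalous image receives the higher score, with ties counted as half. The labels and scores are made-up toy values, not results from the paper.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random anomaly outscores a random normal sample.

    labels: 1 for anomalous, 0 for normal. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one normal and one anomalous sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable for large test sets.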
7. Framework Significance and Implications
ADSeeker represents a methodological advance in industrial anomaly detection and inspection reasoning:
- The combination of a structured, multimodal knowledge base (SEEK-M&V), retrieval-augmented prompting (Q2K RAG), and hierarchical sparse prompt optimization closes major gaps left by traditional text-centric or unimodal anomaly detection systems.
- The plug-and-play architectural paradigm allows seamless integration with a wide range of base MLLMs and accommodates newly emerging defect types without extensive retraining, making the approach modular and scalable.
- By emphasizing context-aware, technically precise language generation underpinned by retrieved, domain-specific expertise, ADSeeker moves anomaly explanation beyond generic descriptions and towards actionable, expert-like interpretability.
A plausible implication is that frameworks like ADSeeker could set a new baseline for knowledge-grounded, multimodal inspection assistants in domains requiring expert-level anomaly detection and explanation, such as quality control, safety inspection, and medical diagnosis.