ADSeeker: Knowledge-Infused Anomaly Detection

Updated 9 August 2025
  • ADSeeker is a knowledge-infused multimodal framework combining large vision-language models, a structured knowledge base, and advanced prompting to improve industrial anomaly detection.
  • It employs the Q2K RAG pathway to retrieve and fuse context-aware defect knowledge, selecting relevant documents via cosine similarity and clustering of the similarity scores for precise anomaly reasoning.
  • The framework utilizes a Hierarchical Sparse Prompt mechanism to extract subtle, type-specific image features, achieving state-of-the-art zero-shot detection performance across diverse benchmarks.

ADSeeker is a knowledge-infused multimodal framework designed to advance the state of industrial anomaly detection (IAD) and anomaly reasoning by combining large vision-language models, structured domain knowledge, and robust prompting methods. The system directly addresses major challenges in the industrial inspection setting, such as the deficiency of domain-specific AD knowledge in MLLM pretraining and the lack of precise, context-aware anomaly reasoning. ADSeeker delivers a plug-and-play, retrieval-augmented assistant that achieves state-of-the-art zero-shot performance across multiple benchmark datasets and supports comprehensive defect understanding.

1. System Architecture and Objectives

ADSeeker integrates multimodal LLMs (MLLMs) with a curated visual document knowledge base and advanced prompting strategies to enable both anomaly detection and fine-grained, context-aware defect reasoning. The primary architectural pathways are as follows:

  • Query Image–Knowledge Retrieval-Augmented Generation (Q2K RAG) pathway: Enables retrieval-augmented, knowledge-grounded reasoning.
  • Anomaly Detection Expert module: Utilizes a Hierarchical Sparse Prompt (HSP) mechanism to extract, from visual query inputs, sparse region-level and type-level features critical for distinguishing subtle anomalies.

The system is constructed to be compatible with industrial and medical inspection tasks, focusing on providing accurate detection, localization, and detailed defect explanation in settings with limited data or complex, multi-type defects.
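
As a rough orientation, the two pathways might compose as in the following skeleton. All names, signatures, and placeholder values here are hypothetical illustrations, not the paper's API:

```python
# Hypothetical skeleton of how ADSeeker's two pathways could compose; stub components only.
from typing import Callable, List, Tuple

def retrieve_knowledge(query_image, knowledge_base) -> List[str]:
    """Q2K RAG pathway stub: return context-relevant defect documents (see Section 3)."""
    return ["<retrieved defect-type description and exemplar reference>"]

def hsp_detect(query_image) -> Tuple[float, str]:
    """Anomaly Detection Expert stub: HSP-derived anomaly score and coarse localization (see Section 4)."""
    return 0.87, "upper-left region"

def adseeker_inspect(query_image, knowledge_base, mllm_generate: Callable[[str], str]) -> str:
    """Fuse retrieved knowledge and anomaly priors into one prompt for the base MLLM."""
    docs = retrieve_knowledge(query_image, knowledge_base)
    score, region = hsp_detect(query_image)
    prompt = (
        "Retrieved knowledge:\n" + "\n".join(docs)
        + f"\nAnomaly prior: score={score:.2f}, region={region}\n"
        + "Task: decide whether the part is defective, localize the defect, and explain it."
    )
    return mllm_generate(prompt)
```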

2. Multimodal Knowledge Base: SEEK-MVTec&VisA (SEEK-M&V)

A foundational element of ADSeeker is SEEK-M&V, a multimodal knowledge base tailored for the anomaly detection domain:

  • Unlike prior resources that rely only on unstructured text, SEEK-M&V fuses semantically rich, human-authored descriptions with aligned visual exemplars for each defect type.
  • Each knowledge entry includes defect typology, detailed anomaly analysis, production context, and representative images, capturing both class-level and instance-level variation.
  • SEEK-M&V supports fine-grained, context-aware retrieval and is directly integrated into model reasoning via Q2K RAG.

This structured resource enables ADSeeker to overcome the domain knowledge gap typical in generalist vision–LLMs and improves the capacity for technically precise explanations of industrial anomalies.
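
As a concrete illustration, a knowledge entry of this kind might be represented as follows. The field names and example values are hypothetical; the paper does not specify a schema:

```python
# Hypothetical schema for a SEEK-M&V-style knowledge entry; field names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DefectKnowledgeEntry:
    category: str                 # inspected object class, e.g. "metal_nut"
    defect_type: str              # defect typology, e.g. "scratch"
    anomaly_analysis: str         # detailed, human-authored description of the anomaly
    production_context: str       # how/where this defect tends to arise in production
    exemplar_images: List[str] = field(default_factory=list)  # paths to aligned visual exemplars

entry = DefectKnowledgeEntry(
    category="metal_nut",
    defect_type="scratch",
    anomaly_analysis="Shallow linear abrasion on the upper surface, typically a few millimetres long.",
    production_context="Often introduced during automated handling between machining stages.",
    exemplar_images=["seek_mv/metal_nut/scratch/000.png"],
)
```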

3. Query Image–Knowledge Retrieval-Augmented Generation (Q2K RAG)

Q2K RAG constitutes the knowledge retrieval and grounding backbone:

  • Retrieval: For a query image $I_q$, an image encoder (a CLIP encoder) computes a key feature vector $K_Q$, while each document $D_i$ in SEEK-M&V yields a lock feature $L_i$. The cosine similarity $S_i = \cos(K_Q, L_i)$ is calculated.
  • Selection: KDE-based sampling and Bayesian Gaussian Mixture Modeling are used to cluster the similarity scores and dynamically select the most relevant knowledge document(s) $L_{ans}$.
  • Augmentation: The retrieved knowledge, often comprising both modality-aligned imagery and semantically rich defect descriptions, is concatenated to the Q2K RAG prompt for the MLLM, providing explicit, context-highlighting information for anomaly reasoning.

This design enables ADSeeker to condition reasoning not just on query visual content but on a dynamically selected, context-relevant slice of expert knowledge.
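
A minimal retrieval sketch under these assumptions could look like the following. It uses generic feature vectors in place of CLIP embeddings and scikit-learn's BayesianGaussianMixture for the clustering step, and omits the KDE-based sampling; it is not the authors' implementation:

```python
# Illustrative Q2K retrieval: cosine similarity between a query "key" feature and document
# "lock" features, then a Bayesian GMM over the similarity scores to pick the most relevant
# document(s). A sketch only, not the paper's code.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def q2k_retrieve(key_feat: np.ndarray, lock_feats: np.ndarray, max_components: int = 3) -> np.ndarray:
    # Cosine similarity S_i = cos(K_Q, L_i)
    key = key_feat / np.linalg.norm(key_feat)
    locks = lock_feats / np.linalg.norm(lock_feats, axis=1, keepdims=True)
    sims = locks @ key                                   # shape: (num_documents,)

    # Cluster the similarity scores and keep documents in the highest-mean cluster.
    gmm = BayesianGaussianMixture(n_components=max_components, random_state=0)
    labels = gmm.fit_predict(sims.reshape(-1, 1))
    best_cluster = int(np.argmax(gmm.means_.ravel()))
    selected = np.where(labels == best_cluster)[0]
    return selected[np.argsort(sims[selected])[::-1]]    # document indices, most similar first

# Example with random stand-in features (a real system would use CLIP image/document embeddings).
rng = np.random.default_rng(0)
doc_ids = q2k_retrieve(rng.normal(size=512), rng.normal(size=(40, 512)))
```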

4. Hierarchical Sparse Prompt (HSP) Mechanism and Type-Level Feature Extraction

To extract discriminative, region- and type-level information from images with diverse and subtle anomalies, ADSeeker deploys an HSP mechanism:

  • Prompt Formulation: Instead of fixed object-level phrases (e.g., “photo of [obj]”), the prompt takes the form "An image with [cls] defect type", where [cls] specifies a defect type (e.g., “scratch”).
  • Iterative Sparse Optimization: The prompt embedding is updated in multiple rounds:
    • Let $E_q$ be the query image's feature representation, $E_l^{(n-1)}$ the text prompt embedding at iteration $n-1$, and $P_n$ a scaling factor. The residual $R_n = E_q - P_n E_l^{(n-1)}$ is computed, and the update gradient $G_n$ is derived.
    • Using principles from the Iterative Soft Thresholding Algorithm, the prompt update is $P_n^* = \mathrm{Sign}\big(P_n - G_n/\sigma_{\max}(E_l^T E_l)\big) \cdot \big(|P_n - G_n/\sigma_{\max}(E_l^T E_l)| - \lambda/G_n\big) \cdot E_l$, where $\sigma_{\max}$ is the largest singular value and $\lambda$ the sparsity regularization.
    • The total loss is $\mathcal{L} = \min_P \tfrac{1}{2}\|E_q - P_n^* E_l\|_2^2 + \lambda\|P_n^*\|_1$, encouraging the model to sparsely attend to the most critical defect regions.
  • Type-Level Guidance: The mechanism leverages type-level textual descriptors, promoting generalization even to unseen or sparsely annotated defect types.

The HSP module enables ADSeeker to robustly extract and localize subtle, heterogeneous anomalies by focusing model attention along defect-relevant dimensions.
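
The soft-thresholding core of such an iterative update can be sketched in a few lines. This is a simplified, standalone ISTA solver for the loss above with generic stand-in embeddings, not the exact HSP implementation:

```python
# Simplified ISTA-style sparse update for  min_P 0.5*||E_q - P E_l||_2^2 + lam*||P||_1.
# A standalone numerical sketch, not the HSP module itself.
import numpy as np

def soft_threshold(x: np.ndarray, tau: float) -> np.ndarray:
    # Sign(x) * max(|x| - tau, 0): the proximal operator of the L1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_prompt_update(E_q: np.ndarray, E_l: np.ndarray, lam: float = 0.1, n_iter: int = 50) -> np.ndarray:
    # E_l: (k, d) type-level text prompt embeddings; E_q: (d,) query image feature.
    P = np.zeros(E_l.shape[0])
    # Step size 1/L, with L the largest eigenvalue of the Gram matrix E_l E_l^T.
    L = np.linalg.eigvalsh(E_l @ E_l.T).max()
    for _ in range(n_iter):
        residual = E_q - P @ E_l          # R_n = E_q - P_n E_l
        grad = -E_l @ residual            # gradient G_n of the quadratic term
        P = soft_threshold(P - grad / L, lam / L)
    return P                              # sparse weights over defect-type prompts

# Example with random stand-ins for CLIP-style embeddings.
rng = np.random.default_rng(0)
weights = ista_prompt_update(rng.normal(size=512), rng.normal(size=(8, 512)))
```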

5. Multi-type Anomaly Dataset (MulA)

To address the scarcity of diverse, large-scale, and well-annotated industrial anomaly data, the authors introduce MulA:

  • MulA contains 11,226 RGB, grayscale, and X-ray images from 26 categories with 72 multi-scale, multi-type defect annotations.
  • Annotations include high-quality defect-region masks, and extensive data augmentation (geometric transformations, noise injection, etc.) ensures robustness and diversity.
  • The dataset spans industry, workshop, and general contexts, supporting both image-level anomaly detection and pixel-level localization.

The authors describe MulA as the most comprehensive IAD dataset to date, directly supporting the benchmarking and development of generalizable anomaly detection models.
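
A minimal sketch of the kind of augmentation described (geometric transformations plus noise injection), applying identical geometric transforms to the image and its defect mask, might look as follows. This is an assumption about the pipeline, not the authors' code:

```python
# Illustrative augmentation for (image, mask) pairs: geometric transforms applied identically
# to both, plus noise injection on the image only. A sketch, not MulA's actual pipeline.
import numpy as np

def augment(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    # Random horizontal/vertical flips, applied to image and defect-region mask together.
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]
    # Random multiple-of-90-degree rotation keeps annotations pixel-aligned.
    k = int(rng.integers(0, 4))
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    # Additive Gaussian noise on the image only; the mask stays binary.
    noisy = image.astype(np.float32) + rng.normal(0.0, 5.0, size=image.shape)
    return np.clip(noisy, 0, 255).astype(image.dtype), mask

rng = np.random.default_rng(0)
img, msk = augment(np.zeros((256, 256), dtype=np.uint8), np.zeros((256, 256), dtype=np.uint8), rng)
```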

6. Experimental Results and Zero-Shot Performance

ADSeeker achieves state-of-the-art results on a range of industrial and medical anomaly detection benchmarks:

  • On industrial datasets (MVTec AD, VisA, BTAD, MPDD), ADSeeker achieves AUROC scores in the mid-90% range for image-level detection, outperforming methods such as CLIP, AnomalyCLIP, WinCLIP, and AdaCLIP.
  • In anomaly reasoning (MMAD benchmark), ADSeeker (specifically, Qwen2.5-VL with SEEK-Setting) reaches ~69.9% overall accuracy, with notable improvements in both defect classification and localization, attributed to its fused anomaly priors and knowledge-grounded reasoning.
  • Efficiency metrics demonstrate that ADSeeker’s augmentation modules introduce ≤27% memory overhead and an average increase of ~2 seconds in inference latency, rendering it suitable for industrial pipelines without significant performance compromise.

These experimental observations validate the hypothesis that knowledge-grounded multimodal retrieval and sparse prompt learning substantially improve both detection and reasoning capabilities in IAD, particularly in zero-shot and few-shot regimes.

7. Framework Significance and Implications

ADSeeker represents a methodological advance in industrial anomaly detection and inspection reasoning:

  • The hybridization of structured, multimodal knowledge bases (SEEK-M&V), retrieval-augmented prompting (Q2K RAG), and hierarchical sparse prompt optimization closes major gaps left by traditional, text-centric or unimodal anomaly detection systems.
  • The plug-and-play architectural paradigm allows seamless integration with a wide range of base MLLMs and accommodates newly emerging defect types without extensive retraining, making the approach modular and scalable.
  • By emphasizing context-aware, technically precise language generation underpinned by retrieved, domain-specific expertise, ADSeeker moves anomaly explanation beyond generic descriptions and towards actionable, expert-like interpretability.

A plausible implication is that frameworks like ADSeeker could set a new baseline for knowledge-grounded, multimodal inspection assistants in domains requiring expert-level anomaly detection and explanation, such as quality control, safety inspection, and medical diagnosis.