FoundAD: Few-shot Anomaly Detection Framework

Updated 25 March 2026
  • FoundAD is a few-shot anomaly detection framework that uses latent feature deviations to identify out-of-distribution regions.
  • It employs a lightweight 6-layer Vision Transformer projection operator on self-supervised encoder embeddings for efficient multi-class anomaly detection.
  • Benchmark results on datasets like MVTec-AD and VisA demonstrate state-of-the-art performance with minimal labeled data and rapid convergence.

FoundAD is a few-shot anomaly detection framework leveraging foundation visual encoders to identify out-of-distribution regions in industrial and general visual datasets. It eschews pixel reconstruction and generative modeling, instead exploiting the feature manifold hypothesized to be learned by large-scale self-supervised transformers such as DINOv3. A distinct element is the use of a lightweight nonlinear projection operator that is trained in latent space, enabling competitive and parameter-efficient multi-class anomaly detection with minimal labeled data and no reliance on text inputs or large generative models (Zhai et al., 2 Oct 2025).

1. Theoretical Rationale and Conceptual Overview

FoundAD is predicated on the assumption that foundation visual encoders, particularly self-supervised vision transformers, develop a structured manifold of normal images in their embedding space. Deviations from this manifold, i.e., differences between an input’s encoding and its projection onto the manifold, correlate with the presence and extent of anomalies. FoundAD therefore formulates anomaly detection purely in terms of operations over latent representations:

  • No reconstruction of input pixels is performed.
  • Anomaly scoring is a function of embedding-space discrepancies.

The core technical novelty is a trainable, nonlinear projection operator (a 6-layer ViT) that 'denoises' potentially anomalous embeddings, facilitating few-shot adaptation and robust detection with modest supervision.

2. Workflow and Mathematical Formulation

The FoundAD pipeline comprises both a training phase incorporating anomaly synthesis and a test-time inference procedure.

Key Pipeline Steps

  1. Input: A small set of $K$ defect-free images $\{I_r^1,\dots,I_r^K\}$.
  2. Anomaly Synthesis: Creation of synthetic anomalous images $I_s$ from each $I_r$ using CutPaste applied to foreground patches determined by adaptive thresholding.
  3. Encoding: Extraction of patch-wise embeddings via a frozen foundation encoder $\theta$, yielding $f_r = \theta(I_r)$ and $f_s = \theta(I_s)$ with $f_r, f_s \in \mathbb{R}^{N \times d}$ ($N=1024$ patches for a $32\times32$ grid, $d=768$ for DINOv3 ViT-B).
  4. Projection: Application of the nonlinear projector $P$ (parameterized by $\phi$) to map $f_s$ toward normality: $f_r^* = P(f_s)$.
  5. Training Objective: Patch-wise $L^2$ loss $\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \|f_{r,i}^* - f_{r,i}\|_2^2$, plus optional weight decay.
  6. Inference: For any test image $I_a$, extract $f_a = \theta(I_a)$, compute $f_a^* = P(f_a)$, and obtain patch-level anomaly scores $S_{\mathrm{patch},i} = \|f_{a,i} - f_{a,i}^*\|_2^2$.
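
The anomaly synthesis in step 2 can be illustrated with a minimal pure-Python sketch, assuming a grayscale image stored as a list of rows. The helper names `foreground_mask` and `cutpaste` are hypothetical, and the global mean threshold is only a stand-in for the adaptive thresholding mentioned above, whose exact form is not specified here. The released code may well differ.

```python
import random


def foreground_mask(image):
    """Mark pixels brighter than the mean intensity as foreground.

    Assumption: a global mean threshold stands in for adaptive thresholding.
    """
    pixels = [v for row in image for v in row]
    threshold = sum(pixels) / len(pixels)
    return [[v > threshold for v in row] for row in image]


def cutpaste(image, mask, patch_h, patch_w, rng):
    """Copy one foreground patch and paste it over another foreground location.

    Returns a new image; the input image is left unchanged.
    """
    h, w = len(image), len(image[0])
    # Top-left corners that lie on the foreground and leave room for the patch.
    corners = [
        (r, c)
        for r in range(h - patch_h + 1)
        for c in range(w - patch_w + 1)
        if mask[r][c]
    ]
    out = [row[:] for row in image]
    if len(corners) < 2:
        return out  # too little foreground to synthesize an anomaly
    (src_r, src_c), (dst_r, dst_c) = rng.sample(corners, 2)
    for i in range(patch_h):
        for j in range(patch_w):
            # Read from the original so overlapping patches copy cleanly.
            out[dst_r + i][dst_c + j] = image[src_r + i][src_c + j]
    return out


# Toy usage: a 6x6 gradient image and a 2x2 patch.
normal = [[r * 6 + c for c in range(6)] for r in range(6)]
synthetic = cutpaste(normal, foreground_mask(normal), 2, 2, random.Random(0))
```

The pair (`normal`, `synthetic`) plays the role of $(I_r, I_s)$: the projector is trained to map the embedding of the synthetic image back toward the embedding of the normal one.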

An aggregate image-level anomaly score is computed as the mean of the top-$K$ patch anomaly scores, where $K$ here is the aggregation parameter rather than the number of training images. Pixel-level heatmaps are generated by upsampling these patch-wise scores.
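
The scoring rule can be written out as a small pure-Python sketch, with embeddings represented as lists of per-patch vectors. The function names `patch_scores` and `image_score` are illustrative, and a practical implementation would use batched tensor operations instead.

```python
def patch_scores(f_a, f_a_star):
    """Squared L2 distance between each patch embedding and its projection."""
    return [
        sum((x - y) ** 2 for x, y in zip(patch, projected))
        for patch, projected in zip(f_a, f_a_star)
    ]


def image_score(scores, k):
    """Mean of the k largest patch scores."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# Toy example: 4 patches with 2-dimensional embeddings.
f_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
f_a_star = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
scores = patch_scores(f_a, f_a_star)
print(scores)                  # [0.0, 1.0, 8.0, 0.0]
print(image_score(scores, 2))  # 4.5
```

Averaging only the largest scores keeps a small defect from being diluted by the many normal patches around it.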

Training and Inference Algorithms

| Step | Training | Inference |
|---|---|---|
| Encoding | $f_r \leftarrow \theta(I_r)$, $f_s \leftarrow \theta(I_s)$ | $f_a \leftarrow \theta(I_a)$ |
| Projection | $f_r^* \leftarrow P_\phi(f_s)$ | $f_a^* \leftarrow P_\phi(f_a)$ |
| Loss/Score | $\mathcal{L} = (1/N)\,\lVert f_r^* - f_r \rVert_2^2$ | $S_{\mathrm{patch},i} = \lVert f_a[i] - f_a^*[i] \rVert_2^2$ |
| Output | Update $\phi$ (Adam, weight decay $10^{-4}$) | $S_{\mathrm{image}} =$ mean of top-$K$ $S_{\mathrm{patch}}$; upsampled heatmap |
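
The heatmap output of the inference column can be sketched as follows. The interpolation method is not stated here, so this example assumes nearest-neighbor upsampling purely for brevity (bilinear interpolation would be a common alternative), and the helper names `to_grid` and `upsample_nearest` are hypothetical.

```python
def to_grid(scores, side):
    """Reshape a flat list of side*side patch scores into rows."""
    if len(scores) != side * side:
        raise ValueError("expected side*side scores")
    return [scores[r * side:(r + 1) * side] for r in range(side)]


def upsample_nearest(grid, factor):
    """Repeat each patch score over a factor x factor block of pixels."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(wide[:] for _ in range(factor))
    return out


# Toy example: a 2x2 patch grid upsampled by a factor of 2.
heatmap = upsample_nearest(to_grid([0.0, 1.0, 2.0, 3.0], 2), 2)
for row in heatmap:
    print(row)
```

With a $32\times32$ patch grid and patch size 16, the same call with `factor=16` would produce a $512\times512$ map.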

3. Nonlinear Projection Operator Design

The projector $P$ operates entirely in the embedding space, with the following characteristics:

  • $P\colon \mathbb{R}^{N\times d} \to \mathbb{R}^{N\times d}$, processing all image patches jointly.
  • Implementation: 6-layer Vision Transformer, 12 self-attention heads, hidden dimension $d=768$, residual connections.
  • Training: Minimization of $\mathcal{L}(\phi) = \frac{1}{N}\sum_{i=1}^N \|P_\phi(f_s)_i - f_{r,i}\|_2^2 + \lambda\|\phi\|^2$, with $\lambda=10^{-4}$.
  • All computations are performed with the encoder $\theta$ frozen.

No pixel-space reconstruction or adversarial/generative losses are involved. This latent projection paradigm enables computational efficiency and rapid convergence (often in under 10 epochs).
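
To make the training objective concrete, the sketch below evaluates it on toy values in pure Python. The name `projection_loss` and the flat `params` list are illustrative assumptions. In practice the regularization term is usually applied through the optimizer's weight-decay setting instead of being added to the loss explicitly.

```python
def projection_loss(f_proj, f_r, params, lam=1e-4):
    """Mean squared patch distance plus an L2 penalty on the parameters.

    f_proj: projected embeddings P(f_s), one vector per patch.
    f_r:    embeddings of the corresponding normal image.
    params: flat list of projector weights (hypothetical representation).
    """
    n = len(f_r)
    data_term = sum(
        sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(f_proj, f_r)
    ) / n
    reg_term = lam * sum(w * w for w in params)
    return data_term + reg_term


# Toy example: 2 patches, 2-dimensional embeddings, 2 parameters.
loss = projection_loss(
    f_proj=[[1.0, 2.0], [0.0, 0.0]],
    f_r=[[1.0, 0.0], [0.0, 1.0]],
    params=[1.0, 2.0],
)
print(loss)  # approximately 2.5005
```

Here the data term is $(4 + 1)/2 = 2.5$ and the penalty is $10^{-4}\cdot(1 + 4) = 5\times10^{-4}$.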

4. Experimental Evaluation and Benchmarking

FoundAD is validated on standard few-shot industrial and general anomaly detection datasets:

| Dataset | Classes | Images | I-AUROC | P-AUROC | PRO | Annotation |
|---|---|---|---|---|---|---|
| MVTec-AD | 15 | 5,354 | 96.1% | 96.8% | 92.8% | 1,725 anomalies |
| VisA | 12 | 10,821 | 92.6% | 99.7% | 98.0% | 1,200 anomalies |

  • Main baselines: SPADE, PatchCore, FastRecon, WinCLIP, PromptAD, AnomalySD, IIPAD (multi-class); WinCLIP, InCTRL, AnomalyCLIP, PromptAD, LogSAD (one-class).
  • Key metrics: Image-level AUROC (I-AUROC), Pixel-level AUROC (P-AUROC), AUPR, Per-Region-Overlap (PRO).
  • FoundAD achieves state-of-the-art results in 1-shot settings: for MVTec-AD, +1.9% I-AUROC and +3.0% PRO over previous best (IIPAD). For VisA, +2.8% pixel AUROC over next best.

Performance remains robust with more shots (2-, 4-shot), yielding further gains of up to +1% per metric. The approach outperforms prompt-based and diffusion-based methods using only visual embeddings and the single nonlinear projector.
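
For readers unfamiliar with the AUROC metrics reported above, the following pure-Python sketch computes AUROC from its pairwise definition: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name `auroc` is illustrative, and evaluation code would normally rely on a library routine. For I-AUROC the samples are images, and for P-AUROC they are pixels.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of anomalous (1) and normal (0) samples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Each (anomalous, normal) pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two normal and two anomalous images.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The quadratic pairwise loop is fine for a handful of images but too slow for pixel-level evaluation, where a sort-based formulation is preferable.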

5. Ablation Analysis

Extensive ablation studies isolate the architectural and hyperparameter contributions to overall performance:

  • Foundation Encoder Choice (MVTec-AD 1-shot):
    • DINOv3 ViT-B (no text): I-AUROC 96.1%.
    • DINOv2 ViT-B: 95.2%.
    • SigLIP ViT-B: 87.8%.
    • CLIP ViT-B: 79.0% (pixel-poor).
    • WideResNet: 73.1%.
    • This suggests purely visual, self-supervised transformers (DINO series) capture the most useful manifolds for anomaly detection.
  • DINOv3 Layer Selection:
    • Layer 6: 91.2% I-AUROC (too shallow).
    • Layer 10: 96.1% (optimal).
    • Layer 12: 93.7% (too abstract).
  • Projector Architecture:
    • ViT blocks, depth=6: 96.1% I-AUROC (best).
    • MLP, depth=6: 92.1%.
    • A plausible implication is that self-attention enables superior cross-patch relationship modeling for anomaly correction.
  • Top-K Aggregation Parameter:
    • Varying $K \in \{1, 2, 4, 6, 10, 14, 20\}$: best at $K=10$ for MVTec-AD, $K=6$ for VisA.
  • Number of Shots:
    • 1-shot: PRO 92.8%.
    • 2-shot: 93.3%.
    • 4-shot: 93.5%.
    • This suggests diminishing but consistent benefit with additional normal samples.

6. Implementation Details

  • Image Processing: Resize to $512\times512$; patch size 16 ($N=1024$ tokens).
  • Encoder: Frozen DINOv3 ViT-Base, use layer 10 output.
  • Projector: 6-layer ViT, hidden dim 768, 12 heads, residuals.
  • Training: Adam optimizer, learning rate $10^{-3}$, batch size 8, 10 epochs, weight decay $10^{-4}$.
  • Anomaly Synthesis: CutPaste applied to adaptively thresholded foreground with synthesis probability $\sigma=0.5$.
  • Top-K: $K=10$ (MVTec-AD), $K=6$ (VisA).
  • Hardware: Single NVIDIA RTX 3090, peak memory $\sim$1.4 GiB, inference $\sim$7.8 fps.
  • Parameter Count: $\sim$12M trainable (projector only).
  • Public Code: https://github.com/ymxlzgy/FoundAD
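
The token count follows directly from the image and patch sizes listed above, as this short check illustrates. The helper name `token_grid` is hypothetical.

```python
def token_grid(image_size, patch_size):
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side


side, n_tokens = token_grid(512, 16)
print(side, n_tokens)  # 32 1024
```

With $d=768$, each image therefore yields a $1024\times768$ embedding matrix, and every patch score covers a $16\times16$ block of pixels in the final heatmap.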

7. Context, Applicability, and Noted Limitations

FoundAD reframes the role of foundation encoders in few-shot anomaly detection, demonstrating that a lightweight projector on latent features is a viable alternative to text-prompting and generative reconstruction. It is agnostic to object category and exhibits minimal data- and compute-dependence for adaptation, making it well-suited for industrial inspection and other few-shot scenarios.

By decoupling the task from pixel space and generative modeling, FoundAD achieves competitive performance with reduced resource requirements and complexity. Noted limitations include an implicit dependence on the quality of the encoder’s feature manifold and a reliance on in-manifold anomaly synthesis modes for effective training (Zhai et al., 2 Oct 2025).
