FoundAD: Few-shot Anomaly Detection Framework

Updated 25 March 2026

FoundAD is a few-shot anomaly detection framework that uses latent feature deviations to identify out-of-distribution regions.
It employs a lightweight 6-layer Vision Transformer projection operator on self-supervised encoder embeddings for efficient multi-class anomaly detection.
Benchmark results on datasets like MVTec-AD and VisA demonstrate state-of-the-art performance with minimal labeled data and rapid convergence.

FoundAD is a few-shot anomaly detection framework leveraging foundation visual encoders to identify out-of-distribution regions in industrial and general visual datasets. It eschews pixel reconstruction and generative modeling, instead exploiting the feature manifold hypothesized to be learned by large-scale self-supervised transformers such as DINOv3. A distinct element is the use of a lightweight nonlinear projection operator that is trained in latent space, enabling competitive and parameter-efficient multi-class anomaly detection with minimal labeled data and no reliance on text inputs or large generative models (Zhai et al., 2 Oct 2025).

1. Theoretical Rationale and Conceptual Overview

FoundAD is predicated on the assumption that foundation visual encoders, particularly self-supervised vision transformers, develop a structured manifold of normal images in their embedding space. Deviations from this natural image manifold, i.e., differences between an input's encoding and its projection onto the manifold, correlate with the presence and extent of anomalies. FoundAD therefore formulates anomaly detection purely in terms of operations over latent representations:

No reconstruction of input pixels is performed.
Anomaly scoring is a function of embedding-space discrepancies.

The core technical novelty is a trainable, nonlinear projection operator (a 6-layer ViT) that 'denoises' potentially anomalous embeddings, facilitating few-shot adaptation and robust detection with modest supervision.

2. Workflow and Mathematical Formulation

The FoundAD pipeline comprises both a training phase incorporating anomaly synthesis and a test-time inference procedure.

Key Pipeline Steps

Input: A small set of $K$ defect-free images $\{I_r^1,\dots,I_r^K\}$.
Anomaly Synthesis: Creation of synthetic anomalous images $I_s$ from each $I_r$ using CutPaste applied to foreground patches determined by adaptive thresholding.
Encoding: Extraction of patch-wise embeddings via a frozen foundation encoder $\theta$, yielding $f_r = \theta(I_r)$ and $f_s = \theta(I_s)$ with $f_r, f_s \in \mathbb{R}^{N \times d}$ ($N=1024$ patches for a $32\times32$ grid, $d=768$ for DINOv3 ViT-B).
Projection: Application of the nonlinear projector $P$ (parameterized by $\phi$) to map $f_s$ toward normality: $f_r^* = P(f_s)$.
Training Objective: Patch-wise squared $L^2$ loss $\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \|f_{r,i}^* - f_{r,i}\|_2^2$, plus optional weight decay.
Inference: For any test image $I_a$, extract $f_a = \theta(I_a)$, compute $f_a^* = P(f_a)$, and obtain patch-level anomaly scores $S_{\mathrm{patch},i} = \|f_{a,i} - f_{a,i}^*\|_2^2$.

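An aggregate image-level anomaly score is computed as the mean of the top-$K$ patch anomaly scores. Here $K$ is the aggregation parameter (10 for MVTec-AD, 6 for VisA), not the number of shots $K$ used in the input step. Pixel-level heatmaps are generated by upsampling the patch-wise scores.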
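
The scoring rule above can be illustrated with a minimal, dependency-free sketch. The function names `patch_scores` and `image_score` and the toy embeddings are assumptions made for illustration; they are not taken from the FoundAD code base, and real inputs would have $N=1024$ patches of dimension $d=768$.

```python
def patch_scores(f_a, f_a_star):
    """Squared L2 distance between each patch embedding and its projection."""
    return [
        sum((x - y) ** 2 for x, y in zip(patch, proj))
        for patch, proj in zip(f_a, f_a_star)
    ]


def image_score(scores, top_k):
    """Mean of the top_k largest patch scores."""
    top = sorted(scores, reverse=True)[:top_k]
    return sum(top) / len(top)


# Toy example: 4 patches with 2-dimensional embeddings.
f_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
# Hypothetical stand-in for the projector output P(f_a).
f_a_star = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.5, 0.5]]

scores = patch_scores(f_a, f_a_star)  # [0.0, 1.0, 8.0, 0.0]
print(image_score(scores, top_k=2))   # 4.5
```

Averaging only the largest scores keeps a small defect from being diluted by the many normal patches in the same image.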

Training and Inference Algorithms

Step	Training	Inference
Encoding	$f_r \leftarrow \theta(I_r)$, $f_s \leftarrow \theta(I_s)$	$f_a \leftarrow \theta(I_a)$
Projection	$f_r^* \leftarrow P_\phi(f_s)$	$f_a^* \leftarrow P_\phi(f_a)$
Loss/Score	$\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \|f_{r,i}^* - f_{r,i}\|_2^2$	$S_{\mathrm{patch},i} = \|f_{a,i} - f_{a,i}^*\|_2^2$
Output	Update $\phi$ (Adam, weight decay $10^{-4}$)	$S_{\mathrm{image}} =$ mean of top-$K$ $S_{\mathrm{patch}}$; upsampled heatmap
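
The training column presupposes synthetic anomalous images $I_s$. The following is a rough sketch of a CutPaste-style augmentation, not the authors' implementation: it omits the adaptive foreground thresholding described above, works on a toy grayscale image stored as nested lists, and uses hypothetical helper names (`cutpaste`, `maybe_synthesize`). Reading the synthesis probability as "apply the augmentation with probability 0.5" is also an assumption.

```python
import random


def cutpaste(image, patch_h, patch_w, rng):
    """Copy a random patch_h x patch_w rectangle and paste it elsewhere.

    `image` is a list of rows. Returns a new image; the input is not modified.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    # Source and destination top-left corners, drawn uniformly at random.
    src_y = rng.randrange(height - patch_h + 1)
    src_x = rng.randrange(width - patch_w + 1)
    dst_y = rng.randrange(height - patch_h + 1)
    dst_x = rng.randrange(width - patch_w + 1)
    for dy in range(patch_h):
        for dx in range(patch_w):
            out[dst_y + dy][dst_x + dx] = image[src_y + dy][src_x + dx]
    return out


def maybe_synthesize(image, rng, prob=0.5):
    """Apply cutpaste with probability `prob`; otherwise return a plain copy."""
    if rng.random() < prob:
        return cutpaste(image, patch_h=2, patch_w=2, rng=rng)
    return [row[:] for row in image]


rng = random.Random(0)
I_r = [[y * 8 + x for x in range(8)] for y in range(8)]  # toy 8x8 "image"
I_s = cutpaste(I_r, patch_h=2, patch_w=2, rng=rng)
changed = sum(a != b for ra, rb in zip(I_r, I_s) for a, b in zip(ra, rb))
print(changed)  # number of altered pixels, at most patch_h * patch_w
```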

3. Nonlinear Projection Operator Design

The projector $P$ operates entirely in the embedding space, with the following characteristics:

$P\colon \mathbb{R}^{N\times d} \to \mathbb{R}^{N\times d}$, processing all image patches jointly.
Implementation: 6-layer Vision Transformer, 12 self-attention heads, hidden dimension $d=768$, residual connections.
Training: Minimization of $\mathcal{L}(\phi) = \frac{1}{N}\sum_{i=1}^N \|P_\phi(f_s)_i - f_{r,i}\|_2^2 + \lambda\|\phi\|^2$, with $\lambda=10^{-4}$.
All computations are performed with the encoder $\theta$ frozen.

No pixel-space reconstruction or adversarial/generative losses are involved. This latent projection paradigm enables computational efficiency and rapid convergence (often in under 10 epochs).
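
To make the training objective concrete, here is a small sketch that evaluates $\mathcal{L}(\phi)$ on toy values. The function name `projector_loss`, the flat parameter list `phi`, and the numbers are illustrative assumptions. In practice the penalty term is usually handled by the optimizer's weight-decay setting rather than added to the loss by hand.

```python
def projector_loss(projected, target, params, weight_decay=1e-4):
    """Mean per-patch squared L2 error plus an explicit L2 penalty on params."""
    n_patches = len(target)
    recon = sum(
        sum((p - t) ** 2 for p, t in zip(proj_patch, tgt_patch))
        for proj_patch, tgt_patch in zip(projected, target)
    ) / n_patches
    penalty = weight_decay * sum(w ** 2 for w in params)
    return recon + penalty


f_r = [[1.0, 0.0], [0.0, 1.0]]    # embeddings of the clean image
P_f_s = [[1.0, 1.0], [0.0, 1.0]]  # hypothetical projector output on f_s
phi = [3.0, 4.0]                  # toy parameter vector, squared norm 25

loss = projector_loss(P_f_s, f_r, phi)
print(loss)  # approximately 0.5 + 0.0025
```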

4. Experimental Evaluation and Benchmarking

FoundAD is validated on standard few-shot industrial and general anomaly detection datasets:

Dataset	Classes	Images	I-AUROC	P-AUROC	PRO	Annotation
MVTec-AD	15	5,354	96.1%	96.8%	92.8%	1,725 anomalies
VisA	12	10,821	92.6%	99.7%	98.0%	1,200 anomalies

Main baselines: SPADE, PatchCore, FastRecon, WinCLIP, PromptAD, AnomalySD, IIPAD (multi-class); WinCLIP, InCTRL, AnomalyCLIP, PromptAD, LogSAD (one-class).
Key metrics: Image-level AUROC (I-AUROC), Pixel-level AUROC (P-AUROC), AUPR, Per-Region-Overlap (PRO).
FoundAD achieves state-of-the-art results in 1-shot settings: for MVTec-AD, +1.9% I-AUROC and +3.0% PRO over previous best (IIPAD). For VisA, +2.8% pixel AUROC over next best.
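
For readers unfamiliar with the headline metric, AUROC can be read as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The pairwise sketch below illustrates that definition on made-up scores; it is not the evaluation code used for the reported numbers, and its quadratic cost makes it suitable only for small examples.

```python
def auroc(scores, labels):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up image-level scores; label 1 marks an anomalous image.
image_scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auroc(image_scores, labels))  # 0.75
```

The same function gives a pixel-level AUROC when it is fed per-pixel scores and mask labels instead of per-image ones.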

Performance remains robust with more shots (2-, 4-shot), yielding further gains of up to +1% per metric. The approach outperforms prompt-based and diffusion-based methods using only visual embeddings and the single nonlinear projector.

5. Ablation Analysis

Extensive ablation studies isolate the architectural and hyperparameter contributions to overall performance:

Foundation Encoder Choice (MVTec-AD 1-shot):
- DINOv3 ViT-B (no text): I-AUROC 96.1%.
- DINOv2 ViT-B: 95.2%.
- SigLIP ViT-B: 87.8%.
- CLIP ViT-B: 79.0% (pixel-poor).
- WideResNet: 73.1%.
- This suggests purely visual, self-supervised transformers (DINO series) capture the most useful manifolds for anomaly detection.
DINOv3 Layer Selection:
- Layer 6: 91.2% I-AUROC (too shallow).
- Layer 10: 96.1% (optimal).
- Layer 12: 93.7% (too abstract).
Projector Architecture:
- ViT blocks, depth=6: 96.1% I-AUROC (best).
- MLP, depth=6: 92.1%.
- A plausible implication is that self-attention enables superior cross-patch relationship modeling for anomaly correction.
Top-K Aggregation Parameter:
- Varying $K \in \{1, 2, 4, 6, 10, 14, 20\}$: best at $K=10$ for MVTec-AD, $K=6$ for VisA.
Number of Shots:
- 1-shot: PRO 92.8%.
- 2-shot: 93.3%.
- 4-shot: 93.5%.
- This suggests diminishing but consistent benefit with additional normal samples.

6. Implementation Details

Image Processing: Resize to $512\times512$; patch size 16 ($N=1024$ tokens).
Encoder: Frozen DINOv3 ViT-Base, use layer 10 output.
Projector: 6-layer ViT, hidden dim 768, 12 heads, residuals.
Training: Adam optimizer, learning rate $10^{-3}$, batch size 8, 10 epochs, weight decay $10^{-4}$.
Anomaly Synthesis: CutPaste applied to adaptively thresholded foreground with synth probability $\sigma=0.5$.
Top-K: $K=10$ (MVTec-AD), $K=6$ (VisA).
Hardware: Single NVIDIA RTX 3090, peak memory $\sim$1.4 GiB, inference $\sim$7.8 fps.
Parameter Count: $\sim$12M trainable (projector only).
Public Code: https://github.com/ymxlzgy/FoundAD
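
The image-processing settings fix the token grid: $512 / 16 = 32$ patches per side, hence $32^2 = 1024$ tokens. The sketch below checks that arithmetic and shows one simple way to expand a patch-score grid into a pixel-level heatmap. Nearest-neighbor upsampling is an assumption here, since the text does not state which interpolation is used, and both helper names are hypothetical.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


def upsample_nearest(patch_map, factor):
    """Repeat each patch score over a factor x factor block of pixels."""
    return [
        [value for value in row for _ in range(factor)]
        for row in patch_map
        for _ in range(factor)
    ]


side, n_tokens = patch_grid(512, 16)
print(side, n_tokens)  # 32 1024

toy_map = [[0.0, 1.0], [2.0, 3.0]]  # 2x2 grid of patch scores
heatmap = upsample_nearest(toy_map, factor=2)
for row in heatmap:
    print(row)
```

With the real settings, `factor` would be the patch size 16, turning a $32\times32$ score grid into a $512\times512$ heatmap.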

7. Context, Applicability, and Noted Limitations

FoundAD reframes the role of foundation encoders in few-shot anomaly detection, demonstrating that a lightweight projector on latent features is a viable alternative to text-prompting and generative reconstruction. It is agnostic to object category and exhibits minimal data- and compute-dependence for adaptation, making it well-suited for industrial inspection and other few-shot scenarios.

By decoupling the task from pixel space and generative modeling, FoundAD achieves competitive performance with reduced resource requirements and complexity. Noted limitations include an implicit dependence on the quality of the encoder's feature manifold and a reliance on in-manifold anomaly synthesis modes for effective training (Zhai et al., 2 Oct 2025).

References (1)

Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors (2025)
