FoundAD: Few-shot Anomaly Detection Framework
- FoundAD is a few-shot anomaly detection framework that uses latent feature deviations to identify out-of-distribution regions.
- It employs a lightweight 6-layer Vision Transformer projection operator on self-supervised encoder embeddings for efficient multi-class anomaly detection.
- Benchmark results on datasets like MVTec-AD and VisA demonstrate state-of-the-art performance with minimal labeled data and rapid convergence.
FoundAD is a few-shot anomaly detection framework leveraging foundation visual encoders to identify out-of-distribution regions in industrial and general visual datasets. It eschews pixel reconstruction and generative modeling, instead exploiting the feature manifold hypothesized to be learned by large-scale self-supervised transformers such as DINOv3. A distinct element is the use of a lightweight nonlinear projection operator that is trained in latent space, enabling competitive and parameter-efficient multi-class anomaly detection with minimal labeled data and no reliance on text inputs or large generative models (Zhai et al., 2 Oct 2025).
1. Theoretical Rationale and Conceptual Overview
FoundAD is predicated on the assumption that foundation visual encoders, particularly self-supervised vision transformers, develop a structured manifold of normal images in their embedding space. Deviations from this natural image manifold—i.e., differences between an input’s encoding and its projection onto the manifold—correlate with the presence and extent of anomalies. As such, FoundAD formulates anomaly detection purely in terms of operations over latent representations:
- No reconstruction of input pixels is performed.
- Anomaly scoring is a function of embedding-space discrepancies.
The core technical novelty is a trainable, nonlinear projection operator (a 6-layer ViT) that 'denoises' potentially anomalous embeddings, facilitating few-shot adaptation and robust detection with modest supervision.
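As a minimal sketch of scoring by embedding-space discrepancy, the snippet below assumes the patch score is the Euclidean distance between each patch embedding and its projected counterpart. The distance choice, the function name `patch_anomaly_scores`, and the toy vectors are illustrative assumptions, not details confirmed by the source.

```python
import math


def patch_anomaly_scores(embeddings, projected):
    """Per-patch Euclidean distance between embeddings and their projection.

    Both arguments are lists with one vector (list of floats) per patch.
    The Euclidean distance is an assumed choice of discrepancy measure.
    """
    if len(embeddings) != len(projected):
        raise ValueError("both inputs need one vector per patch")
    scores = []
    for z, z_hat in zip(embeddings, projected):
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_hat))))
    return scores


# Toy example: 3 patches with 2-dimensional embeddings.
z = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
# Hypothetical projector output: the third patch is moved far from its input.
z_hat = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = patch_anomaly_scores(z, z_hat)
print(scores)  # the third patch receives the largest score
```

Patches that the projector leaves unchanged score zero, while patches that it has to move a long way score high, which is the behavior the latent-deviation argument relies on.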
2. Workflow and Mathematical Formulation
The FoundAD pipeline comprises both a training phase incorporating anomaly synthesis and a test-time inference procedure.
Key Pipeline Steps
- Input: A small set of defect-free (normal) images.
- Anomaly Synthesis: Creation of a synthetic anomalous image from each normal image using CutPaste, applied to foreground patches determined by adaptive thresholding.
- Encoding: Extraction of patch-wise embeddings via a frozen foundation encoder, yielding one embedding per patch for both the normal image and its synthetic anomalous counterpart (embedding dimension 768 for DINOv3 ViT-B).
- Projection: Application of the nonlinear projector, with trainable parameters, to map the embeddings toward normality.
- Training Objective: A patch-wise loss between the projected embeddings and the embeddings of the corresponding normal image, plus optional weight decay.
- Inference: For any test image, extract its patch embeddings, compute their projection, and obtain patch-level anomaly scores from the discrepancy between the two.
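The anomaly synthesis step can be illustrated with a simplified, hypothetical sketch of CutPaste restricted to the foreground. It assumes a grayscale image stored as nested lists and uses the image mean as a stand-in for the adaptive threshold. The actual thresholding rule, patch sizes, and any transformation of the pasted patch are not specified in this text, so `foreground_mask` and `cutpaste` are illustrative names only.

```python
import random


def foreground_mask(image, threshold=None):
    """Mark pixels above a threshold; defaults to the image mean (assumed rule)."""
    flat = [p for row in image for p in row]
    if threshold is None:
        threshold = sum(flat) / len(flat)
    return [[p > threshold for p in row] for row in image]


def cutpaste(image, patch, rng):
    """Copy one square foreground patch onto another foreground location.

    Returns the modified image and a boolean mask of the pasted region.
    """
    h, w = len(image), len(image[0])
    mask = foreground_mask(image)
    # Candidate top-left corners whose patch lies fully inside the foreground.
    corners = [
        (r, c)
        for r in range(h - patch + 1)
        for c in range(w - patch + 1)
        if all(mask[r + i][c + j] for i in range(patch) for j in range(patch))
    ]
    out = [row[:] for row in image]
    pasted = [[False] * w for _ in range(h)]
    if len(corners) < 2:
        return out, pasted  # not enough foreground to synthesize an anomaly
    src, dst = rng.sample(corners, 2)
    for i in range(patch):
        for j in range(patch):
            out[dst[0] + i][dst[1] + j] = image[src[0] + i][src[1] + j]
            pasted[dst[0] + i][dst[1] + j] = True
    return out, pasted


# Toy 6x6 image: zero background with a 4x4 foreground object in the middle.
image = [[0] * 6 for _ in range(6)]
for r in range(1, 5):
    for c in range(1, 5):
        image[r][c] = 10 + 4 * (r - 1) + (c - 1)

anomalous, pasted = cutpaste(image, patch=2, rng=random.Random(0))
```

The returned mask is not needed by the training objective as described, but it is convenient for checking that the synthetic defect stays on the object.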
An aggregate image-level anomaly score is computed as the mean of the top-$K$ patch anomaly scores. Pixel-level heatmaps are generated by upsampling the patch-wise scores to the input resolution.
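The aggregation and heatmap steps can be sketched as follows. The top-K mean follows the description above; the nearest-neighbour upsampling is an assumption made to keep the example dependency-free, since the interpolation mode is not stated here. The names `image_score` and `upsample_nearest` are hypothetical.

```python
def image_score(patch_scores, k):
    """Mean of the k largest patch anomaly scores."""
    if not 1 <= k <= len(patch_scores):
        raise ValueError("k must be between 1 and the number of patches")
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / k


def upsample_nearest(score_grid, factor):
    """Nearest-neighbour upsampling of a 2D score grid by an integer factor."""
    out = []
    for row in score_grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(wide[:] for _ in range(factor))
    return out


# Toy 2x2 grid of patch scores.
grid = [[0.0, 0.25], [1.0, 0.5]]
flat = [v for row in grid for v in row]

score = image_score(flat, k=2)               # mean of 1.0 and 0.5
heatmap = upsample_nearest(grid, factor=2)   # 4x4 pixel-level map
```

Averaging only the highest-scoring patches keeps small defects from being diluted by the many normal patches in the same image.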
Training and Inference Algorithms
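| Step | Training | Inference |
|---|---|---|
| Encoding | Frozen encoder applied to the normal image and to its synthetic anomalous counterpart | Frozen encoder applied to the test image |
| Projection | Projector applied to the synthetic anomalous embeddings | Projector applied to the test embeddings |
| Loss/Score | Patch-wise loss between projected and normal embeddings | Patch-wise discrepancy between test embeddings and their projection |
| Output | Update of the projector parameters (Adam, weight decay $10^{-4}$) | Image score as the mean of the top-$K$ patch scores; upsampled heatmap |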
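The training column can be illustrated with a deliberately small toy loop. This is a sketch under strong simplifying assumptions: an elementwise affine map stands in for the 6-layer ViT projector, plain full-batch gradient descent stands in for Adam, mean squared error is assumed as the patch-wise loss, and random vectors stand in for frozen encoder embeddings. None of these substitutions come from the source.

```python
import random


def project(z, w, b):
    """Toy projector: the same elementwise affine map applied to every patch."""
    return [[w[d] * v + b[d] for d, v in enumerate(patch)] for patch in z]


def mse(pred, target):
    """Mean squared error over all patches and embedding dimensions."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)
    ) / n


def train_step(z_anom, z_norm, w, b, lr):
    """One gradient-descent step; returns the loss measured before the update."""
    pred = project(z_anom, w, b)
    n = len(pred) * len(pred[0])
    grad_w = [0.0] * len(w)
    grad_b = [0.0] * len(b)
    for patch_in, patch_pred, patch_tgt in zip(z_anom, pred, z_norm):
        for d, (x, p, t) in enumerate(zip(patch_in, patch_pred, patch_tgt)):
            err = 2.0 * (p - t) / n
            grad_w[d] += err * x
            grad_b[d] += err
    for d in range(len(w)):
        w[d] -= lr * grad_w[d]
        b[d] -= lr * grad_b[d]
    return mse(pred, z_norm)


rng = random.Random(0)
num_patches, dim = 16, 4
# Stand-in for frozen encoder embeddings of a normal image.
z_norm = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
# Stand-in for embeddings of the synthetic anomalous image: corrupt a few patches.
z_anom = [row[:] for row in z_norm]
for idx in (3, 7, 11):
    z_anom[idx] = [v + rng.gauss(0.0, 1.0) for v in z_anom[idx]]

w, b = [1.0] * dim, [0.0] * dim  # start from the identity map
losses = [train_step(z_anom, z_norm, w, b, lr=0.1) for _ in range(200)]
final_loss = mse(project(z_anom, w, b), z_norm)
```

Only `w` and `b` are updated while the embeddings stay fixed, mirroring the frozen-encoder setup. A map this simple cannot use context from neighbouring patches, which is the capability the attention-based projector is meant to provide.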
3. Nonlinear Projection Operator Design
The projector operates entirely in the embedding space, with the following characteristics:
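- Input and output: The full set of patch embeddings of an image, processed jointly.
- Implementation: 6-layer Vision Transformer, 12 self-attention heads, hidden dimension 768, residual connections.
- Training: Minimization of the patch-wise loss between projected and normal embeddings with respect to the projector parameters.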
- All computations are performed with the encoder frozen.
No pixel-space reconstruction or adversarial/generative losses are involved. This latent projection paradigm enables computational efficiency and rapid convergence (often in under 10 epochs).
4. Experimental Evaluation and Benchmarking
FoundAD is validated on standard few-shot industrial and general anomaly detection datasets:
| Dataset | Classes | Images | I-AUROC | P-AUROC | PRO | Annotation |
|---|---|---|---|---|---|---|
| MVTec-AD | 15 | 5,354 | 96.1% | 96.8% | 92.8% | 1,725 anomalies |
| VisA | 12 | 10,821 | 92.6% | 99.7% | 98.0% | 1,200 anomalies |
- Main baselines: SPADE, PatchCore, FastRecon, WinCLIP, PromptAD, AnomalySD, IIPAD (multi-class); WinCLIP, InCTRL, AnomalyCLIP, PromptAD, LogSAD (one-class).
- Key metrics: Image-level AUROC (I-AUROC), Pixel-level AUROC (P-AUROC), AUPR, Per-Region-Overlap (PRO).
- FoundAD achieves state-of-the-art results in 1-shot settings: on MVTec-AD, +1.9% I-AUROC and +3.0% PRO over the previous best (IIPAD); on VisA, +2.8% P-AUROC over the next-best method.
Performance remains robust with more shots (2-, 4-shot), yielding further gains of up to +1% per metric. The approach outperforms prompt-based and diffusion-based methods using only visual embeddings and the single nonlinear projector.
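For readers unfamiliar with the AUROC metrics listed above, the sketch below computes AUROC through its pairwise-ranking definition: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. It is a generic illustration with made-up scores, not the evaluation code of the paper. The same function applies to image-level scores (I-AUROC) or to flattened pixel scores (P-AUROC).

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of anomalous (1) and normal (0) samples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical image-level anomaly scores and ground-truth labels.
image_scores = [0.1, 0.4, 0.35, 0.8]
image_labels = [0, 0, 1, 1]
value = auroc(image_scores, image_labels)
print(value)  # 0.75: three of the four anomalous/normal pairs are ranked correctly
```

The quadratic pairwise loop is fine for small examples; large pixel-level evaluations usually rely on a sort-based implementation instead.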
5. Ablation Analysis
Extensive ablation studies isolate the architectural and hyperparameter contributions to overall performance:
- Foundation Encoder Choice (MVTec-AD 1-shot):
- DINOv3 Layer Selection:
  - Layer 6: 91.2% I-AUROC (too shallow).
  - Layer 10: 96.1% (optimal).
  - Layer 12: 93.7% (too abstract).
- Projector Architecture:
  - ViT blocks, depth=6: 96.1% I-AUROC (best).
  - MLP, depth=6: 92.1%.
  - A plausible implication is that self-attention enables superior cross-patch relationship modeling for anomaly correction.
- Top-K Aggregation Parameter:
  - Varying $K$: the best value is selected separately for MVTec-AD and for VisA.
- Number of Shots:
  - 1-shot: PRO 92.8%.
  - 2-shot: 93.3%.
  - 4-shot: 93.5%.
  - This suggests diminishing but consistent benefit with additional normal samples.
6. Implementation Details
- Image Processing: Resize to a fixed input resolution; patch size 16.
- Encoder: Frozen DINOv3 ViT-Base, use layer 10 output.
- Projector: 6-layer ViT, hidden dim 768, 12 heads, residuals.
- Training: Adam optimizer, batch size 8, 10 epochs, weight decay $10^{-4}$.
- Anomaly Synthesis: CutPaste applied to the adaptively thresholded foreground, with a set synthesis probability.
- Top-K: Set separately for MVTec-AD and for VisA.
- Hardware: Single NVIDIA RTX 3090, peak memory 1.4 GiB, inference 7.8 fps.
- Parameter Count: 12M trainable (projector only).
- Public Code: https://github.com/ymxlzgy/FoundAD
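The settings listed above can be collected in a small configuration object. This is a hypothetical sketch, not the configuration format of the public repository: the class name `FoundADConfig` and its field names are invented for illustration, values not recoverable from this text are left as `None`, and the image size of 224 in the usage lines is an arbitrary example rather than the paper's setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FoundADConfig:
    # Values stated in the text.
    encoder_name: str = "DINOv3 ViT-Base"  # kept frozen during training
    encoder_layer: int = 10
    patch_size: int = 16
    projector_depth: int = 6
    projector_dim: int = 768
    projector_heads: int = 12
    batch_size: int = 8
    epochs: int = 10
    weight_decay: float = 1e-4
    # Values not recoverable from this text; must be supplied by the user.
    image_size: Optional[int] = None
    learning_rate: Optional[float] = None
    top_k: Optional[int] = None

    @property
    def head_dim(self) -> int:
        """Per-head width of the projector's self-attention."""
        return self.projector_dim // self.projector_heads

    def num_tokens(self) -> int:
        """Number of patch tokens for a square input image."""
        if self.image_size is None:
            raise ValueError("image_size is not set")
        return (self.image_size // self.patch_size) ** 2


# Illustrative usage with an assumed input size of 224 pixels.
cfg = FoundADConfig(image_size=224)
print(cfg.head_dim, cfg.num_tokens())  # 64 196
```

Keeping the unknown values as explicit `None` fields makes it obvious which settings still need to be taken from the paper or the repository before a run.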
7. Context, Applicability, and Noted Limitations
FoundAD reframes the role of foundation encoders in few-shot anomaly detection, demonstrating that a lightweight projector on latent features is a viable alternative to text-prompting and generative reconstruction. It is agnostic to object category and needs little data and compute for adaptation, making it well-suited for industrial inspection and other few-shot scenarios.
By decoupling the task from pixel space and generative modeling, FoundAD achieves competitive performance with reduced resource requirements and complexity. Noted limitations include an implicit dependence on the quality of the encoder's feature manifold and a reliance on in-manifold anomaly synthesis modes for effective training (Zhai et al., 2 Oct 2025).