
DINOv2-Based Few-Shot Anomaly Detectors

Updated 21 December 2025
  • DINOv2-based few-shot anomaly detectors are vision-centric models leveraging pre-trained patch embeddings to identify structural anomalies via training-free nearest neighbor and lightweight projection methods.
  • They combine frozen self-supervised Vision Transformers with minimal adaptation to achieve state-of-the-art image and pixel-level anomaly detection on industrial benchmarks like MVTec-AD and VisA.
  • These methods offer rapid inference and low overhead while facing challenges such as adversarial sensitivity and calibration, which can be mitigated with post-hoc scaling techniques.

DINOv2-based few-shot anomaly detectors are vision-centric architectures leveraging the high representational capacity of DINOv2, a state-of-the-art self-supervised Vision Transformer, to address anomaly detection in scarce-data regimes. These methods combine frozen DINOv2 extractors with minimal adaptation modules or entirely training-free paradigms, enabling robust image- and pixel-level anomaly detection and localization by exploiting the structure of patch embeddings. Their success demonstrates that vision foundation models alone can match or surpass multi-modal and meta-learned approaches in industrial and cross-domain contexts.

1. Core Principles and Theoretical Basis

DINOv2-based few-shot anomaly detectors are founded on the principle that the patch-level feature space of a pre-trained Vision Transformer is sufficiently structured such that anomalies—defined as statistical or semantic deviations from reference “normal” images—manifest as out-of-distribution regions or reconstruction failures in this latent space. Two key paradigms emerge:

  • Training-free deep nearest neighbor (“NN”) approaches: Patch-level DINOv2 embeddings from a small set of reference (“support”) images form a memory bank. At test time, each query patch is scored by its minimum distance to the reference bank, providing pixel and image-level anomaly evidence without any re-training or adaptation (Damm et al., 23 May 2024).
  • Lightweight learnable projection approaches: Rather than relying solely on NN-search, a small nonlinear projector is trained to “snap” distorted (potential anomaly) embeddings back onto the manifold of normal data, providing a metric for out-of-distribution deviation directly in feature space (Zhai et al., 2 Oct 2025).

Both approaches exploit the pre-trained structure of DINOv2, eschewing the category-specific training or text supervision typical of CLIP-like frameworks.

2. Methodologies and Architectural Advances

A range of architectural strategies instantiate DINOv2-based few-shot anomaly detectors. Notable systems include:

AnomalyDINO: Training-Free Nearest-Neighbor Detection

  • Architecture: A frozen DINOv2 ViT-S/14 (or ViT-B) encoder divides each input image into $N$ non-overlapping $p \times p$ patches; each patch yields an embedding $f_p(x) \in \mathbb{R}^d$.
  • Memory Bank: All patch tokens from $k$ nominal images populate a memory bank $M$.
  • Scoring: For a test image, the anomaly score for patch $j$ is $s_j = \min_{p^+ \in M} d_{\cos}(f_j, p^+)$, with $d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \|v\|_2}$.
  • Image-Level Aggregation: The image-level score $S(x)$ is computed as the mean of the top 1% largest patch scores.
  • Localization: Pixel-level heatmaps are obtained through bilinear upsampling of the patch anomaly map, optionally followed by Gaussian filtering for spatial smoothing (Damm et al., 23 May 2024). A minimal sketch of this pipeline follows below.
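
A minimal sketch of this training-free pipeline, assuming patch embeddings have already been extracted with a frozen DINOv2 encoder; the function names and tensor shapes are illustrative rather than taken from the AnomalyDINO code:

```python
import torch
import torch.nn.functional as F

def build_memory_bank(ref_patch_embeddings):
    """Stack L2-normalized patch tokens from the k reference images into one bank."""
    bank = torch.cat(ref_patch_embeddings, dim=0)    # (k * n_patches, d)
    return F.normalize(bank, dim=-1)

def patch_and_image_scores(query_patches, memory_bank, top_frac=0.01):
    """Per-patch cosine distance to the nearest bank entry, plus the top-1% aggregate."""
    q = F.normalize(query_patches, dim=-1)            # (n_patches, d)
    cos_sim = q @ memory_bank.T                       # (n_patches, |M|)
    patch_scores = 1.0 - cos_sim.max(dim=1).values    # minimum cosine distance per patch
    k = max(1, int(top_frac * patch_scores.numel()))
    image_score = patch_scores.topk(k).values.mean()  # mean of the top 1% patch scores
    return patch_scores, image_score

def localization_map(patch_scores, grid_hw, out_hw):
    """Bilinearly upsample the patch-score grid to a pixel-level anomaly heatmap."""
    grid = patch_scores.view(1, 1, *grid_hw)
    return F.interpolate(grid, size=out_hw, mode="bilinear", align_corners=False)[0, 0]
```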

FoundAD: Few-Shot Latent-Space Projector

  • Backbone: Frozen DINOv2 ViT-B, input size $518 \times 518$.
  • Projection Module: A 6-layer ViT manifold projector ($\sim$11.8M parameters) learns a mapping $\phi_\psi$ that pulls anomalous features back toward the expected normal-image feature distribution.
  • Training Objective: Only normal images are used during training; synthetic anomalies are generated via CutPaste. The model is trained with a per-patch squared L2 loss, optimizing

$$\mathcal{L}(\psi) = \frac{1}{N} \sum_{i=1}^{N} \|\phi_\psi(f_s)_i - f_{r,i}\|_2^2,$$

where $f_s$ is a disturbed (synthetic-anomaly) embedding and $f_r$ is the corresponding reference embedding.

  • Anomaly Scoring: At inference, the per-patch anomaly score is the squared L2 norm between the original and projected feature. The mean of the top $K$ patch scores yields the image-level anomaly metric (Zhai et al., 2 Oct 2025); see the sketch below.
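
A sketch of the projection idea under stated assumptions: the 6-layer ViT projector is abstracted as a small MLP stand-in, and CutPaste-perturbed embeddings `f_s` are assumed to be available alongside the clean reference embeddings `f_r`; none of these names come from the FoundAD release.

```python
import torch
import torch.nn as nn

class ManifoldProjector(nn.Module):
    """Stand-in for the learnable projector (the paper uses a ~11.8M-parameter ViT)."""
    def __init__(self, dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def projection_loss(projector, f_s, f_r):
    """Per-patch squared L2 loss between projected disturbed features and reference features."""
    return ((projector(f_s) - f_r) ** 2).sum(dim=-1).mean()

@torch.no_grad()
def image_score(projector, f_q, top_k=50):
    """Squared L2 between original and projected features; mean of the top-K patch scores."""
    per_patch = ((f_q - projector(f_q)) ** 2).sum(dim=-1)   # (n_patches,)
    return per_patch.topk(min(top_k, per_patch.numel())).values.mean()
```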

Dinomaly2: Unified Reconstruction-Based Framework

  • Feature Integration: Frozen DINOv2 ViT-B supplies patch tokens from middle transformer layers (layers 3–10).
  • Noisy Bottleneck: A dropout-regularized MLP injects stochasticity, forcing decoders to generalize.
  • Decoder: An 8-block transformer reconstructs the grouped features of normal images under a “loose” group-wise reconstruction objective with cosine-distance losses (a minimal sketch follows this list).
  • Context Recoding: Patch tokens are recentered with respect to the class token.
  • Few-Shot Protocol: No adaptation to few-shot is required; the same architecture and loss apply whether 8 or 200 reference images are available (Guo et al., 20 Oct 2025).
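
The noisy bottleneck and cosine-distance reconstruction objective can be illustrated roughly as follows; the transformer decoder and feature grouping are abstracted away, and the module sizes are assumptions rather than Dinomaly2's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyBottleneck(nn.Module):
    """Dropout-regularized MLP that perturbs encoder features before decoding."""
    def __init__(self, dim=768, p_drop=0.2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Dropout(p_drop), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return self.mlp(x)

def cosine_reconstruction_loss(decoded_groups, encoder_groups):
    """Cosine-distance loss between decoder outputs and frozen (grouped) encoder features."""
    loss = 0.0
    for dec, enc in zip(decoded_groups, encoder_groups):
        loss = loss + (1.0 - F.cosine_similarity(dec, enc.detach(), dim=-1)).mean()
    return loss / len(decoded_groups)

def anomaly_map(decoded, encoded):
    """Per-patch anomaly evidence: cosine distance between reconstruction and encoder features."""
    return 1.0 - F.cosine_similarity(decoded, encoded, dim=-1)
```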

Cross-Domain Extensions: NexViTAD

  • Feature Fusion: Hierarchical adapters combine DINOv2 with Hiera features.
  • Shared Subspace Projections: Shared bottleneck dimensions and skip connections enable transfer across source domains.
  • Anomaly Scoring: Sinkhorn-K-means clustering and adaptive thresholding produce pixel-level segmentation (see the sketch after this list).
  • Multi-task Learning: An MTL decoder architecture improves robustness under domain shift (Mu et al., 10 Jul 2025).
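
The Sinkhorn-K-means step can be sketched generically as balanced soft K-means on L2-normalized patch features; the following is a textbook-style implementation under assumed hyperparameters, not the NexViTAD code.

```python
import torch
import torch.nn.functional as F

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Balanced soft assignments: alternately normalize rows and columns of exp(scores / eps)."""
    Q = torch.exp(scores / eps)              # (n_points, n_clusters)
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True)   # balance cluster sizes
        Q = Q / Q.sum(dim=1, keepdim=True)   # each point's assignments sum to 1
    return Q

def sinkhorn_kmeans(features, n_clusters=8, n_steps=10):
    """K-means with Sinkhorn-normalized assignments on L2-normalized features."""
    feats = F.normalize(features, dim=-1)
    centroids = feats[torch.randperm(feats.size(0))[:n_clusters]].clone()
    for _ in range(n_steps):
        assignments = sinkhorn(feats @ centroids.T)
        centroids = F.normalize(assignments.T @ feats, dim=-1)
    return centroids, assignments
```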

3. Evaluation Protocols, Metrics, and Benchmarks

Extensive evaluations on industrial and public anomaly detection datasets validate the superiority of these DINOv2-based architectures in the few-shot regime.

  • Datasets:
    • MVTec-AD: 15 object/texture categories, 5,354 images.
    • VisA: 12 object categories, 10,821 images (multi-view).
  • Metrics:
    • Image-level AUROC (“I-AUROC”): Area under the ROC curve, image prediction.
    • Pixel-level AUROC (“P-AUROC”): Fine-grained localization performance.
    • Average Precision (AP), PRO (Per-Region Overlap).
  • Standardized Aggregation: All leading methods aggregate patch scores by the mean of the top 1% largest responses, shown empirically to balance sensitivity with false positive control (Damm et al., 23 May 2024, Zhai et al., 2 Oct 2025, Guo et al., 20 Oct 2025).
| Method | Model | 1-shot I-AUROC (MVTec) | Few-shot I-AUROC (MVTec) | I-AUROC (VisA) | Pixel-AUROC (MVTec) |
|---|---|---|---|---|---|
| AnomalyDINO | DINOv2-S | 96.6% | 97.7% (4-shot) | 87–89% (1-shot) | 96.8% |
| FoundAD | DINOv2-B | 95.2% | 96.4% | — | — |
| Dinomaly2 | DINOv2-B | — | 98.7% (8-shot) | 97.4% (8-shot) | 97.5% |
| NexViTAD | DINOv2+Hiera | — | 97.5% (few-shot) | — | 95.2% |

These results indicate that DINOv2-driven approaches set the state of the art in few-shot AUROC performance, typically outperforming multimodal (e.g., CLIP-based) or fully supervised competitors even with $k \leq 8$ reference images (Damm et al., 23 May 2024, Zhai et al., 2 Oct 2025, Guo et al., 20 Oct 2025, Mu et al., 10 Jul 2025).
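
For reference, the image- and pixel-level metrics used above can be computed directly from raw anomaly scores; the snippet below is a generic scikit-learn sketch (PRO is omitted for brevity) and is not tied to any of the cited implementations.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def image_level_metrics(image_scores, image_labels):
    """I-AUROC and AP from one anomaly score per image (label 1 = anomalous)."""
    return (roc_auc_score(image_labels, image_scores),
            average_precision_score(image_labels, image_scores))

def pixel_level_auroc(score_maps, gt_masks):
    """P-AUROC pooled over all pixels of all test images."""
    scores = np.concatenate([m.ravel() for m in score_maps])
    labels = np.concatenate([g.ravel().astype(int) for g in gt_masks])
    return roc_auc_score(labels, scores)
```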

4. Robustness, Limitations, and Uncertainty Quantification

DINOv2-based nearest-neighbor few-shot anomaly detectors are empirically highly sensitive to adversarial perturbations in image space. FGSM white-box attacks ($\ell_\infty$ norm, $\epsilon = 8/255$) against models like AnomalyDINO result in AUROC drops exceeding 36% (from $\sim$97% to $\sim$60% on MVTec-AD; 92% to 52% on VisA in the 1- or 4-shot regime) (Khan et al., 15 Oct 2025). These perturbations alter the geometry of the DINOv2 patch-embedding space enough to flip nearest-neighbor relations, causing confident misclassification.

Moreover, raw anomaly scores output by these detectors are poorly calibrated: the confidence assigned by the detector does not accurately reflect the empirical likelihood of an image being anomalous. This miscalibration is quantified by a high Expected Calibration Error (ECE), e.g., $\mathrm{ECE}_{\text{uncalibrated}} = 0.4261$ on MVTec-AD (1-shot) (Khan et al., 15 Oct 2025). Post-hoc Platt scaling fits a logistic regression to the anomaly scores, reducing ECE by an order of magnitude (to 0.0536) and providing well-behaved posterior probabilities.

Under attack, the predictive entropy of the calibrated scores increases, enabling adversarial flagging: in the MVTec-AD 1-shot regime, entropy rises from $H_\mathrm{clean} \approx 0.122$ to $H_\mathrm{adv} \approx 0.490$. Thresholding on entropy allows practical detection of suspicious inputs (Khan et al., 15 Oct 2025).
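
A minimal sketch of post-hoc Platt scaling and entropy-based flagging, assuming a small held-out labelled split for fitting the logistic map; the entropy threshold is an illustrative value, not one reported in the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_scaler(val_scores, val_labels):
    """Fit a 1-D logistic regression mapping raw anomaly scores to calibrated probabilities."""
    return LogisticRegression().fit(np.asarray(val_scores).reshape(-1, 1), val_labels)

def calibrated_probability(scaler, score):
    return scaler.predict_proba(np.array([[score]]))[0, 1]

def predictive_entropy(p):
    """Binary entropy (in nats) of the calibrated anomaly probability."""
    p = float(np.clip(p, 1e-8, 1 - 1e-8))
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def flag_suspicious(scaler, score, entropy_threshold=0.3):
    """Flag inputs whose calibrated prediction is unusually uncertain (possible adversarial input)."""
    return predictive_entropy(calibrated_probability(scaler, score)) > entropy_threshold
```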

Observed Limitations

  • Purely visual foundation models sometimes fail on semantic anomalies undetectable through low-level features (e.g., “cable swap” errors), achieving AUROC $\approx$ 50% for such cases (Damm et al., 23 May 2024).
  • In the strict one-shot regime, if the reference fails to encode all salient features (e.g., missing printed text), anomalous test samples may not be recognized (Damm et al., 23 May 2024).
  • Nearest-neighbor and reconstruction-based detectors are particularly exposed to adversarial attacks, necessitating additional calibration or defenses in critical deployments (Khan et al., 15 Oct 2025).

5. Implementation and Practical Deployment

DINOv2-based anomaly systems are noted for low implementation complexity, rapid inference, and near-immediate field deployment:

  • No full model fine-tuning: All heavy network layers are frozen.
  • Low overhead: Patch banks require only modest memory ($\sim$1.5 MB per $k=1$ reference image at $n=1024$ patches, 384-dimensional) (Damm et al., 23 May 2024).
  • Fast inference: AnomalyDINO achieves a throughput of $\sim$16.7 fps (448 px input, 1-shot) on an NVIDIA A40 GPU; FoundAD inference takes $\sim$130 ms/image on an RTX 3090 (Zhai et al., 2 Oct 2025).
  • Hardware/Frameworks: PyTorch; vectorized patch-distance computation (e.g., torch.topk); nearest-neighbor search via Faiss, sketched after this list (Zhai et al., 2 Oct 2025, Damm et al., 23 May 2024).
  • No prompt or language supervision: Unlike vision-LLMs, DINOv2-based detectors are entirely vision-only, requiring no text labels or prompts in any stage (Damm et al., 23 May 2024).
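
The Faiss-based nearest-neighbor search mentioned above can be sketched as an exact inner-product index over L2-normalized patch tokens (so that inner product equals cosine similarity); the helper names below are illustrative.

```python
import faiss
import numpy as np

def build_faiss_bank(ref_patch_tokens):
    """Exact inner-product index over L2-normalized patch tokens (cosine similarity search)."""
    bank = np.ascontiguousarray(ref_patch_tokens, dtype="float32")
    faiss.normalize_L2(bank)
    index = faiss.IndexFlatIP(bank.shape[1])
    index.add(bank)
    return index

def patch_cosine_distances(index, query_patch_tokens):
    """1 - max cosine similarity for each query patch (nearest-neighbor anomaly evidence)."""
    q = np.ascontiguousarray(query_patch_tokens, dtype="float32")
    faiss.normalize_L2(q)
    sims, _ = index.search(q, 1)   # top-1 neighbor per patch
    return 1.0 - sims[:, 0]
```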

Typical Few-Shot Protocol

  1. Collect $k$ clean support images per class or category and extract a DINOv2 patch-token memory bank (e.g., via torch.hub, as sketched after this list).
  2. At test time, segment or score each query by computing deep nearest-neighbor or projection-based anomaly metrics.
  3. Aggregate per-patch scores with the “mean of top 1%” statistic for robust image-level detection and use bilinear–Gaussian upsampling for fine-grained localization.
  4. Optionally calibrate scores via held-out splits and Platt scaling for uncertainty-aware deployment (Zhai et al., 2 Oct 2025, Damm et al., 23 May 2024, Khan et al., 15 Oct 2025).
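
Step 1 can be implemented with the publicly released DINOv2 checkpoints; the snippet below assumes the official torch.hub entry point and its forward_features interface (key names may differ across releases), and the 448-pixel crop is chosen only because it is divisible by the 14-pixel patch size.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Frozen DINOv2 backbone from the official hub entry point (assumed available).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = T.Compose([
    T.Resize(448), T.CenterCrop(448), T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def patch_tokens(image_path):
    """Return the (n_patches, d) patch-token matrix for one image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feats = model.forward_features(x)           # returns a dict in the official repo
    return feats["x_norm_patchtokens"][0]       # (n_patches, d)

# Memory bank: concatenate patch tokens from the k reference images, e.g.
# bank = torch.cat([patch_tokens(p) for p in reference_paths], dim=0)
```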

6. Emerging Directions and Cross-Domain Generalization

Recent work pursues extending DINOv2-based detectors for full-spectrum anomaly detection:

  • Unified frameworks: Dinomaly2 demonstrates that a single model, with minimal adaptation, can extend across modalities (2D, 3D, IR, RGB-3D), object categories, and task regimes (single-class, multi-class, few-shot), achieving very high I-AUROC even with only 8 support images per class (98.7% on MVTec-AD; 97.4% on VisA) (Guo et al., 20 Oct 2025).
  • Cross-domain adaptation: NexViTAD fuses DINOv2 with Hiera and uses shared subspace bottlenecks and MTL decoders for robust adaptation to domain shifts, crucial in industrial applications where data distributions may drift across sites or products. Sinkhorn-K-means further improves anomaly region segmentation. On MVTec-AD, NexViTAD attains few-shot AUC of 97.5%, AP of 70.4%, and PRO of 95.2%, surpassing previous state of the art (Mu et al., 10 Jul 2025).

These findings support that vision foundation models, especially DINOv2, form a highly effective substrate for few-shot anomaly detection: adaptable, performant, and, with appropriate calibration and defense, viable for critical and cross-domain settings.
