
Zero-Shot Anomaly Classification & Segmentation

Updated 4 December 2025
  • Zero-shot AC/AS is a technique for automatically identifying and segmenting anomalies in images and 3D data without using any target-specific labeled examples.
  • It leverages both batch-based mutual scoring and prompt-driven vision-language models to compute per-image and per-pixel anomaly scores for robust defect detection.
  • Practical applications include industrial inspection and medical imaging, achieving high AUROC and segmentation performance even in open-world scenarios.

Zero-shot Anomaly Classification and Segmentation (AC/AS) refers to the automatic identification and delineation of defects or anomalies in data (typically images, but also 3D volumes or multimodal inputs) without access to any labeled anomaly samples from the target distribution. In the zero-shot paradigm, models must generalize their notion of abnormality from either unlabeled test batches or from auxiliary datasets (often with prompt- or language-derived supervision), precluding any target-specific anomaly fine-tuning. This has become the dominant formulation for industrial inspection, open-world perception, and medical imaging, where anomaly distributions are vast, shifting, or costly to annotate.

1. Core Principles and Problem Formalization

In zero-shot AC/AS, the model is given either (a) only the unlabeled images and asked to predict per-image (classification) and per-pixel (segmentation) anomalies, or (b) auxiliary data from unrelated domains/classes to support prompt learning or feature alignment. No anomaly data from the test domain is available for training or threshold selection (Chen et al., 2023, Reichard et al., 1 Dec 2025, Zhou et al., 2023, Jeong et al., 2023, Li et al., 30 Jan 2024, Li et al., 13 Nov 2025, Le-Gia et al., 12 Oct 2025, Le-Gia, 2 Dec 2025).

The standard mathematical formalization is as follows:

  • Let $\mathcal{D} = \{I_1, \dots, I_N\}$ denote a batch of test images, each tokenizable into $M$ patches/volumetric tokens with features $\mathbf{z}_i^h \in \mathbb{R}^D$.
  • The anomaly scoring function $a : \mathbb{R}^D \to \mathbb{R}$ is defined so that higher $a(\mathbf{z})$ indicates greater anomaly.
  • Image-level classification: $A(I_i) = \max_h a(\mathbf{z}_i^h)$; a decision threshold $\tau$ yields the binary label (Le-Gia, 2 Dec 2025).
  • Anomaly segmentation: apply $a(\mathbf{z}_i^h)$ over all $h$, producing a heatmap or binary mask after thresholding.
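The formalization above can be sketched directly in code. The `score_fn` argument and the toy distance-from-batch-mean scorer below are illustrative stand-ins, not any cited paper's actual scoring function:

```python
import numpy as np

def image_scores(patch_features, score_fn, tau):
    """Illustrative zero-shot scoring under the formalization above.

    patch_features: list of (M, D) arrays, one per image I_i.
    score_fn: a(z), mapping an (M, D) array to M patch anomaly scores.
    tau: decision threshold (hypothetical value; chosen per dataset).
    """
    results = []
    for z in patch_features:
        a = score_fn(z)                   # per-patch scores a(z_i^h)
        A = float(a.max())                # image score A(I_i) = max_h a(z_i^h)
        results.append((A, A > tau, a))   # score, binary label, heatmap
    return results

# Toy example: score = distance from the batch-mean patch feature
rng = np.random.default_rng(0)
batch = [rng.normal(size=(16, 8)) for _ in range(4)]
mu = np.mean(np.concatenate(batch), axis=0)
score_fn = lambda z: np.linalg.norm(z - mu, axis=1)
out = image_scores(batch, score_fn, tau=3.0)
```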

Two main paradigms exist: (a) batch-based (representation) methods, which score each patch by mutual comparison within the unlabeled test batch (Section 2), and (b) prompt-driven methods, which align image features with language-derived "normal"/"abnormal" embeddings from vision-language models (Section 3).

The zero-shot regime is further complicated by the phenomenon of consistent anomalies: anomalous patterns that recur frequently, causing naive rarity-based scoring to fail and prompting the need for advanced graph, community-detection, or joint prompt-masked learning (Le-Gia, 2 Dec 2025, Le-Gia et al., 12 Oct 2025).

2. Representation-based and Mutual Scoring Approaches

Batch-based zero-shot AC/AS approaches assign anomaly scores entirely within the unlabeled test batch, eschewing any prompts or language structure:

  • Mutual Scoring Mechanism (MSM): For each patch, compute its minimum feature distance to the patches of every other image. The anomaly score is the mean of the minimal distances over the $K$ nearest images ($K = 30\%$ of the batch by default) (Li et al., 30 Jan 2024). Multi-scale feature aggregation (LNAMD/SNAMD) across different receptive fields and transformer layers improves detection of both small and large defects (Li et al., 30 Jan 2024, Li et al., 13 Nov 2025).
  • Constraint-based Re-scoring: To reduce false positives, especially from noisy isolated patches in otherwise normal images, a graph-smoothing procedure over image-level similarity graphs is employed (Li et al., 30 Jan 2024). Nodes represent images, edges reflect dot-product similarity of their global (CLS/token) embeddings, and final scores are refined via local neighbor averaging.
  • Multimodal/3D Extensions: Recent advances (MuSc-V2) extend all representation-based operations to work with both 2D (image) and 3D (point cloud or volumetric) modalities, using curvature-aware neighborhood pooling in 3D and similarity-weighted pooling across multiple scales in both 2D and 3D (Li et al., 13 Nov 2025).
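The mutual scoring idea can be illustrated with a brute-force numpy sketch; real implementations such as MuSc aggregate multi-scale features and optimize the distance computation heavily:

```python
import numpy as np

def mutual_scores(batch_feats, k_ratio=0.3):
    """Minimal sketch of the Mutual Scoring Mechanism (MSM).

    batch_feats: (N, M, D) patch features for N images of M patches each.
    For every patch, take its nearest-neighbor distance within each other
    image, then average the K smallest of those per-image minima.
    """
    N, M, D = batch_feats.shape
    K = max(1, int(k_ratio * (N - 1)))
    scores = np.zeros((N, M))
    for i in range(N):
        per_image_min = []
        for j in range(N):
            if j == i:
                continue
            # (M, M) pairwise distances between patches of images i and j
            d = np.linalg.norm(batch_feats[i][:, None] - batch_feats[j][None],
                               axis=-1)
            per_image_min.append(d.min(axis=1))  # nearest patch in image j
        per_image_min = np.stack(per_image_min)  # (N-1, M)
        per_image_min.sort(axis=0)
        scores[i] = per_image_min[:K].mean(axis=0)  # mean of K smallest minima
    return scores  # (N, M) patch anomaly scores
```

A patch that no other image can match closely (a rare defect) receives a large score; normal patches find near duplicates across the batch and score low.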

Batch-based approaches are fully training-free, require no auxiliary dataset or normal/defect labeling, and achieve state-of-the-art results, especially on out-of-distribution industrial datasets.

3. Prompt-Driven, Language- and Vision-Language-Model-Based Methods

Methods built on vision-language models (VLMs) such as CLIP leverage semantic alignment between images and text to enable flexible anomaly reasoning:

  • Prompt Engineering and Learning: Prompt sets (state words × templates) are crafted to represent generic “normal” and “abnormal” states (“flawless [object]”, “damaged [object]”) (Jeong et al., 2023, Chen et al., 2023). Compositional prompt ensembles (WinCLIP) and stacked/clustering-driven prompts (StackCLIP) offer improved generalizability and stability (Hou et al., 30 Jun 2025).
  • Object-Agnostic Prompt Learning: AnomalyCLIP learns two short prompt vectors (“[object]”, “damaged [object]”) optimized on auxiliary domains, avoiding overfitting to specific seen classes and generalizing normal/abnormal concepts (Zhou et al., 2023).
  • Conditional / Dynamic Prompt Synthesis: CoPS synthesizes prompts conditioned on image/patch features and class semantics, injecting prototypes and sampled VAE tokens into prompt structure, enabling highly adaptive state modeling (Chen et al., 5 Aug 2025).
  • Multi-Type and Open-Vocabulary Segmentation: MultiADS extends zero-shot AS to multi-type segmentation, generating specific masks (and class probabilities) for each defect type using defect-aware textual prompts and layered similarity heads (Sadikaj et al., 9 Apr 2025). In open-world tasks (Clipomaly), VLMs dynamically expand segmentation vocabularies at inference by discovering unknown regions and assigning interpretable names via CLIP-based region matching (Reichard et al., 1 Dec 2025).
  • Few-Shot Memory Extensions: APRIL-GAN and WinCLIP+ store features from a small set of normal reference images and compare against them at inference, significantly boosting performance on logical defects and hard-to-prompt anomalies (Chen et al., 2023, Jeong et al., 2023).
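The "state words × templates" construction can be illustrated with a small sketch; the template and state strings below are representative examples, not the exact ensembles used by WinCLIP or its successors:

```python
def build_prompt_ensemble(object_name):
    """Hypothetical compositional prompt ensemble (WinCLIP-style):
    state words crossed with caption templates. All strings here are
    illustrative, not the published prompt lists."""
    templates = ["a photo of a {}", "a cropped photo of the {}"]
    normal_states = ["flawless {}", "perfect {}", "{} without defect"]
    abnormal_states = ["damaged {}", "{} with a defect", "broken {}"]

    def expand(states):
        # every state word combined with every template
        return [t.format(s.format(object_name))
                for s in states for t in templates]

    return expand(normal_states), expand(abnormal_states)
```

In practice each resulting string is encoded by the text tower and the per-class embeddings are averaged, so the ensemble smooths over wording choices.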

In all cases, classification and segmentation scores are derived from similarity logits between learned/prompted text embeddings and spatial (patch) or global (CLS) image features, typically normalized via softmax. The best-performing approaches fuse features across several transformer depths/stages.
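In sketch form, this score derivation reduces to a softmax over cosine-similarity logits; the L2 normalization and the CLIP-style temperature of 0.07 below are assumptions for illustration:

```python
import numpy as np

def anomaly_probs(patch_feats, text_normal, text_abnormal, temp=0.07):
    """Sketch of deriving per-patch anomaly scores from similarity
    logits between text embeddings and patch features. Assumes
    CLIP-style cosine similarity with a fixed temperature."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    z = norm(np.asarray(patch_feats, dtype=float))        # (M, D) patches
    t = norm(np.stack([text_normal, text_abnormal]))      # (2, D) text
    logits = z @ t.T / temp                               # (M, 2) logits
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)                  # softmax
    return p[:, 1]                                        # P(abnormal) per patch
```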

Representative results:

  • AnomalyCLIP: Industrial dataset average Img-AUROC 91.5%, Pix-AUROC 91.1% (Zhou et al., 2023).
  • WinCLIP (ViT-B/16+): MVTec-AD (zero-shot) AC AUROC 91.8%, AS pixel-AUROC 85.1%; + few-shot (WinCLIP+) AUROC 95.2%, pixel-AUROC 96.2% (Jeong et al., 2023).
  • StackCLIP: MVTec-AD zero-shot AS AUPRO 86.4%, F1-max 47.6%, AUROC 91.7% (Hou et al., 30 Jun 2025).

4. Dealing with Consistent Anomalies: Graph-Based and Theoretical Advances

Classic representation methods fail when faced with consistent anomalies—recurrently appearing defects that are no longer rare in batch statistics. This scenario is theoretically and algorithmically formalized as follows:

  • Neighbor-Burnout Phenomenon: Consistent anomalies exhibit abruptly increasing growth rates in sorted batchwise distances (log-gradients spike at the burnout index $i = H$, the number of repeated anomaly images) (Le-Gia, 2 Dec 2025, Le-Gia et al., 12 Oct 2025). For normal patches, this growth rate $\tau_i$ decays smoothly as $i^{-\alpha}$ (a power law governed by EVT).
  • CoDeGraph Algorithm: Constructs an image-level graph in which nodes sharing suspicious low-distance patterns (weighted endurance ratio $\zeta'$) are joined, and community detection (Leiden+CPM) isolates densely co-matching clusters. Patches in detected anomaly communities are selectively filtered based on their dependency ratio $r(p)$, yielding a cleaned batch for robust mutual scoring (Le-Gia, 2 Dec 2025, Le-Gia et al., 12 Oct 2025).
  • Theoretical Foundation: The statistical signature is grounded in extreme value theory (Fréchet/Beta/Exp distributions of sorted minimal distances), predicting (and separating) the scaling patterns of normal vs. consistent-anomaly patch neighbor relationships (Le-Gia, 2 Dec 2025).
  • Batch-Text Bridging: CoDeGraph-generated pseudo-masks can supervise prompt-driven models like APRIL-GAN or AnomalyCLIP, closing the domain adaptation gap and reducing reliance on labeled anomaly masks (Le-Gia, 2 Dec 2025).
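The neighbor-burnout signature can be illustrated with a simple spike detector on the log-gradients of sorted distances. The median baseline and `spike_factor` below are hypothetical choices for illustration, not the EVT-grounded test used by CoDeGraph:

```python
import numpy as np

def burnout_index(sorted_dists, spike_factor=3.0):
    """Illustrative detector for the neighbor-burnout signature: an
    abrupt spike in the log-gradient of ascending per-image minimal
    distances for one patch. Returns the index where the spike occurs
    (the putative burnout index H), or None for a smooth decay."""
    d = np.asarray(sorted_dists, dtype=float)
    grad = np.diff(np.log(d + 1e-12))      # log-gradients tau_i
    baseline = np.median(grad) + 1e-12     # hypothetical smooth baseline
    spikes = np.where(grad > spike_factor * baseline)[0]
    return int(spikes[0]) + 1 if spikes.size else None
```

A patch of a consistent anomaly finds close matches only in the $H$ other images carrying the same defect, so its sorted distances jump sharply past index $H$; a normal patch decays smoothly and triggers no spike.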

Empirically, CoDeGraph achieves robust zero-shot performance, notably excelling on consistent-anomaly-heavy datasets.

5. Practical Implementations, Robustness, and Evaluation

Zero-shot AC/AS methods are evaluated primarily on industrial (MVTec AD, VisA, MPDD, BTAD), medical (HeadCT, BrainMRI, ISIC, etc.), and open-world datasets (SMIYC, RoadAnomaly, Cityscapes anomaly tracks) (Li et al., 30 Jan 2024, Zhou et al., 2023, Hou et al., 30 Jun 2025, Reichard et al., 1 Dec 2025).

Key implementation attributes:

  • Training vs. Training-Free: Batch-based mutual scoring and Maskomaly are fully training-free; prompt-based models require auxiliary dataset fine-tuning or pseudo-mask bootstrapping.
  • Inference and Complexity: Mutual scoring and graph-based models require $O(N^2)$ memory and computation; subset partitioning and attention-layer optimization mitigate these costs.
  • Thresholding: Optimal segmentation/classification thresholds are typically chosen per-dataset or via validation, although metrics such as Maximal Detection Margin (Maskomaly) gauge robustness to gradual threshold shifts (Ackermann et al., 2023).
  • Modalities: MuSc-V2 and CoDeGraph now support 3D volumetric and multimodal data using ViT and point cloud features; volumetric tokenization achieves 99.96% AUROC on BraTS-METS MRI, Dice 75.4% (Le-Gia, 2 Dec 2025, Li et al., 13 Nov 2025).
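The quadratic memory cost can be mitigated by chunking the pairwise-distance computation. The sketch below shows the generic idea, assuming Euclidean patch distances; it is not the specific partitioning scheme of any cited method:

```python
import numpy as np

def chunked_min_dists(A, B, chunk=256):
    """Sketch of one cost mitigation: nearest-neighbor distances from
    rows of A to rows of B, computed in chunks so peak memory is
    O(chunk * |B|) instead of O(|A| * |B|)."""
    out = np.empty(len(A))
    for s in range(0, len(A), chunk):
        block = A[s:s + chunk]
        # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        d2 = ((block ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None]
              - 2 * block @ B.T)
        out[s:s + chunk] = np.sqrt(np.maximum(d2, 0)).min(axis=1)
    return out
```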

Performance highlights (all zero-shot unless noted):

| Method | Testbed | AUROC (Cls) | F1-max (Seg) | Comments |
|---|---|---|---|---|
| MuSc | MVTec AD | 97.8 | 62.6 | Training-free, batch-based |
| AnomalyCLIP | Industrial | 91.5 | 91.1 (Pix-AUROC) | Object-agnostic prompt learning |
| CoDeGraph | MVTec-CA | 98.5 | 73.8 | Robust to consistent anomalies |
| StackCLIP | MVTec-AD/VisA | 91.7/84.7 | 47.6/— | Stacked prompts, cluster-specific heads |
| MultiADS | VisA (zero-shot) | 83.6 | 89.7 | Multi-type, per-defect class segmentation |
| FiSeCLIP | MVTec-AD | 95.3 | 49.1 | Mutual filtering, batch and prompt fusion |
| APRIL-GAN (few-shot) | VAND | 0.8687 (F1) | ~0.43 (F1) | 1st in K-shot classification, state of the art |

6. Ablations, Challenges, and Limitations

Ablation studies across methods have shown:

  • Multi-stage fusion is essential; combining shallow and deep features typically boosts segmentation performance by >10 points (Chen et al., 2023).
  • Prompt design (state word × template ensembles, stacked category names, or VAE-generated class tokens) critically affects both stability and peak accuracy (Hou et al., 30 Jun 2025, Chen et al., 5 Aug 2025, Zhou et al., 2023).
  • Consistent anomaly filtering must be precise; naive top-k removal can degrade performance for classes where anomalies are not dominant (Le-Gia et al., 12 Oct 2025).
  • Mutual scoring and cross-modal fusion (as in MuSc-V2) add up to 20+ percentage points in AP over previous best for 3D-AD and Eyecandies (Li et al., 13 Nov 2025).
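The multi-stage fusion finding above can be made concrete with a normalize-then-average sketch; per-stage min-max normalization is one plausible choice among several, assumed here for illustration:

```python
import numpy as np

def fuse_stage_maps(stage_maps):
    """Sketch of multi-stage fusion: rescale each stage's anomaly map
    to [0, 1], then average, so shallow (texture) and deep (semantic)
    evidence contribute on an equal footing."""
    fused = np.zeros_like(np.asarray(stage_maps[0], dtype=float))
    for m in stage_maps:
        m = np.asarray(m, dtype=float)
        span = m.max() - m.min()
        fused += (m - m.min()) / (span + 1e-12)  # per-stage min-max scaling
    return fused / len(stage_maps)
```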

Common limitations include:

  • Failure to detect anomalies that are common in-batch (“consistency” bias); statistical methods can be circumvented if anomalies dominate numerically (Le-Gia et al., 12 Oct 2025, Le-Gia, 2 Dec 2025).
  • Lack of fine-grained segmentation at the resolution limit of ViT backbones.
  • Auxiliary dataset choice (for prompt learning) can introduce domain bias.
  • Sensitivity to batch composition (for batch-based methods) if the batch is not diverse.
  • Manual threshold tuning for robust application in deployment.
  • Maskomaly’s region-level AC must be heuristically aggregated (Ackermann et al., 2023).

Active research directions include:

  • Hybridization of paradigms: Bridging batch-based and VLM paradigms via pseudo-masks and co-training (CoDeGraph-supervised VLMs) (Le-Gia, 2 Dec 2025).
  • 3D and Multimodal Generalization: Expansion to 3D industrial and medical datasets, with surface-aware patch grouping and synchronized patch-pooling/fusion (Li et al., 13 Nov 2025, Le-Gia, 2 Dec 2025).
  • Open-Vocabulary and Human-Interpretable Anomalies: Labeling and segmenting unknown classes with structured language integration (Clipomaly), dynamic vocabulary expansion, and region-level tag assignment (Reichard et al., 1 Dec 2025).
  • Multi-type and Open-World Anomaly Detection: Simultaneous segmentation and precise defect-type discrimination for per-object, per-defect action (Sadikaj et al., 9 Apr 2025).
  • Graph-theoretic and EVT-Grounded Filtering: Community detection, endurance-ratio filtering, and statistical modeling of batchwise similarity scaling (Le-Gia, 2 Dec 2025, Le-Gia et al., 12 Oct 2025).
  • Efficiency and Robustness: Algorithmic optimizations to reduce $O(N^2)$ time and memory, including batch partitioning, CLS-guided screening, and constrained neighbor re-ranking (Le-Gia, 2 Dec 2025, Li et al., 13 Nov 2025).
  • Integration with segmentation transformers: Post-processing mask networks (Mask2Former) for anomaly segmentation with minimal engineering (Maskomaly) (Ackermann et al., 2023).

In the current landscape, zero-shot AC/AS methods have established both theoretical and practical foundations for anomaly detection in highly variable, open, and data-scarce environments. Their continued evolution toward robust, multiscale, multimodal, open-world, and explainable systems remains a central research focus.
