
Class-Generalizable Anomaly Detection

Updated 3 February 2026
  • Class-generalizable anomaly detection is a framework that identifies anomalies across various object classes without requiring separate models or labels.
  • It leverages techniques like self-induction transformers, residual feature learning, and vision-language prompting to normalize diverse feature distributions.
  • Evaluation involves cross-domain benchmarks with metrics such as AUROC and calibration analysis to ensure reliable performance on unseen classes.

Class-generalizable anomaly detection refers to methods that enable a single anomaly detection model to function robustly across multiple object categories or data domains—often without per-class retraining, fine-tuning, or category labels—even when encountering previously unseen object classes. The main challenge is that the feature distributions of “normal” data vary dramatically from class to class, which typically prevents classic one-class or per-class anomaly detectors from transferring effectively. Recent advances have produced purpose-built architectures, calibration strategies, and theoretical frameworks to achieve broad generalization, motivated by applications in industrial inspection, cybersecurity, open-set recognition, and cross-domain deployment.

1. Problem Formulation and Theoretical Challenges

Class-generalizable anomaly detection extends the classical setting by relaxing the assumption that a model will only be queried on test data belonging to the same class or distribution as the training “normal” data. Formalizations differ by setting:

  • Multi-category unsupervised AD: Given a training set D_train consisting only of normal samples from C object categories, learn an anomaly scoring function s(x) that is sensitive to deviations in any of these categories, using one unified model without per-class heads or fine-tuning (Yao et al., 2022, Guo et al., 2024).
  • Open-set generalization: The scoring function s(x) should yield low values for normal samples from multiple known classes and high values for anomalies, even when anomalies come from previously unseen distributions or classes (Bergman et al., 2020, Singh et al., 2021).
  • Few-/zero-shot transfer: Methods such as residual-learning and dictionary-lookup ensure that, provided only a handful of normal instances from an unseen class, the anomaly detection mechanism will extend to that class with minimal or no adaptation (Yao et al., 2024, Qu et al., 19 Aug 2025, Zhu et al., 2024).

The core difficulty is aligning, normalizing, or regularizing the anomaly score distributions so that the same detection threshold (or calibration) is effective across all plausible class or domain scenarios. Without such mechanisms, the model’s false-positive/false-negative rates become class-dependent and calibration becomes intractable.
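The threshold problem can be made concrete with a small simulation. The sketch below is illustrative and not taken from any of the cited papers: two classes are given synthetic anomaly-score distributions with different scales, a single global threshold produces wildly class-dependent false-positive rates, and a per-class standardization step (in the spirit of score-distribution alignment methods such as CADA) restores a shared operating point. All constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly scores for *normal* samples of two classes whose
# "normal" feature distributions differ -- the core difficulty above.
scores_a = rng.normal(loc=0.0, scale=1.0, size=10_000)  # class A normals
scores_b = rng.normal(loc=5.0, scale=2.0, size=10_000)  # class B normals

# A single global threshold calibrated on class A (its 99th percentile):
tau = np.quantile(scores_a, 0.99)
fpr_a = np.mean(scores_a > tau)  # ~1% by construction
fpr_b = np.mean(scores_b > tau)  # far higher: class B normals exceed tau

# Per-class standardization aligns the score distributions without any
# anomaly labels, so one threshold works for both classes.
za = (scores_a - scores_a.mean()) / scores_a.std()
zb = (scores_b - scores_b.mean()) / scores_b.std()
tau_z = np.quantile(za, 0.99)
fpr_a_z = np.mean(za > tau_z)
fpr_b_z = np.mean(zb > tau_z)

print(f"global threshold : FPR_A={fpr_a:.3f}  FPR_B={fpr_b:.3f}")
print(f"aligned threshold: FPR_A={fpr_a_z:.3f}  FPR_B={fpr_b_z:.3f}")
```

In practice the per-class statistics are not available at test time, which is exactly why methods discussed below regress or otherwise predict them from the input itself.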

2. Architectural Methods for Generalization Across Classes

Architectures designed or adapted for class-generalizability vary widely, each addressing the “normal class drift” problem in a different way.

  • Self-Induction Transformers: The Self-Induction Vision Transformer (SIVT) first extracts multi-scale property descriptors using a frozen CNN, then reconstructs these features with a transformer encoder/decoder via a self-induction mechanism (induction tokens attend only to random subsets of the inputs). This forces the model to learn globally coherent representations that cannot exploit reconstruction shortcuts and therefore generalize to new categories (Yao et al., 2022).
  • Residual Feature Learning: Models such as ResAD generate residual features for each local descriptor by subtracting the nearest feature from a small pool of normal exemplars (typically few-shot), followed by hypersphere constraining and density estimation via normalizing flows. This process normalizes out class-specific variation and yields a nearly universal normal feature distribution that extends to unseen classes (Yao et al., 2024).
  • Vision-Language Prompting and Dictionary Lookup: Frameworks such as DictAS and InCTRL leverage frozen CLIP visual encoders to extract patch-level or image-level representations conditioned on few-shot normal prompts from the target domain, computing anomaly scores as distance or residuals to that prompt set. DictAS further employs dictionary lookup and sparse probability modules for patch-wise anomaly segmentation without need for retraining on new classes (Qu et al., 19 Aug 2025, Zhu et al., 2024).
  • Dual-Branch Knowledge Distillation Models: Generalist frameworks fuse outputs from (i) a patch-based encoder-decoder (local, industrial bias) and (ii) a global encoder-encoder (semantic/generalist bias), each distilled from a powerful pre-trained vision transformer. Their outputs are fused using a Noisy-OR objective, creating a model robust to both subtle and global anomalies across domains (Park et al., 29 Sep 2025).
  • Class-Agnostic Distribution Alignment: CADA post-processes the output of nearly any AD model by regressing (from backbone features) to per-image calibration statistics (mean, range of normal scores), aligning the anomaly-score distribution across latent classes without knowledge of class labels. This enables one fixed threshold for defect detection on all classes and subclasses (Guo et al., 2024).
  • Meta-Learning Strategies: Episodic meta-learning treats different sets of classes alternately as "normal" vs. "anomaly," enabling the model to generalize the decision margin to unseen classes. Bilevel optimization tunes the representation to be compact for normal samples and sharp for known (meta) anomalies, while holding margins for held-out classes during test (Roy et al., 27 Jan 2026).
  • Mixture-of-Experts and Boundary-Driven Approaches: ABounD blends MoE-learned general semantic priors with class-specific cues into adaptive prompts, and sculpts a robust boundary using adversarial feature crafting and boundary losses. The result is high accuracy with only few-shot normal samples and without per-category fine-tuning (Deng et al., 27 Nov 2025).
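The residual-feature idea behind ResAD-style methods can be sketched in a few lines of NumPy. This is a simplified illustration, not the published implementation: features are assumed to be already extracted, the exemplar pool stands in for the few-shot normal set, and the hypersphere constraint and normalizing-flow density estimation are omitted, leaving only the class-normalizing residual step.

```python
import numpy as np

def residual_features(feats, normal_pool):
    """Subtract each feature's nearest normal exemplar (L2 distance).
    The residual distribution is far less class-dependent than the raw
    features, which is what enables transfer to unseen classes."""
    # pairwise distances: (n_feats, n_pool)
    d = np.linalg.norm(feats[:, None, :] - normal_pool[None, :, :], axis=-1)
    nearest = normal_pool[d.argmin(axis=1)]
    return feats - nearest

rng = np.random.default_rng(1)
pool = rng.normal(size=(16, 64))  # few-shot normal exemplar features
# normal test features: small perturbations of pool exemplars
normal = pool[rng.integers(0, 16, 100)] + 0.05 * rng.normal(size=(100, 64))
# anomalous features: off the normal manifold entirely
anomalous = rng.normal(loc=1.0, size=(20, 64))

score_n = np.linalg.norm(residual_features(normal, pool), axis=1)
score_a = np.linalg.norm(residual_features(anomalous, pool), axis=1)
print(score_n.mean(), score_a.mean())  # anomalies leave much larger residuals
```

A full method would then fit a density model (e.g., a normalizing flow) on the residuals rather than thresholding their norms directly.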

3. Evaluation Protocols, Benchmarks, and Metrics

Benchmarking class-generalizable anomaly detectors requires a diverse suite of datasets and metrics sensitive to both cross-class and cross-domain transfer:

  • Datasets: Coverage spans industrial (e.g., MVTec AD, VisA, MVTec LOCO, Real-IAD), medical (e.g., Uni-Medical, BrainMRI), semantic (e.g., CIFAR-10/100, Fashion-MNIST), and open-set or domain-generalization datasets (Zhang et al., 2024, Bühler et al., 2024).
  • Evaluation Paradigms: Experiments evaluate models in single-class, multi-class, and few-shot (including leave-one-class/domain-out) scenarios, often reporting both image-level and pixel-level performance.
  • Core metrics: Image-level and pixel-level AUROC/mAP, region-level AU-PRO, max-F1, maximal IoU, mean Anomaly Detection (mAD), all averaged across classes and domains (Zhang et al., 2024).
  • Calibration analysis: Emphasis is placed on the calibration of anomaly thresholds and the possibility (or impossibility) of using a single threshold across classes or domains.
  • Computational reporting: ADEval provides GPU-accelerated metric computation, enabling reliable assessment even on large-scale datasets (Zhang et al., 2024).
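Because the headline metric in these benchmarks is AUROC macro-averaged over classes, it is worth seeing that computation explicitly. The sketch below uses the Mann-Whitney rank formulation so it stays dependency-free; the class names and score distributions are synthetic stand-ins, not benchmark data.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen anomaly scores higher than a randomly chosen normal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = scores.argsort().argsort() + 1  # 1-based ranks (no ties here)
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Per-class AUROC is averaged, as in the benchmarks above. A model can look
# strong on aggregate yet fail on one class, so both views are reported.
rng = np.random.default_rng(2)
per_class = {}
for cls, sep in [("bottle", 2.0), ("cable", 0.5)]:  # illustrative classes
    normal_scores = rng.normal(0.0, 1.0, 200)
    anomaly_scores = rng.normal(sep, 1.0, 50)
    s = np.concatenate([normal_scores, anomaly_scores])
    y = np.concatenate([np.zeros(200, int), np.ones(50, int)])
    per_class[cls] = auroc(s, y)

macro_auroc = sum(per_class.values()) / len(per_class)
print(per_class, macro_auroc)
```

Pixel-level AUROC and AU-PRO follow the same ranking logic over per-pixel scores, with AU-PRO additionally normalizing by connected anomalous regions.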

4. Empirical Results, Ablations, and Comparative Analyses

Empirical studies consistently demonstrate that explicit mechanisms for class-alignment, residual normalization, or adaptive margin formation are required for successful generalization:

| Method (Setting) | Image AUROC ↑ | Pixel AUROC ↑ | Remarks | Reference |
|---|---|---|---|---|
| SIVT (MVTec-AD, all classes) | 96.4% | 96.9% | <0.5% perf. drop vs. per-class mode | (Yao et al., 2022) |
| DeepMAD (CIFAR-10) | 0.63–0.65 | — | 8–15 pts above one-class ensembles for K up to 10 | (Singh et al., 2021) |
| CADA (UniAD+RevDist) | 98.6 / 98.0 | 97.2 / 95.6 | Absolute-unified (no class labels, one threshold) | (Guo et al., 2024) |
| ResAD (VisA→MVTec) | 88.0% | 96.3% | 4-shot, no retraining; outperforms WinCLIP/PromptAD | (Yao et al., 2024) |
| DictAS (Ind+Med) | 98.4 | 97.4 (pixel) | Outperforms dict/prompt/CLIP rivals for FSAS | (Qu et al., 19 Aug 2025) |
| Dual-KD (MVTec-AD) | 99.7% | — | Outperforms GeneralAD, Dinomaly in multi-class settings | (Park et al., 29 Sep 2025) |
| ABounD (MVTec-AD, 1-shot) | 94.8% | 96.2% | Best FSAD; matches full-shot HVQ-Trans with 32 shots | (Deng et al., 27 Nov 2025) |
| SEMLP (MVTecDG) | 87.2% | — | Novel/held-out domain; best on domain generalization | (Bühler et al., 2024) |

Ablation studies reveal that class-agnostic or residual-centric adaptations yield significantly higher cross-domain AUROC, and removing residual constraints, adaptive calibration, or meta-learning steps leads to large drops in out-of-class and out-of-domain performance (Yao et al., 2024, Roy et al., 27 Jan 2026). Multi-branch distillation (patch and semantic) is essential for unified coverage of both fine-grained and semantic anomalies (Park et al., 29 Sep 2025).
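The Noisy-OR fusion used by the dual-branch distillation approach reduces to a one-line probabilistic rule. The following sketch assumes (hypothetically) that each branch already emits a calibrated per-image anomaly probability; it shows why neither branch can be silenced by the other, which is the property the ablations above attribute to multi-branch coverage.

```python
import numpy as np

def noisy_or(p_local, p_global):
    """Noisy-OR fusion: the image is flagged anomalous if *either* the
    patch (local) branch or the semantic (global) branch fires.
    p = 1 - (1 - p_local) * (1 - p_global)."""
    p_local = np.asarray(p_local, dtype=float)
    p_global = np.asarray(p_global, dtype=float)
    return 1.0 - (1.0 - p_local) * (1.0 - p_global)

# A subtle local defect (only the patch branch fires) and a semantic
# anomaly (only the global branch fires) both yield a high fused score.
print(noisy_or(0.9, 0.1))  # local defect
print(noisy_or(0.1, 0.9))  # semantic anomaly
print(noisy_or(0.1, 0.1))  # normal image stays low
```

Because the fused probability is monotone in each input, a confident detection by either student survives fusion, whereas a mean or min would let one branch suppress the other.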

5. Regularization, Calibration, and Domain Alignment Mechanisms

Generalization is predicated on controlling intra- and inter-class variation in feature and anomaly score distributions:

  • Residual-based normalization (Yao et al., 2024, Zhu et al., 2024, Qu et al., 19 Aug 2025): Nearest-neighbor or sparse dictionary matching of normal patch features is used to transform input features into distributions largely invariant to object class.
  • Distribution alignment (Guo et al., 2024): Explicit prediction and alignment of reference distribution statistics (mean, range) of anomaly scores for every input image, enabling the construction of a unified calibration scale without class labels.
  • Margin-based and meta-learning frameworks (Singh et al., 2021, Roy et al., 27 Jan 2026): Loss formulations that maximize intra-class compactness and inter-class separation, or directly tune the margin between in-distribution and out-of-distribution confidence in a task-episodic meta-objective.
  • Prompt adaptation and mixture-of-experts (Deng et al., 27 Nov 2025): Fusing general semantic priors with class-adaptive features via routing and gating, robust even under few-shot or domain-holdout conditions.
  • Dual-branch Noisy-OR (Park et al., 29 Sep 2025): Orthogonalization via heterogeneous students and output fusion, preventing any single detection pathway from dominating in domain-specialist or generalist settings.
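The margin-based objectives above (DeepMAD-style compactness plus separation, or the episodic meta-margin) share a common skeleton. The sketch below is a loss computation only, with illustrative embeddings in place of a trained encoder; the function name, margin value, and dimensions are all assumptions, not the published formulations.

```python
import numpy as np

def margin_objective(z_normal, z_meta_anom, center, margin=1.0):
    """Margin-style objective in the spirit of DeepMAD / meta-learned AD:
    pull normal embeddings toward a center (intra-class compactness) and
    push (meta-)anomalies at least `margin` away via a hinge penalty."""
    d_norm = np.linalg.norm(z_normal - center, axis=1)
    d_anom = np.linalg.norm(z_meta_anom - center, axis=1)
    compact = np.mean(d_norm ** 2)                              # compactness
    separate = np.mean(np.maximum(0.0, margin - d_anom) ** 2)   # hinge push
    return compact + separate

rng = np.random.default_rng(3)
center = np.zeros(8)
tight = 0.1 * rng.normal(size=(32, 8))               # well-trained normal cluster
far = center + 2.0 + 0.1 * rng.normal(size=(32, 8))  # anomalies beyond the margin
loose = rng.normal(size=(32, 8))                     # untrained embeddings

print(margin_objective(tight, far, center))    # small: both terms satisfied
print(margin_objective(loose, tight, center))  # large: anomalies inside margin
```

In the episodic meta-learning variant, different class subsets play the roles of `z_normal` and `z_meta_anom` across episodes, so the learned margin transfers to classes held out at test time.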

6. Limitations, Open Challenges, and Future Directions

Despite recent progress, several unresolved issues remain:

  • Domain shift and cross-modality transfer: Most frameworks assume access to some normal samples from new domains; performance for true zero-shot generalization, especially across modalities (e.g., industrial to medical), remains unsolved (Yao et al., 2022, Yao et al., 2024).
  • Calibration without class labels: While methods such as CADA enable absolute-unified thresholds, there is evidence that outlier class distributions may resist full alignment under heavy texture or distribution shift (Guo et al., 2024).
  • Dependence on prompt or reference selection: Residual-based and prompt-conditioned models rely critically on the diversity and representativity of provided few-shot normal exemplars. Poor coverage can result in higher false-positive rates (Zhu et al., 2024, Qu et al., 19 Aug 2025).
  • Data, architecture, and metric scale: Most benchmarks, even recent ones (ADer (Zhang et al., 2024)), are dominated by industrial visual AD datasets; large-scale natural-scene and medical benchmarks remain underexplored. Memory and compute requirements for some memory-bank or meta-learning methods can impede scaling (Zhang et al., 2024, Bühler et al., 2024).
  • Theoretical optimality and margin selection: Optimal selection of calibration or margin hyperparameters (e.g., in DeepMAD, ResAD) often lacks principled backing and may be domain-sensitive (Singh et al., 2021, Roy et al., 27 Jan 2026).

Promising future directions include unified, plug-and-play calibration layers applicable across AD paradigms, robust instance selection for reference-based residual learning, and larger, more challenging, domain-diverse benchmarks designed explicitly for class-generalizable anomaly detection (Zhang et al., 2024).

7. Comparative Summary Table

| Model | Approach | Generalization | Calibration | Main Reference |
|---|---|---|---|---|
| SIVT | Self-induction ViT | Multi-class, no per-class heads | Score std/max | (Yao et al., 2022) |
| DeepMAD | Margin-based encoder | One-vs-rest, compact | Margin/center distance | (Singh et al., 2021) |
| GOAD | Geometric centers/tasks | Open-set, affine | Margin, uniformity | (Bergman et al., 2020) |
| CADA | Score-distribution regressor | Class-free, plug-in | Per-image normalization | (Guo et al., 2024) |
| SEMLP | Patch-MLP (domain-general) | Leave-domain-out | MLP calibration | (Bühler et al., 2024) |
| ResAD | Residual features, flows | Cross-class, few-shot | Normalizing flow | (Yao et al., 2024) |
| DictAS | Dictionary lookup (CLIP) | Few-shot, segmentation | Sparse retrieval | (Qu et al., 19 Aug 2025) |
| InCTRL | CLIP residuals, prompts | Generalist, few-shot | Patch/image fusion | (Zhu et al., 2024) |
| Generalist Dual-KD | Distillation, Noisy-OR | All domains, tasks | Noisy-OR fusion | (Park et al., 29 Sep 2025) |
| ABounD | Prompts, PGD boundary | Few-shot, adaptive | MoE boundary loss | (Deng et al., 27 Nov 2025) |

These representative methods underline the breadth of architectural strategies and regularization techniques currently researched for class-generalizable anomaly detection.


Class-generalizable anomaly detection remains a rapidly evolving field at the intersection of unsupervised learning, meta-learning, domain adaptation, and open-set recognition. The balance between architectural simplicity, statistical calibration, and semantic alignment is at the core of generalization success, with practical deployment driven by the ability to align detection performance across diverse and previously unseen categories or domains. For further detail, see the referenced primary literature (Yao et al., 2022, Bergman et al., 2020, Singh et al., 2021, Guo et al., 2024, Zhu et al., 2024, Yao et al., 2024, Qu et al., 19 Aug 2025, Deng et al., 27 Nov 2025, Park et al., 29 Sep 2025, Roy et al., 27 Jan 2026, Zhang et al., 2024, Bühler et al., 2024).
