ADBench: Anomaly Detection Benchmark

Updated 10 April 2026

The paper presents a comprehensive AD benchmark that unifies 30 detection algorithms over 57 datasets, providing a fair and extensible evaluation framework.
It employs over 98,000 controlled experiments across tabular, vision, and NLP tasks to disentangle supervision effects, anomaly types, and noise impacts using robust metrics.
Empirical results reveal that while semi-supervised methods quickly improve with minimal labels, no single algorithm consistently outperforms others, urging data-driven algorithm selection.

ADBench is a comprehensive anomaly detection (AD) benchmark introduced to address longstanding challenges in the empirical evaluation of anomaly detection algorithms across varying data modalities, anomaly types, and supervision regimes. Its design encompasses tabular, vision, and language tasks and incorporates a wide spectrum of algorithms—ranging from classical unsupervised methods to modern deep neural networks and tree-ensemble approaches. By executing over 98,000 controlled experiments across 57 datasets, ADBench establishes a unified and extensible framework for fair, statistically grounded comparison and enables rigorous analysis of algorithmic design choices, sensitivity to data corruptions, and the effects of supervision on anomaly detection performance (Han et al., 2022).

1. Motivation and Scope

Prior to ADBench, the anomaly detection research landscape suffered from four critical deficiencies: limited supervision regimes (with an overemphasis on unsupervised protocols), lack of taxonomy distinguishing anomaly types, almost no systematic robustness analysis under noise or corruptions, and a scarcity of broad benchmarks integrating high-dimensional non-tabular domains such as computer vision and natural language. ADBench directly addresses these gaps by:

Unifying 30 state-of-the-art anomaly detection algorithms—spanning unsupervised, semi-supervised, and supervised paradigms.
Structuring experiments across 57 diverse datasets: 47 tabular datasets and 10 computer vision/NLP datasets represented as fixed-size embeddings.
Designing a protocol that explicitly disentangles the effects of supervision, anomaly type, and data corruption, with scalable, reproducible code and analytic tools (Han et al., 2022).

2. Benchmark Design and Dataset Coverage

ADBench’s algorithmic coverage comprises four main families:

Density- and distance-based: LOF, kNN, HBOS, CBLOF, COF, SOD, LODA, IsolationForest, COPOD, ECOD.
Distribution- and reconstruction-based: PCA, OCSVM, DAGMM, DeepSVDD, GANomaly, ALAD.
Semi-supervised: DeepSAD, DevNet, REPEN, PReNet, FEAWAD, GANomaly (weak), XGBOD.
Supervised classifiers: Naive Bayes, SVM, Random Forest, XGBoost, LightGBM, CatBoost, ResNet (tabular), FTTransformer.

The datasets span:

Tabular: classical AD benchmarks (n = 80–619,000 samples; d = 3–1,555 features; anomaly rates 0.03–39.9%, median 5%).
Vision: CIFAR-10, FashionMNIST, MNIST-C, SVHN, MVTec-AD, all embedded via ResNet18.
NLP: AGNews, Amazon, IMDB, Yelp, 20Newsgroups, embedded via BERT [CLS] vectors.

All non-tabular datasets are standardized with semantically meaningful anomaly splits and consistent 5% prevalence in multi-class settings for comparability (Han et al., 2022).

3. Experimental Protocol and Evaluation Metrics

Experiments operate under an inductive 70/30 train/test split, stratified by anomaly label, and are repeated over 3 random seeds. Three main supervision regimes are considered:

Unsupervised: No labels are used; the model is fit on the unlabeled training data.
Semi-supervised: A fraction $\gamma_l$ of anomalies are labeled (typical values: 1–5%).
Supervised: All anomalies labeled, enabling full binary classification.
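The split and the labeling fraction $\gamma_l$ can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not ADBench's own code; the function names `stratified_split` and `reveal_labels` are hypothetical, and treating unlabeled anomalies as label 0 is an assumption made for the example.

```python
import random

def stratified_split(labels, test_frac=0.3, seed=0):
    """Split indices into train/test, sampling each class separately
    so both parts keep roughly the same anomaly rate."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def reveal_labels(labels, train_idx, gamma_l, seed=0):
    """Return {index: observed label} for the training set.
    Only a fraction gamma_l of the true anomalies keep label 1;
    every other training point is treated as unlabeled (0).
    This encoding is an assumption made for the sketch."""
    rng = random.Random(seed)
    anomalies = [i for i in train_idx if labels[i] == 1]
    n_labeled = round(gamma_l * len(anomalies))
    known = set(rng.sample(anomalies, n_labeled))
    return {i: (1 if i in known else 0) for i in train_idx}

# Toy data: 200 anomalies among 2,000 points, three seeds as in the protocol.
labels = [1] * 200 + [0] * 1800
for seed in (0, 1, 2):
    train_idx, test_idx = stratified_split(labels, test_frac=0.3, seed=seed)
    observed = reveal_labels(labels, train_idx, gamma_l=0.05, seed=seed)
    print(seed, len(train_idx), len(test_idx), sum(observed.values()))
```

Setting `gamma_l=0.0` corresponds to the unsupervised regime and `gamma_l=1.0` to the fully supervised one, so a single helper covers all three settings in this sketch.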

ADBench’s anomaly type taxonomy includes:

Local anomalies (deviations in dense normal regions).
Global anomalies (outliers outside the convex hull of normal data).
Dependency anomalies (violate feature dependency structure).
Clustered anomalies (tight clusters structurally distinct from normal).
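To make the four types concrete, the sketch below generates a toy two-feature example of each. These generators are simplified illustrations written for this summary, not ADBench's actual synthetic-anomaly procedures; all function names, the inflation factor `scale`, the widening factor `widen`, and the cluster `center` are assumptions.

```python
import random

def make_normal(n, rng, noise=0.1):
    """Normal data: x2 tracks x1 closely, so the features are dependent."""
    points = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        points.append((x1, x1 + rng.gauss(0.0, noise)))
    return points

def local_anomalies(n, rng, noise=0.1, scale=5.0):
    """Same process as the normal data, with the noise inflated by `scale`."""
    return make_normal(n, rng, noise=noise * scale)

def global_anomalies(normal, n, rng, widen=2.0):
    """Uniform draws from a box `widen` times wider than the normal range.
    Some draws can still land near normal points; this is only a sketch."""
    bounds = []
    for j in range(2):
        lo = min(p[j] for p in normal)
        hi = max(p[j] for p in normal)
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        bounds.append((mid - widen * half, mid + widen * half))
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]

def dependency_anomalies(normal, n, rng):
    """Take each feature from a different random normal row: the marginals
    are preserved while the x1-x2 dependency is broken."""
    return [(rng.choice(normal)[0], rng.choice(normal)[1]) for _ in range(n)]

def clustered_anomalies(n, rng, center=(4.0, -4.0), spread=0.05):
    """A tight blob placed away from the normal data."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]

rng = random.Random(0)
normal = make_normal(500, rng)
synthetic = {
    "local": local_anomalies(25, rng),
    "global": global_anomalies(normal, 25, rng),
    "dependency": dependency_anomalies(normal, 25, rng),
    "clustered": clustered_anomalies(25, rng),
}
```

Keeping each type in its own generator mirrors the idea of evaluating detectors on one anomaly type at a time.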

Threshold-independent metrics are used:

ROC AUC: Area under the receiver operating characteristic curve.
PR AUC: Area under the precision-recall curve (more informative than ROC AUC when anomalies are rare).
Statistical comparison: Wilcoxon signed-rank tests and Critical Difference (CD) diagrams assess whether performance differences between methods are significant (Han et al., 2022).
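Both metrics can be computed directly from anomaly scores without choosing a threshold. The following is a minimal pure-Python sketch, not the benchmark's implementation; it assumes that PR AUC is approximated by average precision, and the function names are illustrative.

```python
def roc_auc(y_true, scores):
    """Probability that a random anomaly scores higher than a random
    normal point (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values measured at the rank of each true anomaly.
    Tied scores are ordered by input position, a simplification."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one anomaly is required")
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))            # 0.75
print(average_precision(y_true, scores))  # 0.8333...
```

In the toy example, three of the four anomaly/normal pairs are ordered correctly, giving a ROC AUC of 0.75, while the two anomalies sit at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2.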

4. Key Empirical Results and Analysis

Overall Performance

No unsupervised algorithm is statistically superior across all tasks; IForest (Isolation Forest), COPOD, and ECOD perform consistently well but are not dominant.
Semi-supervised methods (e.g., DeepSAD, DevNet, XGBOD) rapidly surpass unsupervised detectors with as few as 1% of anomalies labeled (median ROC AUC from 71% to 75%).
Supervised models (ensemble trees, FTTransformer) require ≈10% anomaly labels to match unsupervised baselines.
Deep neural nets (DeepSVDD, DAGMM) underperform in several tabular benchmarks compared to simpler algorithms.

Sensitivity to Anomaly Type

Anomaly Type	Top Unsupervised Method (ROC AUC)
Local	LOF
Global	kNN
Dependency	kNN
Clustered	OCSVM

A significant finding is that semi-/fully supervised models often fail to outperform unsupervised baselines (LOF, kNN, OCSVM) even with up to 50% label availability, except for clustered anomalies.

Robustness to Noise and Corruption

Duplicated anomalies reduce unsupervised AUC by ≈16%; semi-/fully supervised models are robust to this shift.
Adding up to 50% irrelevant features degrades unsupervised/semi-supervised AUC by ≈10%, but supervised tree ensembles experience <5% loss due to intrinsic feature selection.
Label flipping (tested at rates up to 50%) has little effect on fully supervised models at low noise levels (<10% flips, Δ ≈ −2% ROC AUC); unsupervised methods use no labels and are therefore unaffected.

PR AUC can paradoxically increase under duplicated anomalies, while ROC AUC robustly penalizes such shifts (Han et al., 2022).
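The three corruptions discussed in this section can be sketched as simple data transformations. The code below is an illustrative approximation rather than ADBench's implementation; the function names are hypothetical, and two details are assumptions: irrelevant features are drawn as uniform noise, and `frac` is measured relative to the original number of features.

```python
import random

def duplicate_anomalies(X, y, times):
    """Repeat every anomalous row so that it appears `times` times."""
    X_out, y_out = [], []
    for row, label in zip(X, y):
        copies = times if label == 1 else 1
        X_out.extend(list(row) for _ in range(copies))
        y_out.extend([label] * copies)
    return X_out, y_out

def add_irrelevant_features(X, frac, rng):
    """Append round(frac * d) columns of uniform noise that carry no
    information about the label (d = original feature count)."""
    n_extra = round(frac * len(X[0]))
    return [list(row) + [rng.uniform(0.0, 1.0) for _ in range(n_extra)]
            for row in X]

def flip_labels(y, frac, rng):
    """Flip a random fraction `frac` of the labels (0 <-> 1)."""
    n_flip = round(frac * len(y))
    flipped = set(rng.sample(range(len(y)), n_flip))
    return [1 - label if i in flipped else label
            for i, label in enumerate(y)]

rng = random.Random(0)
X = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [9.0, 9.1, 9.2, 9.3]]
y = [0, 0, 1]

X_dup, y_dup = duplicate_anomalies(X, y, times=3)   # 2 normal rows + 3 copies
X_wide = add_irrelevant_features(X, frac=0.5, rng=rng)  # 4 -> 6 columns
y_noisy = flip_labels([0] * 9 + [1], frac=0.2, rng=rng)  # 2 labels flipped
```

Each transformation touches only one aspect of the data (row multiplicity, feature set, or labels), which is what allows the effect of each corruption to be measured in isolation.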

5. Critical Methodological Insights

Choice of anomaly detection algorithm should be data-driven and matched to expected anomaly types.
Semi-supervised models excel when only a small fraction (≤5%) of anomalies is labeled, but they should incorporate unsupervised structure when the anomalies are heterogeneous.
Ensemble techniques (tree-based, stacking) and transformer-based tabular encoders (FTTransformer) are compelling on tabular AD, though their robustness warrants further study.
Feature-selection modules and label-informed scoring mitigate impact from irrelevant dimensions and anomaly masking, respectively.

The “no free lunch” phenomenon is evident: there is no single method universally superior across all anomaly types or domains (Han et al., 2022).

6. Reproducibility and Extensibility

ADBench is fully open source (BSD-2), supplying code, default hyperparameters, and all results, enabling immediate integration of new algorithms and datasets. The platform’s extensibility supports:

Plug-in of new AD models and data sources.
Generation of critical difference diagrams for rapid method comparison.
Consistent application of the experimental protocol to alternative or future model variants.
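A critical difference diagram is built from each method's average rank across datasets. The sketch below shows that underlying computation in pure Python. It uses the classical Nemenyi-style critical difference, CD = q_alpha * sqrt(k(k+1) / (6N)), which is an assumption here and may differ from the exact procedure ADBench uses; the method names and scores are made up for illustration, and `q_alpha` must be looked up in a studentized-range table for the chosen significance level.

```python
import math

def average_ranks(scores):
    """scores: {method: [score per dataset]}, higher is better.
    Returns {method: mean rank}, where rank 1 is best and tied
    methods share the average of the ranks they occupy."""
    methods = list(scores)
    n_datasets = len(scores[methods[0]])
    totals = {m: 0.0 for m in methods}
    for j in range(n_datasets):
        ordered = sorted(methods, key=lambda m: -scores[m][j])
        pos = 0
        while pos < len(ordered):
            end = pos
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][j] == scores[ordered[pos]][j]):
                end += 1
            shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for m in ordered[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return {m: totals[m] / n_datasets for m in methods}

def critical_difference(q_alpha, k, n_datasets):
    """Two methods differ significantly if their average ranks
    differ by more than this value (Nemenyi-style rule)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Hypothetical ROC AUC values for three methods on four datasets.
scores = {
    "method_a": [0.90, 0.80, 0.70, 0.90],
    "method_b": [0.80, 0.80, 0.60, 0.70],
    "method_c": [0.70, 0.60, 0.65, 0.80],
}
ranks = average_ranks(scores)
cd = critical_difference(q_alpha=2.343, k=3, n_datasets=4)  # q for alpha=0.05, k=3
print(ranks, cd)
```

With only four datasets the critical difference (about 1.66) exceeds every gap between average ranks in this toy example, so none of the three hypothetical methods would be declared significantly better.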

ADBench’s structure also facilitates benchmarking of generative models (e.g., diffusion-based), time-series, and graph-structured data as research domains expand (Han et al., 2022).

7. Impact, Limitations, and Future Directions

ADBench establishes a new empirical standard for holistically evaluating AD algorithms. It demonstrates the practical gains from semi-supervision, the heterogeneity of anomaly type difficulty, and the necessity of data-driven algorithm selection. Open challenges remain: automated hyperparameter search, truly anomaly-type-aware semi-supervised learning, robust self-supervised representation learning for tabular AD, and domain adaptation with synthetic or generative outlier models.

Anticipated extensions include mixed anomaly-type datasets, variable contamination scenarios, graph/time-series/vision benchmarks with direct generative modeling (e.g., via diffusion), and tasks targeting fairness, interpretability, and drug discovery. ADBench thus forms a cornerstone infrastructure for rigorous, forward-looking anomaly detection research (Han et al., 2022).

References

Han et al. (2022). ADBench: Anomaly Detection Benchmark.
