ADBench: Benchmark for Anomaly Detection
- ADBench is an open-source benchmarking framework that evaluates anomaly detection algorithms across diverse data modalities, including tabular, computer vision, and natural language data.
- It consolidates 57 real-world datasets and 30 algorithms under unsupervised, semi-supervised, and supervised settings to ensure reproducible and fair comparisons.
- The framework employs robust evaluation metrics and rigorous protocols, providing actionable insights into algorithm performance under various anomaly ratios and data corruptions.
ADBench
ADBench is an open-source, large-scale benchmarking framework for evaluating anomaly detection (AD) algorithms across varied domains, data modalities, supervision regimes, and anomaly types. It is widely adopted as the primary benchmark suite for unsupervised, semi-supervised, and supervised outlier detection, and serves as a foundation for principled, reproducible comparison in algorithm selection and AD research (Han et al., 2022).
1. Dataset Composition and Scope
ADBench brings together 57 real-world datasets spanning four principal domains:
- Tabular (Classic): 47 datasets from network security, fraud detection, healthcare, census, manufacturing, and standard UCI benchmarks. Dataset sizes range from hundreds to over 600,000 samples, with dimensionalities from 3 to 500 and anomaly ratios (ρ) from 0.03% to ~40%. Examples: ALOI, annthyroid, backdoor, census, fraud, magic.gamma, mammography, PageBlocks, satellite, skin, SpamBase, thyroid, Wilt.
- Computer Vision (Pre-extracted Embeddings): 5 datasets in which images are represented by embeddings precomputed with pretrained networks such as ResNet-18: CIFAR-10, FashionMNIST, MNIST-C, MVTec-AD, SVHN.
- Natural Language (BERT Embeddings): 5 datasets, all with 768-dimensional BERT features: AgNews, Amazon, IMDb, Yelp, 20Newsgroups.
- Others: Further domains such as speech and web advertising are represented by datasets within the classic tabular collection (e.g., speech, InternetAds, celeba).
For each dataset, the benchmark specifies the sample count n, the feature dimensionality d, and the anomaly ratio ρ, with all splits and preprocessing scripts provided for consistency (Han et al., 2022).
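As a concrete illustration, the per-dataset statistics n, d, and ρ can be computed directly from an ADBench-style dataset. The repository distributes datasets as `.npz` files with a feature matrix `X` and binary labels `y` (1 = anomaly); the file path in the comment below is illustrative, and a synthetic stand-in is used here so the sketch is self-contained:

```python
import numpy as np

def dataset_stats(X, y):
    """Return (sample count n, feature dimensionality d, anomaly ratio rho)."""
    n, d = X.shape
    rho = float(np.mean(y == 1))  # fraction of samples labeled anomalous
    return n, d, rho

# In practice one would load an ADBench dataset, e.g. (path illustrative):
#   data = np.load("adbench/datasets/Classical/annthyroid.npz")
#   X, y = data["X"], data["y"]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))             # stand-in feature matrix
y = (rng.random(1000) < 0.05).astype(int)  # roughly 5% anomalies

n, d, rho = dataset_stats(X, y)
```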
2. Algorithm Suite and Supervision Regimes
ADBench evaluates 30 algorithms, encompassing the following categories:
- Unsupervised Methods (14): Principal Component Analysis (PCA-reconstruction), One-Class SVM (RBF kernel), Local Outlier Factor (LOF), Cluster-Based LOF (CBLOF), Connectivity-Based Outlier Factor (COF), Histogram-based Outlier Score (HBOS), k-Nearest Neighbors (kNN), Subspace Outlier Detection (SOD), Copula-based Outlier Detection (COPOD), Empirical CDF Outlier Detection (ECOD), DeepSVDD, Deep Autoencoding Gaussian Mixture Model (DAGMM), LODA, Isolation Forest.
- Semi-Supervised Methods (7): GANomaly, DeepSAD, REPEN, DevNet, PReNet, FEAWAD, XGBOD.
- Supervised Classifiers (9): Naïve Bayes, SVM, MLP, Random Forest, XGBoost, LightGBM, CatBoost, ResNet (tabular-adapted), FTTransformer.
All methods are benchmarked using established codebases such as PyOD and scikit-learn, together with custom implementations. Hyperparameter tuning is minimal: most methods use published defaults to ensure fairness and reproducibility (Han et al., 2022).
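To make one of the unsupervised baselines concrete, the kNN detector scores each test point by its distance to the k-th nearest training neighbor. The following is a minimal NumPy sketch of that idea, not PyOD's actual implementation (which differs in indexing structures and options):

```python
import numpy as np

def knn_scores(X_train, X_test, k=5):
    """Anomaly score = distance to the k-th nearest training neighbor.

    Minimal sketch of the kNN detector among ADBench's unsupervised
    baselines; larger scores indicate more anomalous points.
    """
    # Pairwise Euclidean distances, shape (n_test, n_train).
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)
    # k-th smallest distance per test point.
    return np.sort(d, axis=1)[:, k - 1]

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))          # dense "normal" cluster
X_test = np.vstack([np.zeros((1, 2)),        # inlier at the cluster center
                    np.full((1, 2), 8.0)])   # far-away outlier
scores = knn_scores(X_train, X_test, k=5)
```

The far-away point receives a much larger score than the central inlier, which is the behavior the benchmark's global-anomaly results rely on.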
3. Experimental Design and Protocols
ADBench adopts a rigorous inductive evaluation protocol:
- Splitting: Each dataset is stratified into 70% train and 30% test sets, with three repetitions per dataset. Purely unsupervised methods fit on the unlabeled training set, which contains both normal and anomalous samples; semi-supervised settings add varying proportions of labeled anomalies to the training data; supervised methods fit on all labeled training data.
- Noisy/Corrupted Data: Three robustness axes are assessed: (1) Duplicated anomalies (train and test anomalies replicated up to 6×), (2) Addition of irrelevant features (up to +50% dimensions), (3) Annotation errors (label flips up to 50% of labeled set).
- Preprocessing: All datasets are standardized (zero mean, unit variance) for each feature on the training set.
- Reproducibility: All code, data splits, and experimental results are publicly released; 98,436 experiments are conducted in the canonical study (Han et al., 2022).
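The protocol above can be sketched in a few NumPy helpers: a stratified 70/30 split that preserves the anomaly ratio, standardization using training-set statistics only, and one of the robustness corruptions (anomaly duplication). This is a simplified illustration of the protocol, not ADBench's released scripts:

```python
import numpy as np

def stratified_split(y, train_frac=0.7, seed=0):
    """Stratified index split preserving the anomaly ratio.

    Simplified sketch of the 70/30 protocol; the benchmark repeats
    this with three different seeds.
    """
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in (0, 1):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train_idx.append(idx[:cut])
        test_idx.append(idx[cut:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling fit on the training set only."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    return (X_train - mu) / sd, (X_test - mu) / sd

def duplicate_anomalies(X, y, factor=6):
    """Robustness corruption: replicate each anomalous row `factor` times."""
    anom = np.flatnonzero(y == 1)
    extra = np.repeat(anom, factor - 1)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = np.array([1] * 10 + [0] * 90)            # 10% anomaly ratio
tr, te = stratified_split(y)
Xtr, Xte = standardize(X[tr], X[te])
Xd, yd = duplicate_anomalies(X, y, factor=6)
```

Fitting the scaler on the training set alone keeps the evaluation inductive: test-set statistics never leak into preprocessing.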
4. Evaluation Metrics
ADBench employs the following metrics:
- AUCROC (Area Under the Receiver Operating Characteristic Curve): the area under the curve of true-positive rate against false-positive rate across all score thresholds, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
- AUCPR (Area Under the Precision–Recall Curve): the area under the curve of precision against recall across all score thresholds, where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
- F1-score: Standard harmonic mean of precision and recall.
- Relative/aggregate metrics for multi-method comparison: Average rank, win-rate, Elo rating, rescaled AUC (rAUC), and Champion Delta as introduced in recent literature (Ding et al., 3 Feb 2026).
Metrics are calculated consistently across all datasets and experiments to permit fair aggregate comparison.
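Both headline metrics can be computed directly from anomaly scores. The sketch below uses the standard rank-statistic form of AUCROC and the average-precision form of AUCPR; it assumes untied scores for simplicity (library implementations such as scikit-learn's handle ties and interpolation more carefully):

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUCROC via the Mann-Whitney rank statistic (assumes untied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(len(order))
    ranks[order] = np.arange(1, len(order) + 1)   # 1-based ranks by score
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_pr(y_true, scores):
    """AUCPR as average precision: mean precision at each positive's rank."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = y_true[order] == 1
    precision_at_k = np.cumsum(hits) / np.arange(1, len(order) + 1)
    return precision_at_k[hits].mean()

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
roc = auc_roc(y, s)   # 0.75: 3 of 4 positive/negative pairs correctly ordered
pr = auc_pr(y, s)     # (1 + 2/3) / 2, i.e. about 0.833
```

Unlike AUCROC, AUCPR depends on the anomaly ratio, which is why ADBench reports both across datasets with very different ρ.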
5. Empirical Findings and Methodological Insights
- No Dominant Unsupervised Method: Critical-difference diagrams reveal that no single unsupervised algorithm (shallow or deep) is universally optimal across ADBench's 57 datasets; the performance ordering is dataset- and anomaly-type specific, and no method is statistically significantly better than all others (Han et al., 2022).
- Specialized Methods for Anomaly Types: LOF performs best for "local" anomalies, kNN for "global" anomalies, COPOD/ECOD for distributional tail anomalies, and Isolation Forest/LODA for high-dimensional scenarios.
- Supervision Benefits: Semi-supervised methods (e.g., DeepSAD, DevNet, XGBOD) outperform both unsupervised methods and supervised classifiers when only a small fraction of anomalies is labeled. With moderate to full labeling, tree-based ensembles and FTTransformer attain state-of-the-art performance.
- Robustness: Unsupervised methods degrade substantially (−16% median ΔAUCROC at 6× anomaly duplication), while (semi-)supervised methods exhibit significantly improved stability under duplicated anomalies, irrelevant features, and label noise.
- High-Dimensional Modalities: For computer vision and NLP tasks, it is standard to use domain-specific embeddings (e.g., ResNet18, BERT) and apply tabular AD methods to the resultant vectors.
6. Impact, Research Extensions, and Derived Benchmarks
ADBench has established itself as the reference framework for AD benchmarking:
- Derived Large-Scale Evaluations: ADBench forms the primary evaluation platform for foundation model methods (e.g., OUTFORMER, FoMo-0D), diffusion-based anomaly detection, advanced kernel and density-based approaches, and tabular outlier detection with curriculum learning (Livernoche et al., 2023, Kozdoba et al., 2023, Sattarov et al., 1 Aug 2025, Ding et al., 3 Feb 2026).
- Modality-specific Extensions: Its methodology underpins contemporary benchmarks targeting natural language (NLP-ADBench, Text-ADBench), where domain-specific AD is cast as anomaly detection on language embeddings, with analogous evaluation metrics and experimental rigor (Li et al., 2024, Xiao et al., 16 Jul 2025).
- Algorithmic Best Practices: Strong, routinely used baselines are identified for each domain and anomaly type, serving as templates for ablation studies and method evaluations in the literature.
ADBench's datasets, evaluation scripts, and empirical results have become essential for rigorous validation of novel AD algorithms, model selection strategies (e.g., meta-learning-based automated detector selection (Li et al., 2024)), and robustness studies.
7. Guidelines and Future Directions
- Algorithm Selection: Choose AD algorithms by first identifying the dominant anomaly type and data modality; employ LOF for local, kNN for global, COPOD/ECOD for tail, IForest/LODA for high-dimensional, and appropriate embedded representations for CV/NLP.
- Supervision Scaling: Where labeled anomalies are scarce, semi-supervised methods are the most label-efficient; supervised tree-based models surpass them once labeling is moderate to full.
- Model Selection and Meta-AD: Automated model selection for unsupervised AD (meta-learning/AutoML) remains an open challenge, with current research leveraging meta-level dataset properties and detector predictions (Han et al., 2022, Li et al., 2024).
- Broader Modalities and Robustness: Major extensions include application to time series, graph data, OOD/open-set detection, and synthetic anomaly generation via diffusion or contrastive models.
- Reproducibility: The framework's focus on open-source code, default parameterization, and statistical protocol permits reliable reproduction and extension for new methods and settings.
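The selection guidance above can be condensed into a rule-of-thumb lookup. This is a heuristic starting point mirroring the recommendations in this section, not an ADBench API; the function name and category strings are invented for illustration:

```python
def recommend_detector(anomaly_type, n_labeled_anomalies=0):
    """Heuristic detector choice following the guidance above.

    `anomaly_type` is one of {"local", "global", "tail", "high_dim"};
    the returned strings name method families, not a formal interface.
    """
    if n_labeled_anomalies > 0:
        # Even a few labeled anomalies favor semi-supervised methods.
        return "semi-supervised (DeepSAD / DevNet / XGBOD)"
    unsupervised = {
        "local": "LOF",
        "global": "kNN",
        "tail": "COPOD/ECOD",
        "high_dim": "IForest/LODA",
    }
    # Isolation Forest is a reasonable default when the anomaly type
    # is unknown, given its consistent performance across datasets.
    return unsupervised.get(anomaly_type, "IForest")

choice = recommend_detector("local")
```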
ADBench stands as the authoritative resource for comprehensive, rigorous, and fair comparison of anomaly detection algorithms across modalities and supervision regimes (Han et al., 2022).