ADBench: Anomaly Detection Benchmark Suite

Updated 21 April 2026

ADBench is a comprehensive, open-source benchmark suite for anomaly detection that aggregates 57 diverse real-world tabular datasets for systematic evaluation.
It standardizes comparisons across various supervision regimes and anomaly types, using clear metrics like AUC-ROC and AUC-PR to guide method selection.
Empirical findings reveal performance nuances, highlighting both the strengths and biases of current methods and motivating future enhancements in anomaly detection research.

ADBench is a comprehensive, open-source benchmark suite for anomaly detection (AD) algorithms, centered on real-world tabular data. It provides a unifying empirical platform for comparing diverse AD methods across varying degrees of supervision, anomaly types, and noise/corruption regimes, with an emphasis on methodological rigor and reproducibility. ADBench and its derivatives are recognized as the de facto standard for systematic evaluation in the anomaly detection literature (Han et al., 2022, Ding et al., 10 Feb 2026).

1. Dataset Composition and Characteristics

ADBench consists of 57 datasets aggregated from several widely used anomaly detection repositories: ODDS (26 datasets), DAMI (19), Emmott (11), and DevNet (6). It also includes 6 image-embedding and 5 text-embedding datasets, which are treated as high-dimensional tabular data (Ding et al., 10 Feb 2026, Han et al., 2022). Modalities therefore span purely tabular tasks as well as tabularized representations of computer vision (CV) and NLP corpora.

Key statistical attributes are:

Statistic	Range	Mean	Median
Samples	80 – 619,326	28,013	5,393
Feature dim	3 – 1,555	371	512
Outlier fraction	0.03% – 39.9%	10.0%	5.0%

Coverage includes medical, finance, manufacturing, intrusion detection, and science domains. However, recent critical appraisals show that ADBench is dominated by low-dimensional settings (about 50% of datasets have feature dimension $d<20$) and by “global, Gaussian-noise–like” outliers, which can bias evaluations toward distance-based algorithms (Ding et al., 10 Feb 2026).

2. Types of Anomalies and Task Construction

Datasets capture a range of outlier types:

Semantic anomalies: Defined by the labels of real-world tasks (e.g., fraud, intrusion, faults).
Statistical outliers: Introduced through One-vs-Rest repurposing of multiclass data.
Synthetic anomalies: For controlled ablations, four anomaly generation modes are used:
- Global: Points drawn uniformly from a range that extends beyond the typical feature values.
- Local: Points that are normal globally but anomalous within their neighborhood.
- Dependency: Violations in multivariate statistical dependencies.
- Cluster: Small, compact clusters distinct from main clusters (Han et al., 2022, Kar et al., 3 Feb 2026).
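
The four synthetic modes above can be illustrated with a deliberately simplified sketch. It is not ADBench's generator: it fits a single Gaussian instead of a mixture model, breaks dependencies by resampling each feature independently, and the function name `make_synthetic_anomalies` and the parameters `alpha`, `scale`, and `shift` are assumptions chosen for illustration only.

```python
import numpy as np

def make_synthetic_anomalies(X, n, kind, rng, alpha=1.1, scale=5.0, shift=5.0):
    """Draw n synthetic anomalies from normal data X (n_samples x d, d >= 2).

    Simplified illustration; alpha, scale and shift are assumed values.
    """
    d = X.shape[1]
    mean, std = X.mean(axis=0), X.std(axis=0)
    cov = np.cov(X, rowvar=False)
    if kind == "global":
        # Uniform samples from the observed bounding box stretched by alpha.
        lo, hi = X.min(axis=0), X.max(axis=0)
        center, half = (lo + hi) / 2.0, (hi - lo) / 2.0
        return rng.uniform(center - alpha * half, center + alpha * half, size=(n, d))
    if kind == "local":
        # Same location as the normal data, but with an inflated covariance.
        return rng.multivariate_normal(mean, scale * cov, size=n)
    if kind == "cluster":
        # Same covariance, mean moved `shift` standard deviations away.
        return rng.multivariate_normal(mean + shift * std, cov, size=n)
    if kind == "dependency":
        # Resample each feature independently: marginals are preserved,
        # the dependencies between features are destroyed.
        rows = rng.integers(0, X.shape[0], size=(n, d))
        return X[rows, np.arange(d)]
    raise ValueError(f"unknown anomaly kind: {kind}")
```

A call such as `make_synthetic_anomalies(X, 20, "dependency", np.random.default_rng(0))` returns a 20 x d array that can be appended to the normal data with label 1.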

Corruption/noise experiments encompass duplicated anomalies, addition of irrelevant features, and randomized label errors. For synthetic studies, ADBench enables injection of anomalies and noise at controlled intensities.
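
The three corruption types named above can be sketched as follows. This is a minimal illustration under stated assumptions, not ADBench's own code: the function names, the convention that `ratio` is the share of noise columns in the final feature set, and the use of uniform noise over the global value range are all hypothetical choices. Labels are assumed to be an integer array with 1 marking anomalies.

```python
import numpy as np

def duplicate_anomalies(X, y, times, rng):
    """Make every anomalous row appear `times` times in total, then shuffle."""
    reps = np.where(y == 1, times, 1)
    Xd, yd = np.repeat(X, reps, axis=0), np.repeat(y, reps)
    perm = rng.permutation(len(yd))
    return Xd[perm], yd[perm]

def add_irrelevant_features(X, ratio, rng):
    """Append uniform-noise columns so they form `ratio` of all features (ratio < 1)."""
    d = X.shape[1]
    n_noise = int(round(d * ratio / (1.0 - ratio)))
    noise = rng.uniform(X.min(), X.max(), size=(X.shape[0], n_noise))
    return np.hstack([X, noise])

def flip_labels(y, ratio, rng):
    """Flip a randomly chosen `ratio` of the binary labels."""
    y_noisy = y.copy()
    n_flip = int(round(ratio * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy
```

Keeping each corruption in its own function lets the intensity (`times` or `ratio`) be swept independently, which is how a controlled robustness study would typically be organized.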

3. Evaluation Protocols and Metrics

ADBench provides a unified protocol for method evaluation under inductive learning settings:

Supervision regimes:
- Unsupervised: Training on unlabeled data, possibly containing anomalies.
- One-class: Training set of normal points only.
- Semi-supervised: A small fraction of anomalies labeled in training.
- Supervised: Full label access (for comparative purposes).
Splits: The canonical protocol uses a stratified 70%/30% train/test split, repeated three times with different random seeds. The datasets themselves are distributed unsplit, however, and methods are expected to handle the entire dataset in an unsupervised fashion unless the regime specifies otherwise (Ding et al., 10 Feb 2026).
Metrics:
- Area Under the Receiver Operating Characteristic curve (AUC-ROC): $\operatorname{AUC} = \int_{0}^{1} \operatorname{TPR}(\operatorname{FPR}^{-1}(\tau))\,d\tau$
- Area Under the Precision-Recall Curve (AUC-PR), computed as average precision: $\mathrm{AP} = \int_{0}^{1} p(r)\,dr,\; p(r) = \max_{t:r(t)=r}\mathrm{precision}(t)$
- F1 Score: Evaluated at the threshold on the precision-recall curve that maximizes F1.
- Aggregate metrics: Average AUC, mean per-method ranking over datasets, and statistical tests (Wilcoxon–Holm, Friedman) are used for global comparison (Han et al., 2022, Ding et al., 10 Feb 2026, Kozdoba et al., 2023).
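
The protocol above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the benchmark's implementation: the function names are hypothetical, the split rounds the per-class test size, AUC-ROC is computed through its equivalent pairwise-ranking form, and the average-precision and F1 routines assume scores without ties.

```python
import numpy as np

def stratified_split(y, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with the anomaly ratio preserved in both parts."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.sort(np.concatenate(train_parts)), np.sort(np.concatenate(test_parts))

def auc_roc(y_true, scores):
    """Probability that a random anomaly outscores a random normal point (ties count 1/2)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # O(n_pos * n_neg) memory; fine for a sketch
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def _ranked_hits(y_true, scores):
    """Labels sorted by decreasing score."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    return y_true[np.argsort(-scores, kind="stable")]

def average_precision(y_true, scores):
    """Step-wise AP: mean of the precision values at the rank of each true anomaly."""
    hits = _ranked_hits(y_true, scores)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision * hits).sum() / hits.sum())

def best_f1(y_true, scores):
    """Maximum F1 over all top-k cut-offs, using F1 = 2TP / (k + n_pos)."""
    hits = _ranked_hits(y_true, scores)
    tp = np.cumsum(hits)
    k = np.arange(1, len(hits) + 1)
    return float((2.0 * tp / (k + hits.sum())).max())

# Toy check with four points (labels and scores are made up for illustration).
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])
print(auc_roc(y_toy, s_toy))                      # 0.75
print(round(average_precision(y_toy, s_toy), 4))  # 0.8333
print(best_f1(y_toy, s_toy))                      # 0.8

# Three repetitions of the 70/30 split, as in the canonical protocol.
y_all = np.array([0] * 90 + [1] * 10)
splits = [stratified_split(y_all, test_size=0.3, seed=s) for s in (0, 1, 2)]
```

For the one-class regime, the training indices would additionally be filtered to normal points, for example `train_idx[y_all[train_idx] == 0]`.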

4. Canonical Algorithm Families and Scoring Functions

ADBench implements and benchmarks a broad variety of algorithms, including:

Classical statistical/density-based: ECOD, COPOD, HBOS, PCA
Distance-based: kNN, LOF
Reconstruction-based: Autoencoders, VAE, DAGMM
Ensemble-based: Isolation Forest, LODA
Deep and label-informed methods: DeepSVDD, DAGMM, DevNet, RepEN, XGBOD, FTTransformer, fully supervised tree boosting

Notable baseline scoring functions include:

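Method	Scoring function
HBOS	$\prod_{j=1}^{d} \frac{1}{\text{hist}_j(x_j) + \epsilon}$
kNN	$\|x - x_{(k)}\|_2$
LOF	$\frac{\sum_{o \in N_k(x)} \mathrm{lrd}(o)}{k \cdot \mathrm{lrd}(x)}$
PCA	$\|x - WW^\top x\|_2^2$
DeepSVDD	$\| \phi_\theta(x) - c \|$

Here $\text{hist}_j$ is the histogram density estimate of feature $j$, $x_{(k)}$ is the $k$-th nearest neighbor of $x$, $N_k(x)$ is the set of its $k$ nearest neighbors, $\mathrm{lrd}$ is the local reachability density, $W$ holds the retained principal directions, and $c$ is the center of the hypersphere in the learned embedding $\phi_\theta$. In every case a larger score indicates a more anomalous point.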
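
The four classical scores in the table can be written directly in NumPy. The sketch below is illustrative only and does not reproduce any library's implementation: the function names and default parameters are assumptions, HBOS is returned on a log scale (the logarithm of the product in the table, which preserves the ranking), PCA centers the data first, and LOF scores a dataset against itself while ignoring ties in the neighbor distances.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def knn_score(X_train, X_test, k=5):
    """Distance from each test point to its k-th nearest training point.

    If X_test is X_train itself, the nearest neighbor is the point at distance 0.
    """
    return np.sort(pairwise_dist(X_test, X_train), axis=1)[:, k - 1]

def pca_score(X_train, X_test, n_components):
    """Squared reconstruction error after projecting onto the top principal directions."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    W = Vt[:n_components].T  # d x q matrix of principal directions
    Z = X_test - mu
    return ((Z - Z @ W @ W.T) ** 2).sum(axis=1)

def hbos_score(X_train, X_test, bins=10, eps=1e-9):
    """Log of prod_j 1 / (hist_j(x_j) + eps), with one histogram per feature."""
    log_score = np.zeros(X_test.shape[0])
    for j in range(X_train.shape[1]):
        density, edges = np.histogram(X_train[:, j], bins=bins, density=True)
        # Digitizing against the interior edges yields a bin index in [0, bins - 1].
        h = density[np.digitize(X_test[:, j], edges[1:-1])]
        outside = (X_test[:, j] < edges[0]) | (X_test[:, j] > edges[-1])
        h = np.where(outside, 0.0, h)  # zero density outside the training range
        log_score += -np.log(h + eps)
    return log_score

def lof_score(X, k=5):
    """Local outlier factor of each row of X relative to the other rows."""
    D = pairwise_dist(X, X)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]
    d_nbrs = np.take_along_axis(D, nbrs, axis=1)
    k_dist = d_nbrs[:, -1]  # distance to the k-th neighbor
    reach = np.maximum(d_nbrs, k_dist[nbrs])  # reach-dist(x, o) = max(d(x, o), k-dist(o))
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)
    return lrd[nbrs].mean(axis=1) / lrd
```

The brute-force distance matrix keeps the code short but needs memory quadratic in the number of points, so a tree- or index-based neighbor search would be the natural replacement on larger datasets.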

Advanced methods include DTE-NP (diffusion-time estimation, closely related in ranking to kNN); foundation models (FoMo-0D, OUTFORMER, ICLAD); and Sobolev regularized pre-densities (SOSREP) (Livernoche et al., 2023, Ding et al., 3 Feb 2026, Wei et al., 19 Mar 2026, Kozdoba et al., 2023).

5. Principal Empirical Findings

Several large-scale comparative studies reveal the following:

No strict unsupervised winner: Critical-difference diagrams show that no single unsupervised method (e.g., LOF, IForest, ECOD, kNN) is statistically superior across the 47 tabular sets (Han et al., 2022).
Nature of anomaly matters: Performance is anomaly-type specific; LOF excels for local, kNN for global, OCSVM for cluster anomalies.
Semi-supervised methods win with minimal labels: As few as 1% labeled anomalies enable semi-supervised methods to surpass the best unsupervised detectors. Tree-based and transformer models dominate under label guidance (Han et al., 2022).
Deep tabular FMs, diffusion, and kernel models: Foundation models such as OUTFORMER and ICLAD, mean-shift (MSDE), diffusion models (DTE), and Sobolev pre-densities (SOSREP) achieve high AUCs, fast inference, and robust average ranks. For example:
- OUTFORMER: Avg. rank 2.26, rAUC = 0.986, winrate = 0.73 (Ding et al., 3 Feb 2026)
- SOSREP: Mean AUC-ROC = 0.792, rank = 17.6/18 (Kozdoba et al., 2023)
- MSDE: AUC-ROC = 0.922 (average), robust under noise corruption (Kar et al., 3 Feb 2026)
- ICLAD: One-class AUC-ROC = 0.8397, unsupervised = 0.7517, semi-supervised with 10% labels ≈ 0.85 (Wei et al., 19 Mar 2026)
Likelihood inversion is absent in tabular AD: The "counterintuitive phenomenon" where likelihood-based generative models assign high likelihood to anomalies is not observed in ADBench tabular sets, in contrast to image/OOD domains (Kim et al., 10 Feb 2026).
Strong performance from distance-based baselines: Non-parametric DTE and kNN match neural and ensemble methods, indicating that many datasets are dominated by globally distinct outliers (Ding et al., 3 Feb 2026, Livernoche et al., 2023, Ding et al., 10 Feb 2026).
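
The average ranks quoted in these findings come from ranking methods per dataset and averaging across datasets. A minimal sketch of that aggregation is given below; the function names are hypothetical and the AUC values in the example are made up purely to show the tie handling, not taken from any cited study.

```python
import numpy as np

def rank_descending(values):
    """Rank 1 is the highest value; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def average_ranks(auc_table):
    """auc_table has shape (n_datasets, n_methods); returns one mean rank per method."""
    return np.vstack([rank_descending(row) for row in auc_table]).mean(axis=0)

# Hypothetical AUC-ROC values: rows are datasets, columns are methods.
toy_auc = np.array([
    [0.90, 0.85, 0.85],
    [0.70, 0.80, 0.60],
    [0.95, 0.95, 0.50],
])
print(average_ranks(toy_auc))  # mean ranks 1.5, 1.6667, 2.8333 (up to rounding)
```

These per-method mean ranks are the quantity that Friedman-type tests and critical-difference diagrams compare.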

6. Limitations, Demystification, and Practical Guidance

Analytical studies reveal systematic biases and shortcomings:

Limited diversity and scale: ADBench’s 57 datasets are insufficient for reliable method ablation; t-SNE meta-embedding shows a tight cluster of tasks, with under-representation of high-dimensional and complex anomalies (Ding et al., 10 Feb 2026).
Bias towards Gaussian/global outliers: The benchmark is dominated by global outliers that suit kNN/DTE scoring, so more sophisticated methods do not show universal gains on it.
Duplications and inconsistent data: The suite contains near-duplicate datasets and datasets whose anomaly counts are too small to be reliable (e.g., only 10 anomalies in Wine).
No standardized splits or private leaderboards: Results can be sensitive to the inclusion or exclusion of a single dataset or a group of similar datasets, and there is a risk of overfitting to known benchmarks.
Noise and corruption: Supervised models exhibit strong robustness against up to 50% irrelevant features and label flips; unsupervised ones degrade more severely.

Despite these caveats, ADBench remains the primary reference for unsupervised and weakly-supervised anomaly detection. Method selection guidance is provided based on expected anomaly-type, data quality, and available label fraction.

7. Successors and Directions for Benchmarking

ADBench’s influence extends to specialized and next-generation benchmarks:

NLP-ADBench: Expands the ADBench paradigm to textual anomaly detection, with curated one-class tasks mapped from major text corpora, and systematic transformer-embedding–based baselines (Li et al., 2024).
MacrOData: Addresses ADBench’s scale and diversity limitations with 2,400+ tabular datasets, standardized splits, and a public/private leaderboard (Ding et al., 10 Feb 2026).
Algorithmic advances: ADBench has catalyzed progress in foundation-model–based anomaly detection (OUTFORMER, ICLAD), curriculum/meta-learning, and interpretable statistical scoring for tabular anomaly detection (Ding et al., 3 Feb 2026, Wei et al., 19 Mar 2026).
Benchmarks for other domains: Suites for time series, graphs, and realistic application scenarios are in active development.

A key implication is that new algorithms should be validated on more statistically diverse benchmarks, beyond ADBench, for credible performance reporting. ADBench’s legacy is the standardization of evaluation and transparent empirical comparison in anomaly detection research.