TSB-AD-M: Time Series Anomaly Detection Benchmark

Updated 1 July 2025
  • TSB-AD-M Benchmark is a framework that systematically assesses anomaly detection algorithms in time series using varied supervision levels and real-world datasets.
  • The evaluation employs structured reproducibility protocols with metrics like AUCROC and AUCPR to fairly measure performance across multiple anomaly types.
  • Empirical findings reveal that even minimal labeled data can significantly boost semi-supervised methods, guiding practical improvements in detection accuracy.

The TSB-AD-M Benchmark refers to a class of rigorous evaluation frameworks for anomaly detection in time series data, with particular focus on the methodological and practical challenges inherent in diverse, real-world applications. The design principles and empirical findings underpinning this benchmark trace primarily to "ADBench: Anomaly Detection Benchmark" (2206.09426), which, although focused on tabular data, established foundational structures that now inform the scope and standards of time series benchmarks such as TSB-AD-M.

1. Definition and Scope

TSB-AD-M (Time Series Benchmark for Anomaly Detection–Multimodal or Multivariate; the expansion is the editor's term) encompasses the comprehensive and systematic assessment of anomaly detection (AD) algorithms over a diverse set of time series datasets. The benchmark's primary objectives are to:

  • Evaluate AD methods under varying supervision levels (unsupervised, semi-supervised, supervised).
  • Analyze algorithm performance across distinct anomaly types.
  • Examine robustness against noisy and corrupted data.
  • Provide reproducible, extensible, and fair evaluation protocols to guide research and deployment.

While the originating ADBench framework is tabular-focused, its methodology—valuing representation of real-world complexity and statistical rigor—establishes the template for time series AD benchmarks.

2. Algorithm Coverage and Evaluation Protocols

TSB-AD-M inherits and extends the comprehensive approach of ADBench in algorithmic diversity and fairness:

  • Algorithms Assessed: Typically covers a broad spectrum of methods, including unsupervised (e.g., PCA, LOF, OCSVM), semi-supervised, supervised (e.g., boosted tree ensembles), modern statistical (e.g., Isolation Forest, COPOD), deep learning (e.g., DeepSVDD, DAGMM), and transformer architectures. For time series, these are supplemented with sequence models, RNNs, and temporal transformers.
  • Datasets: Emphasizes large-scale, realistic datasets spanning domains such as finance, healthcare, and process monitoring, and incorporates both real and synthetically generated time series to ensure variety in length, dimensionality, and anomaly prevalence.
  • Evaluation Pipeline:
    • Employs strict inductive evaluation: splits into 70% train, 30% test (stratified by anomaly ratios).
    • Enforces reproducibility via repeated trials and averaged reporting.
    • Adopts default hyperparameters from official implementations to reduce manual tuning bias.
  • Metrics: Main comparison metrics are Area Under ROC Curve (AUCROC) and Area Under Precision-Recall Curve (AUCPR), the latter being crucial where class imbalance is severe (as in rare anomaly detection).
  • Statistical Comparisons: Uses critical difference (CD) diagrams and Wilcoxon-Holm tests (p ≤ 0.05) to assess the significance of performance differences among competing methods; a minimal sketch of the evaluation pipeline follows below.
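
As a concrete illustration of the pipeline above, the following sketch evaluates one unsupervised detector under a stratified 70/30 split and reports AUCROC and AUCPR averaged over repeated trials. The detector choice (scikit-learn's IsolationForest) and the toy data are assumptions for illustration; the benchmark's own runner scripts and dataset loaders would replace them.

```python
# Minimal sketch of a TSB-AD-M-style evaluation loop (illustrative, not the
# benchmark's actual code). Detector and toy data are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_detector(X, y, seed=0):
    # Strict inductive split: 70% train / 30% test, stratified by anomaly ratio.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Default hyperparameters, as the protocol prescribes, to avoid tuning bias.
    det = IsolationForest(random_state=seed).fit(X_tr)
    # score_samples returns "normality"; negate so higher means more anomalous.
    scores = -det.score_samples(X_te)
    return {
        "AUCROC": roc_auc_score(y_te, scores),
        "AUCPR": average_precision_score(y_te, scores),
    }

# Repeated trials with averaged reporting (toy data stands in for a real dataset).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1] += 3.0  # shift anomalies so they are detectable in the toy data

results = [evaluate_detector(X, y, seed=s) for s in range(5)]
print({m: float(np.mean([r[m] for r in results])) for m in ("AUCROC", "AUCPR")})
```

Per-dataset scores collected this way can then be compared pairwise with scipy.stats.wilcoxon, applying a Holm correction across comparisons before summarizing the outcome in a critical difference diagram.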

3. Supervision Levels and Practical Lessons

TSB-AD-M benchmarks stratify algorithm evaluation by availability of anomaly labels:

  • Unsupervised: No label information provided.
  • Semi-supervised: Small fractions of labeled anomalies (as low as 1%) are available.
  • Fully Supervised: Extensive label information present.

Empirical findings indicate that even minimal labels (1–5%) can enable semi-supervised methods to outperform the strongest unsupervised algorithms. In domains where labeling is costly, acquiring even limited labels should be prioritized for maximal gain in detection performance.
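
The label-budget protocol behind this finding can be sketched as follows: only a small fraction of the anomaly labels is revealed to a label-informed model, which is then scored on a held-out split. The masking strategy and the GradientBoostingClassifier standing in for a semi-supervised method are illustrative assumptions, not the benchmark's exact implementation.

```python
# Illustrative label-budget protocol; masking strategy and classifier are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def mask_labels(y, frac=0.01, seed=0):
    """Keep only `frac` of the anomaly labels; hidden anomalies are treated as unlabeled (0)."""
    rng = np.random.default_rng(seed)
    anomaly_idx = np.flatnonzero(y == 1)
    keep = rng.choice(anomaly_idx, size=max(1, int(frac * len(anomaly_idx))), replace=False)
    y_partial = np.zeros_like(y)
    y_partial[keep] = 1
    return y_partial

# Toy data standing in for a benchmark dataset (~5% anomaly ratio).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 8))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 3.0  # shift anomalies so they are learnable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Label-informed model trained with only ~1% of the anomaly labels revealed.
clf = GradientBoostingClassifier().fit(X_tr, mask_labels(y_tr, frac=0.01))
scores = clf.predict_proba(X_te)[:, 1]
print("AUCPR with ~1% of anomaly labels:", average_precision_score(y_te, scores))
```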

4. Anomaly Type Taxonomy and Inductive Bias

A cardinal feature is controlled evaluation across multiple anomaly types:

  1. Local anomalies: Deviations detectable only in the context of local neighborhoods.
  2. Global anomalies: Outliers relative to the distribution as a whole.
  3. Dependency anomalies: Violations of statistical inter-feature (or inter-series) relationships.
  4. Cluster/group anomalies: Anomalies forming distinct, coherent clusters.

A key insight is that the match between a model’s inductive bias and the generative process of anomalies (e.g., proximity-based models for local outliers, clustering for group anomalies) can outweigh the advantage conferred by additional labels—unless the labeled data are highly representative.
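
The taxonomy can be made concrete by generating synthetic anomalies of each type from a common base distribution, in the spirit of ADBench's controlled-anomaly experiments. The Gaussian base, the scaling factor, and the sampling ranges below are illustrative assumptions.

```python
# Illustrative synthetic anomalies from a correlated Gaussian base distribution.
# Base distribution, scaling factor, and ranges are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])  # correlated features
normal_pts = rng.multivariate_normal(mu, cov, size=500)

# Global anomalies: drawn far outside the support of the base distribution.
global_anoms = rng.uniform(low=-6.0, high=6.0, size=(10, 2))

# Local anomalies: same center, inflated covariance (alpha > 1), so they only
# stand out relative to their local neighborhood.
alpha = 5.0
local_anoms = rng.multivariate_normal(mu, alpha * cov, size=10)

# Dependency anomalies: break the inter-feature relationship by permuting one
# coordinate independently of the other.
dep_anoms = normal_pts[:10].copy()
dep_anoms[:, 1] = rng.permutation(dep_anoms[:, 1])

# Cluster/group anomalies: a small, coherent cluster displaced from the bulk.
cluster_anoms = rng.multivariate_normal(mu + 4.0, 0.1 * cov, size=10)
```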

5. Robustness to Data Noise and Corruption

TSB-AD-M tests algorithmic resilience to three types of data perturbation:

  • Duplicated anomalies (masking): Artificial replication of anomaly instances diminishes unsupervised method performance (median AUCROC drop of roughly 16.4%), whereas label-informed methods remain robust.
  • Irrelevant feature injection: Addition of random, non-informative features. Supervised methods with embedded feature selection (such as tree ensembles) show superior robustness.
  • Annotation errors (label flipping): Semi-supervised and supervised approaches remain robust to small amounts of label noise (≲5%).

Consistent, programmatic perturbation ensures fairness and meaningful comparison across algorithms and datasets.
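
A minimal sketch of these three perturbation operators, applied uniformly to every dataset, is given below. The function names and default fractions are illustrative assumptions rather than the benchmark's exact settings.

```python
# Illustrative perturbation operators; names and default fractions are assumptions.
import numpy as np

def duplicate_anomalies(X, y, times=2):
    """Replicate anomaly instances so density-based detectors no longer see them as rare."""
    idx = np.flatnonzero(y == 1)
    X_dup = np.concatenate([X] + [X[idx]] * times)
    y_dup = np.concatenate([y] + [y[idx]] * times)
    return X_dup, y_dup

def add_irrelevant_features(X, frac=0.5, seed=0):
    """Append random, non-informative columns amounting to `frac` of the original width."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], max(1, int(frac * X.shape[1]))))
    return np.hstack([X, noise])

def flip_labels(y, frac=0.05, seed=0):
    """Flip a small fraction of binary labels to simulate annotation errors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y_noisy = y.copy()
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy
```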

6. Open Source and Reproducibility Commitments

A central tenet is full openness:

  • All code, preprocessed datasets, experiment scripts, and result summaries are released under permissive licenses.
  • Experiment environments specify strict dependency versions for reproducibility.
  • Modular pipelines allow for the straightforward integration of new detection algorithms or datasets.
  • Statistical results—including metrics tables and significance test outputs—are published to support secondary research such as meta-learning or automated model selection.

7. Limitations and Future Challenges

Several open problems are identified for ongoing and future benchmarks modeled after ADBench:

  • No single unsupervised method is universally dominant; algorithm selection or ensembling is imperative.
  • Sophisticated semi-supervised paradigms are needed to harness sparse labels efficiently.
  • Algorithms must better leverage or infer anomaly type priors in the absence of labels.
  • There is a pressing requirement for robust unsupervised methods that remain effective amid data noise and high dimensionality.
  • Large-scale, statistically rigorous, and reproducible benchmarking practices (e.g., as exemplified by critical difference diagrams) should become standard in the field.

Summary Table: Characteristic Comparison with Predecessors

| Benchmark | # Datasets | # Algorithms | Real/Synthetic Data | Shallow/DL | Supervised Modes | Types of Anomaly |
|---|---|---|---|---|---|---|
| ADBench/TSB-AD-M | 57 | 30 | Both | Yes/Yes | Yes | Yes |
| Others | <23 | <19 | Typically one | Varies | Mostly unsupervised | No |

Conclusion

TSB-AD-M, following the structural and empirical guidance of ADBench (2206.09426), sets a rigorous standard for time series anomaly detection benchmarking by encompassing a broad range of supervision, anomaly types, and data corruptions. Its reproducible and extensible design equips researchers and practitioners with the means to perform comprehensive, transparent, and fair comparisons, advancing the science and application of anomaly detection in complex, real-world time series environments.

References

  • ADBench: Anomaly Detection Benchmark (arXiv:2206.09426)