
Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress (2009.13807v5)

Published 29 Sep 2020 in cs.LG and stat.ML

Abstract: Time series anomaly detection has been a perennially important topic in data science, with papers dating back to the 1950s. However, in recent years there has been an explosion of interest in this topic, much of it driven by the success of deep learning in other domains and for other time series tasks. Most of these papers test on one or more of a handful of popular benchmark datasets, created by Yahoo, Numenta, NASA, etc. In this work we make a surprising claim. The majority of the individual exemplars in these datasets suffer from one or more of four flaws. Because of these four flaws, we believe that many published comparisons of anomaly detection algorithms may be unreliable, and more importantly, much of the apparent progress in recent years may be illusionary. In addition to demonstrating these claims, with this paper we introduce the UCR Time Series Anomaly Archive. We believe that this resource will perform a similar role as the UCR Time Series Classification Archive, by providing the community with a benchmark that allows meaningful comparisons between approaches and a meaningful gauge of overall progress.

Citations (176)

Summary

  • The paper reveals that popular anomaly detection benchmarks suffer from triviality, unrealistic anomaly densities, mislabeled ground truth, and run-to-failure bias.
  • The paper demonstrates that simple MATLAB code can solve supposedly complex benchmark problems, questioning the effectiveness of advanced models.
  • The paper introduces the UCR Time Series Anomaly Archive to offer diverse, rigorously validated datasets for a more accurate assessment of anomaly detection methods.

Critique of Current Time Series Anomaly Detection Benchmarks

Wu and Keogh's paper provides a critical examination of time series anomaly detection benchmarks, revealing substantial flaws that undermine the reliability of these benchmarks and potentially mislead assessments of algorithmic progress. The authors assert that popular datasets, such as those developed by Yahoo, Numenta, and NASA, suffer from four primary weaknesses: triviality, unrealistic anomaly density, mislabeled ground truth, and run-to-failure bias. These flaws contribute to the possibility that reported advancements in anomaly detection might be illusory.

Key Findings

The authors highlight "triviality" as a core flaw, demonstrating that many problems in the existing benchmarks can be solved with a single line of MATLAB code rather than a sophisticated deep learning model. For instance, Wu and Keogh show that the Yahoo benchmark can be addressed with minimal computational effort, which casts doubt on the claimed effectiveness of the complex algorithms developed within the community.
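
To make the triviality claim concrete, the sketch below gives a Python analogue of such a one-line detector, flagging the point with the largest one-step change. It is an illustration of the idea only, not the authors' actual MATLAB code, and the toy series is synthetic data invented for the example.

```python
import numpy as np

# A minimal Python analogue of the kind of MATLAB one-liner the authors
# describe (e.g., examining abs(diff(data))); illustrative only.
def one_liner_detector(series):
    """Return the index with the largest absolute one-step difference."""
    # +1 because np.diff shortens the series by one sample
    return int(np.argmax(np.abs(np.diff(series)))) + 1

# Toy usage: a smooth sine wave with a single injected spike.
t = np.linspace(0, 20 * np.pi, 4000)
data = np.sin(t)
data[2500] += 5.0                     # injected point anomaly
print(one_liner_detector(data))       # prints an index at (or adjacent to) 2500
```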

Another issue is "unrealistic anomaly density": in many benchmark series the anomalies are either excessively contiguous or occur too frequently, diverging from real-world settings in which anomalies are, by definition, rare. This mismatch undermines a basic premise of the anomaly detection task and complicates the assessment of algorithm performance.

"Mislabeled ground truth" and "run-to-failure bias" further complicate the scenario. The paper provides instances of datasets where anomalies are incorrectly labeled, potentially leading to misleading performance comparisons between algorithms. Run-to-failure bias indicates that anomalies tend to appear near the end of datasets, skewing results towards algorithms that label the final datapoints as anomalies by default.

The UCR Time Series Anomaly Archive

In response to these issues, the authors introduce the UCR Time Series Anomaly Archive. This new resource aims to address the aforementioned flaws by providing datasets that range in difficulty and include only one anomaly per test series. The archive is diverse, spanning domains such as medicine, industry, and robotics, and ensures datasets are free from labeling errors by using out-of-band data for validation.
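
A minimal sketch of how a one-anomaly-per-series benchmark can be scored is shown below: the detector proposes a single location, which counts as correct if it lands inside the labeled anomaly region (expanded by a small tolerance). The tolerance-based rule and the numeric values are assumptions for illustration, not the archive's official evaluation code.

```python
# Simplified scoring for a one-anomaly-per-series benchmark in the spirit of
# the UCR Anomaly Archive. The tolerance value and exact rule are assumptions.
def score_single_prediction(pred_idx, anomaly_begin, anomaly_end, tolerance=100):
    """Return True if the predicted index falls within the labeled region
    extended by `tolerance` points on each side."""
    return (anomaly_begin - tolerance) <= pred_idx <= (anomaly_end + tolerance)

# Example: detector points to index 52710 for an anomaly labeled [52600, 52800].
print(score_single_prediction(52710, 52600, 52800))   # True
```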

Implications and Recommendations

The implications of this paper are significant for the field of anomaly detection. Researchers should approach current benchmarks with skepticism and consider transitioning to new datasets that offer more rigorous tests, such as the UCR Anomaly Archive. Additionally, time series anomaly detection algorithms should be evaluated based on their invariances, which will aid practitioners in selecting the appropriate method for specific data characteristics.

The paper suggests abandoning flawed benchmarks and calls for clearer communication of algorithmic invariances. Researchers should visualize data and algorithm outputs to better understand algorithm behavior across different datasets. The authors also encourage revisiting the assumption that deep learning is the default solution for anomaly detection tasks, advocating for the reconsideration of simpler, effective methods.

Conclusion

Wu and Keogh’s work is pivotal in reshaping the narrative around time series anomaly detection benchmarks. It challenges the community to critically reassess how anomaly detection progress is reported and encourages the use of more reliable datasets and evaluation strategies. Ultimately, the paper serves as a catalyst for future research that strives toward genuine advancements in the field.
