Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World (2311.10421v2)
Abstract: Anomaly detection techniques are essential for automating the monitoring of IT systems and operations. In these techniques, machine learning algorithms are trained on operational data from a specific period of time and then continuously evaluated on newly emerging data. Operational data changes constantly over time, which affects the performance of deployed anomaly detection models. Continuous model maintenance is therefore required to preserve the performance of anomaly detectors. In this work, we analyze two anomaly detection model maintenance techniques that differ in model update frequency: blind model retraining and informed model retraining. We further investigate the effects of retraining the model on all available data (full-history approach) versus only the newest data (sliding window approach). Moreover, we investigate whether a data change monitoring tool can determine when the anomaly detection model needs to be updated through retraining.
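To make the terminology in the abstract concrete, the following is a minimal Python sketch of the two design axes it names: which data a retrained model sees (full-history versus sliding window) and when retraining is triggered (blind versus informed). It is not the authors' implementation. All function names, the fixed retraining period, and the simple mean-shift check are hypothetical stand-ins chosen for illustration; in particular, the mean-shift check only plays the role of a data change monitoring tool and is not the tool studied in the paper.

```python
from typing import Callable, List, Optional, Sequence


def training_data(history: Sequence[float], strategy: str, window_size: int = 0) -> List[float]:
    """Select the data a retrained model would be fitted on (hypothetical helper)."""
    if strategy == "full_history":
        # Full-history approach: retrain on everything observed so far.
        return list(history)
    if strategy == "sliding_window":
        # Sliding window approach: retrain only on the newest observations.
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        return list(history[-window_size:])
    raise ValueError(f"unknown strategy: {strategy}")


def mean_shift(reference: Sequence[float], current: Sequence[float], threshold: float = 1.0) -> bool:
    """Toy change check (assumption): flag a change when the batch mean moves by more than threshold."""
    ref_mean = sum(reference) / len(reference)
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) > threshold


def retraining_steps(
    batches: Sequence[Sequence[float]],
    policy: str,
    period: int = 2,
    change_detected: Optional[Callable[[Sequence[float], Sequence[float]], bool]] = None,
) -> List[int]:
    """Return the indices of the batches after which the model would be retrained."""
    steps: List[int] = []
    reference = batches[0]  # data the current model was last trained on
    for i in range(1, len(batches)):
        if policy == "blind":
            # Blind retraining: update on a fixed schedule, regardless of the data.
            retrain = i % period == 0
        elif policy == "informed":
            # Informed retraining: update only when the monitor signals a change.
            if change_detected is None:
                raise ValueError("informed policy needs a change_detected function")
            retrain = change_detected(reference, batches[i])
        else:
            raise ValueError(f"unknown policy: {policy}")
        if retrain:
            steps.append(i)
            reference = batches[i]
    return steps


# Toy stream of five batches with a level shift starting at batch index 2.
batches = [[1, 1, 1], [1, 1, 1], [5, 5, 5], [5, 5, 5], [5, 5, 5]]
blind_steps = retraining_steps(batches, "blind", period=2)
informed_steps = retraining_steps(batches, "informed", change_detected=mean_shift)
```

On this toy stream the blind policy retrains on its fixed schedule whether or not the data changed, while the informed policy retrains only at the batch where the level shift appears. Either policy can be combined with either choice in `training_data`.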