LogSD: Detecting Anomalies from System Logs through Self-supervised Learning and Frequency-based Masking (2404.11294v2)
Abstract: Log analysis is one of the main techniques engineers use to troubleshoot large-scale software systems. Over the years, many supervised, semi-supervised, and unsupervised methods have been proposed to detect system anomalies by analyzing system logs. Among these, semi-supervised methods have attracted increasing attention because, compared with their supervised and unsupervised counterparts, they balance relaxed labeled-data requirements against strong detection performance. However, existing semi-supervised methods overlook the bias that highly frequent log messages can introduce into the learned normal patterns, which leads to less-than-satisfactory performance. In this study, we propose LogSD, a novel semi-supervised approach based on self-supervised learning. LogSD employs a dual-network architecture and incorporates a frequency-based masking scheme, a global-to-local reconstruction paradigm, and three self-supervised learning tasks. These features enable LogSD to focus more on relatively infrequent log messages and thereby learn less biased and more discriminative patterns from historical normal data, which ultimately improves anomaly detection performance. Extensive experiments on three commonly used datasets show that LogSD significantly outperforms eight state-of-the-art benchmark methods.
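The abstract does not spell out how the frequency-based masking scheme works, so the following is only a minimal sketch of the general idea, not LogSD's actual procedure. It assumes that log messages have already been parsed into event IDs, that frequency is measured as relative frequency over the normal training sequences, and that events above a fixed threshold are replaced by a placeholder token. The names `MASK`, `event_frequencies`, `mask_frequent_events`, and the threshold value are all hypothetical.

```python
from collections import Counter

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def event_frequencies(sequences):
    """Relative frequency of each event ID over all training sequences."""
    counts = Counter(event for seq in sequences for event in seq)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def mask_frequent_events(sequence, freqs, threshold=0.3):
    """Replace events whose relative frequency exceeds `threshold` with MASK.

    The threshold is an illustrative assumption; unseen events are kept.
    """
    return [MASK if freqs.get(e, 0.0) > threshold else e for e in sequence]


# Toy "normal" training data: E1 dominates, E2 and E3 are infrequent.
normal_sequences = [
    ["E1", "E1", "E2", "E1"],
    ["E1", "E3", "E1", "E1"],
    ["E1", "E2", "E1", "E1"],
]

freqs = event_frequencies(normal_sequences)
masked = mask_frequent_events(normal_sequences[0], freqs)
print(masked)
```

In this toy data the dominant event `E1` is masked and the rarer `E2` is left visible, which illustrates how a downstream reconstruction objective could be steered toward infrequent log messages.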
Authors: Yongzheng Xie, Hongyu Zhang, Muhammad Ali Babar