
On the Effectiveness of Log Representation for Log-based Anomaly Detection (2308.08736v3)

Published 17 Aug 2023 in cs.SE and cs.LG

Abstract: Logs are an essential source of information for understanding the running status of a software system. As modern software architectures and maintenance methods evolve, more research effort has been devoted to automated log analysis. In particular, machine learning (ML) has been widely used in log analysis tasks. In ML-based log analysis, converting textual log data into numerical feature vectors is a critical and indispensable step. However, the impact of different log representation techniques on the performance of downstream models is not clear, which limits the ability of researchers and practitioners to choose the optimal log representation technique for their automated log analysis workflows. Therefore, this work investigates and compares the log representation techniques commonly adopted in previous log analysis research. Specifically, we select six log representation techniques and evaluate them with seven ML models and four public log datasets (i.e., HDFS, BGL, Spirit, and Thunderbird) in the context of log-based anomaly detection. We also examine the impact of the log parsing process and of different feature aggregation approaches when they are used with log representation techniques. From the experiments, we provide heuristic guidelines for future researchers and developers to follow when designing an automated log analysis workflow. We believe our comprehensive comparison can help researchers and practitioners better understand the characteristics of different log representation techniques and guide them in selecting the most suitable ones for their ML-based log analysis workflows.
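To make the step of "converting textual log data into numerical feature vectors" concrete, here is a minimal sketch of one possible representation: TF-IDF weights per log message, averaged into one vector per session. It is an illustration only and not the paper's implementation. The abstract does not name the six techniques it evaluates, so TF-IDF and mean aggregation are assumptions chosen for simplicity. The sample log lines and every function name (`tokenize`, `fit_idf`, `tfidf_vector`, `aggregate_mean`) are hypothetical.

```python
import math
from collections import Counter

def tokenize(message):
    # Naive whitespace tokenizer (assumption; real pipelines often parse logs into templates first).
    return message.lower().split()

def fit_idf(messages):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(messages)
    doc_freq = Counter()
    for message in messages:
        doc_freq.update(set(tokenize(message)))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def tfidf_vector(message, idf, vocab):
    # Term frequency (relative count within the message) times IDF, in vocab order.
    counts = Counter(tokenize(message))
    total = sum(counts.values()) or 1
    return [counts[tok] / total * idf[tok] for tok in vocab]

def aggregate_mean(vectors):
    # Feature aggregation: average the message vectors of one session.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical log messages forming one session (not taken from any dataset).
logs = [
    "Receiving block blk_1 src /10.0.0.1",
    "PacketResponder 1 for block blk_1 terminating",
    "Exception in receiveBlock for block blk_1",
]

idf = fit_idf(logs)
vocab = sorted(idf)
message_vectors = [tfidf_vector(m, idf, vocab) for m in logs]
session_vector = aggregate_mean(message_vectors)
print(len(vocab), len(session_vector))
```

The resulting `session_vector` is the kind of fixed-length numerical input a downstream anomaly detection model could consume. Tokens that appear in fewer messages (such as `exception` here) receive a larger IDF weight than tokens that appear in every message (such as `block`).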

Authors (3)
  1. Xingfang Wu (6 papers)
  2. Heng Li (138 papers)
  3. Foutse Khomh (140 papers)
Citations (8)
