Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging (2402.18205v2)
Abstract: Logs produced by extensive software systems are integral to monitoring system behaviors. Advanced log analysis facilitates the detection, alerting, and diagnosis of system faults. Log parsing, which entails transforming raw log messages into structured templates, constitutes a critical phase in the automation of log analytics. Existing log parsers fail to identify the correct templates because they rely on human-made rules. Moreover, these methods focus on statistical features while ignoring semantic information in log messages. To address these challenges, we introduce a cutting-edge Log parsing framework with Entropy sampling and Chain-of-Thought Merging (Lemur). Specifically, to discard tedious manual rules, we propose a novel sampling method inspired by information entropy, which efficiently clusters typical logs. Furthermore, to enhance the merging of log templates, we design a chain-of-thought method for LLMs. LLMs exhibit exceptional semantic comprehension, deftly distinguishing between parameters and invariant tokens. We have conducted experiments on large-scale public datasets. Extensive evaluation demonstrates that Lemur achieves state-of-the-art performance and impressive efficiency.
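To make the notions of "template", "parameter", and "information entropy" concrete, the following is a minimal illustrative sketch and not the Lemur algorithm itself. It assumes whitespace tokenization, log messages with the same number of tokens, and a hypothetical rule that any token position with non-zero Shannon entropy is a parameter. The function names (`shannon_entropy`, `position_entropies`, `naive_template`) and the sample messages are invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # Sum p * log2(1/p) over the distinct values; the result is never negative.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def position_entropies(logs):
    """Entropy of the tokens observed at each position.

    Assumption: every message splits into the same number of tokens.
    """
    tokenized = [line.split() for line in logs]
    return [shannon_entropy(column) for column in zip(*tokenized)]


def naive_template(logs, threshold=0.0):
    """Replace positions whose entropy exceeds `threshold` with the wildcard <*>.

    This is a purely statistical heuristic (an assumption of this sketch);
    it ignores the semantic information the abstract says LLMs contribute.
    """
    first = logs[0].split()
    entropies = position_entropies(logs)
    return " ".join(
        "<*>" if h > threshold else token
        for token, h in zip(first, entropies)
    )


if __name__ == "__main__":
    sample_logs = [
        "Connection from 10.0.0.1 closed",
        "Connection from 10.0.0.2 closed",
        "Connection from 10.0.0.3 closed",
        "Connection from 10.0.0.1 closed",
    ]
    print(position_entropies(sample_logs))
    print(naive_template(sample_logs))
```

In this toy setting, positions that never vary have zero entropy and are kept as invariant tokens, while the varying address field is replaced by a wildcard. A purely statistical rule like this one misclassifies constant-looking parameters and variable-looking keywords, which is the kind of limitation the abstract attributes to methods that ignore semantics.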