LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing (2404.18001v1)
Abstract: Logs record important runtime information in modern software development. Log parsing, which extracts structured information from unstructured log data, is the first step in many log-based analyses. Traditional log parsers struggle to parse logs accurately because of the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using LLMs for log parsing and propose LLMParser, a log parser based on generative LLMs and few-shot tuning. We leverage four LLMs in LLMParser: Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (96% average parsing accuracy). We further conduct a comprehensive empirical analysis of the effect of training size, model size, and pre-training on log parsing accuracy. We find that smaller LLMs may be more effective than more complex ones; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy: while pre-trained Flan-T5-base shows an improvement in accuracy, pre-trained LLaMA results in a decrease (by almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.
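For readers unfamiliar with the task, below is a minimal sketch of the core idea behind LLMParser: few-shot tuning a generative LLM to translate raw log messages into templates in which variable values are replaced by a placeholder. It is not the authors' implementation; the Hugging Face transformers API usage, the prompt wording, the `<*>` placeholder convention, and the HDFS-style example logs are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of few-shot tuning a
# generative LLM for log parsing. Assumes the Hugging Face `transformers`
# library and PyTorch; the prompt wording, <*> placeholder convention, and
# HDFS-style example logs are illustrative, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"  # smallest of the four LLMs studied
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A handful of labeled (log, template) pairs: dynamic values such as block
# IDs, sizes, and IP addresses are abstracted into the <*> wildcard.
instruction = "Extract the log template by replacing variable values with <*>.\nLog: "
train_pairs = [
    ("Received block blk_3587 of size 67108864 from /10.251.42.84",
     "Received block <*> of size <*> from <*>"),
    ("Deleting block blk_9212 file /mnt/hadoop/dfs/data/blk_9212",
     "Deleting block <*> file <*>"),
    ("PacketResponder 1 for block blk_4003 terminating",
     "PacketResponder <*> for block <*> terminating"),
]

# Few-shot tuning: a plain supervised fine-tuning loop over the few samples.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(10):
    for log, template in train_pairs:
        inputs = tokenizer(instruction + log, return_tensors="pt")
        labels = tokenizer(template, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: the tuned model generates a template for an unseen log line.
model.eval()
query = tokenizer(instruction + "Verification succeeded for blk_1608",
                  return_tensors="pt")
with torch.no_grad():
    out = model.generate(**query, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Ideally prints something like: Verification succeeded for <*>
```

Grouping log messages whose generated templates are identical then yields the log groups on which metrics such as the group accuracy reported in the abstract are computed.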