X-lifecycle Learning for Cloud Incident Management using LLMs (2404.03662v1)
Abstract: Incident management for large cloud services is a complex and tedious process that requires a significant amount of manual effort from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle (SDLC) to generate insights for detecting, root-causing, and mitigating incidents. Recent advancements in large language models (LLMs) have created opportunities to automatically generate contextual recommendations that assist OCEs in quickly identifying and mitigating critical issues. However, existing research typically takes a siloed view, solving a specific incident-management task by leveraging data from a single stage of the SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves the performance of two critically important and practically challenging tasks: (1) automatically generating root-cause recommendations for dependency-failure-related incidents, and (2) identifying the ontology of service monitors used for automatically detecting incidents. Using a dataset of 353 incidents and 260 monitors from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves performance over state-of-the-art methods.
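The core idea of augmenting an LLM query with contextual data from multiple SDLC stages can be sketched as follows. This is a hypothetical minimal illustration, not the paper's implementation: the function name, the incident fields, and the example SDLC artifacts (a dependency graph and monitor metadata) are all assumptions made for demonstration.

```python
# Hypothetical sketch: assemble an LLM prompt for root-cause recommendation
# by augmenting the incident report with context from different SDLC stages
# (e.g., a service dependency graph and monitor metadata). Field names and
# context stages are illustrative assumptions, not the paper's schema.

def build_augmented_prompt(incident: dict, sdlc_context: dict) -> str:
    """Build a prompt from an incident plus per-SDLC-stage context sections."""
    sections = [
        "Task: Recommend a root cause for the dependency-failure incident below.",
        f"Incident title: {incident['title']}",
        f"Incident summary: {incident['summary']}",
    ]
    # Append each available SDLC artifact as its own labeled context section,
    # so the model can condition on data beyond the incident report itself.
    for stage, artifact in sdlc_context.items():
        sections.append(f"Context ({stage}): {artifact}")
    sections.append("Root cause recommendation:")
    return "\n".join(sections)


incident = {
    "title": "Elevated 5xx rate on checkout service",
    "summary": "Checkout calls to the payment gateway are timing out.",
}
sdlc_context = {
    "dependency graph": "checkout -> payment-gateway -> token-vault",
    "monitor ontology": "availability monitor on payment-gateway p99 latency",
}
prompt = build_augmented_prompt(incident, sdlc_context)
print(prompt)
```

The assembled prompt would then be sent to an LLM (e.g., GPT-4) in place of a prompt built from the incident report alone; the paper's finding is that such cross-stage augmentation improves recommendation quality.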