Exploring LLM-based Agents for Root Cause Analysis (2403.04123v1)
Abstract: The growing complexity of cloud-based software systems has made incident management an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automating RCA can save significant time and ease the burden of incident management on on-call engineers. Recently, researchers have used LLMs to perform RCA and have demonstrated promising results. However, these approaches cannot dynamically collect additional diagnostic information such as incident-related logs, metrics, or databases, which severely restricts their ability to diagnose root causes. In this work, we explore the use of LLM-based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with substantially higher factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to the external diagnostic services the team uses for manual RCA. Our results show how agents can overcome the limitations of prior work and highlight practical considerations for implementing such a system.
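To make the idea of a ReAct agent equipped with a retrieval tool concrete, the following is a minimal sketch and not the authors' implementation. It assumes a text protocol of `Action: tool[input]` and `Finish[answer]`, a toy word-overlap retriever, and a scripted stand-in for the language model; the names `search_incidents`, `react_agent`, and `scripted_llm`, as well as the sample incidents, are hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical historical incidents for illustration; not from the paper's dataset.
HISTORICAL_INCIDENTS = [
    "Disk full on storage node caused write failures",
    "Expired TLS certificate caused connection failures to the gateway",
]

def search_incidents(query: str) -> str:
    """Toy retrieval tool: return the past incident sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(
        HISTORICAL_INCIDENTS,
        key=lambda doc: len(query_words & set(doc.lower().split())),
    )

# Assumed text protocol for the model's output.
ACTION_PATTERN = re.compile(r"Action: (\w+)\[(.*)\]")
FINISH_PATTERN = re.compile(r"Finish\[(.*)\]")

def react_agent(incident: str, llm: Callable[[str], str],
                tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Alternate model steps (thought + action) with tool observations until Finish."""
    transcript = f"Incident: {incident}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        finish = FINISH_PATTERN.search(step)
        if finish:
            return finish.group(1)
        action = ACTION_PATTERN.search(step)
        if action and action.group(1) in tools:
            observation = tools[action.group(1)](action.group(2))
        else:
            observation = "Invalid action."
        # Feed the tool result back so the next model step can reason over it.
        transcript += f"Observation: {observation}\n"
    return "No root cause found within the step budget."

def scripted_llm(transcript: str) -> str:
    """Stand-in for a real model: searches first, then answers from the observation."""
    if "Observation:" not in transcript:
        return ("Thought: look for similar past incidents.\n"
                "Action: search[connection failures gateway]")
    last_observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return ("Thought: the retrieved incident matches.\n"
            f"Action: Finish[{last_observation}]")

if __name__ == "__main__":
    print(react_agent("Users cannot connect to gateway",
                      scripted_llm, {"search": search_incidents}))
```

In a real deployment the scripted function would be replaced by a call to an actual model, and the `tools` dictionary would hold retrievers or diagnostic services; the loop itself stays the same, which is what lets the agent gather information dynamically rather than rely on a fixed input.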