Dependency Aware Incident Linking in Large Cloud Systems (2403.18639v1)
Abstract: Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects that create several related incidents across different dependent services. Oftentimes, on-call engineers (OCEs) examine these incidents in silos, which leads to a significant amount of manual toil and increases the overall time to mitigate incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to leverage the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links, not only among incidents from the same service but also across different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method has an F1-score of 0.96 (14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs, improve incident management, and reduce manual toil.
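To make the alignment step mentioned in the abstract more concrete, the following is a minimal sketch of the classical Orthogonal Procrustes solution, not the authors' implementation. It assumes that both embedding sets have the same dimensionality and that row i of each matrix describes the same incident. The names `text_emb`, `graph_emb`, and `orthogonal_procrustes`, as well as the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix R that minimizes ||A @ R - B||_F."""
    # The closed-form solution takes the SVD of the cross-covariance A^T B
    # and discards the singular values: R = U V^T.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


# Synthetic stand-ins for the two modalities (assumption: same dimension, 8).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(50, 8))  # hypothetical textual embeddings

# Hypothetical graph embeddings: a rotated copy of the text embeddings plus noise.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
graph_emb = text_emb @ Q + 0.01 * rng.normal(size=(50, 8))

# Map the textual embeddings into the graph embedding space.
R = orthogonal_procrustes(text_emb, graph_emb)
aligned = text_emb @ R

print("misalignment before:", np.linalg.norm(text_emb - graph_emb))
print("misalignment after: ", np.linalg.norm(aligned - graph_emb))
```

Because R is constrained to be orthogonal, the mapping preserves distances and angles within the textual embedding space while rotating it toward the graph embedding space. How the aligned embeddings are then combined for link prediction is not described in the abstract and is left out of this sketch.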