KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks (2402.13264v1)
Abstract: Fault localization is challenging in online microservice systems because of the volume and variety of monitoring data and events, and the complex interdependencies among services and components. Fault events in services propagate and can trigger a cascade of alerts within a short period of time. In industry, fault localization is typically conducted manually by experienced personnel. This reliance on experience is unreliable and lacks automation. Information barriers between modules during manual localization make it difficult for teams to align quickly during urgent faults. This inefficiency undermines stability assurance, whose goal is to minimize fault detection and repair time. Although actionable methods have aimed to automate the process, their accuracy and efficiency are less than satisfactory. The precision of fault localization results is of paramount importance because it underpins engineers' trust in the diagnostic conclusions, which are derived from multiple perspectives and offer comprehensive insights. Therefore, a more reliable method is required to automatically identify the associative relationships among fault events and their propagation paths. To achieve this, KGroot uses event knowledge and the correlation between events to perform root cause reasoning, integrating knowledge graphs and graph convolutional neural networks (GCNs) for root cause analysis (RCA). A fault event knowledge graph (FEKG) is built from historical data, an online graph is constructed in real time when a failure event occurs, and GCNs compare the similarity between each knowledge graph and the online graph to pinpoint the fault type through a ranking strategy. Comprehensive experiments demonstrate that KGroot can locate the root cause within the top 3 potential causes with 93.5% accuracy, in a matter of seconds. This performance matches the level of real-time fault diagnosis required in industrial environments and significantly surpasses state-of-the-art RCA baselines in both effectiveness and efficiency.
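The compare-and-rank step described in the abstract can be illustrated with a minimal sketch. It is not KGroot's implementation: it assumes one-hot event-type features, a single untrained GCN propagation step with no learned weights, mean pooling, and cosine similarity, whereas the paper trains GCNs to score similarity. All names (`gcn_layer`, `embed_graph`, `rank_fault_types`) and the toy event types are hypothetical.

```python
import math

def gcn_layer(adj, feats):
    """One untrained GCN propagation step: D^-1/2 (A + I) D^-1/2 X, then ReLU.
    Assumption: no learned weight matrix, for illustration only."""
    n, dim = len(adj), len(feats[0])
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            if a_hat[i][j]:
                w = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k in range(dim):
                    row[k] += w * feats[j][k]
        out.append([max(0.0, v) for v in row])
    return out

def embed_graph(graph, vocab):
    """Embed an event graph: one-hot event types -> GCN step -> mean pooling."""
    events, edges = graph["events"], graph["edges"]
    n = len(events)
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0  # assumption: treat edges as undirected
    feats = [[1.0 if e == name else 0.0 for name in vocab] for e in events]
    hidden = gcn_layer(adj, feats)
    return [sum(col) / n for col in zip(*hidden)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_fault_types(online, history, vocab, top_k=3):
    """Rank historical fault graphs by similarity to the online graph."""
    query = embed_graph(online, vocab)
    scored = [(cosine(query, embed_graph(g, vocab)), label) for label, g in history.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:top_k]]

# Toy data (hypothetical event types and fault labels).
vocab = ["cpu_high", "latency_up", "disk_full", "db_timeout"]
history = {
    "cpu_exhaustion": {"events": ["cpu_high", "latency_up"], "edges": [(0, 1)]},
    "disk_pressure": {"events": ["disk_full", "db_timeout", "latency_up"],
                      "edges": [(0, 1), (1, 2)]},
}
online = {"events": ["disk_full", "db_timeout"], "edges": [(0, 1)]}
ranking = rank_fault_types(online, history, vocab)
print(ranking)
```

Returning a ranked list rather than a single label mirrors the top-3 evaluation reported in the abstract: an engineer inspects a few candidate fault types instead of trusting one prediction.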
- Tingting Wang
- Guilin Qi
- Tianxing Wu