A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends (2408.00803v1)
Abstract: The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques within microservices, exploring methodologies that include metrics, traces, logs, and multi-modal data. It delves deeper into the methodologies, challenges, and future trends within microservices architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.