RCInvestigator: Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems (2405.15571v1)
Abstract: Finding the root causes of anomalies in cloud computing systems quickly is crucial to ensuring availability and efficiency, since accurate root causes guide engineers to take appropriate actions to address the anomalies and maintain customer satisfaction. However, it is difficult to investigate and identify root causes from the large-scale, high-dimensional monitoring data collected from complex cloud computing environments. Because cloud computing systems are inherently dynamic, existing approaches in practice rely largely on manual analysis for flexibility and reliability, but massive unpredictable factors and high data complexity make the process time-consuming. Despite recent advances in automated detection and investigation approaches, the speed and quality of root cause analysis remain limited by the lack of expert involvement in these approaches. These limitations motivate us to propose a visual analytics approach that facilitates interactive investigation of anomaly root causes in cloud computing systems. We identified three challenges: a) modeling databases for root cause investigation, b) inferring root causes from large-scale time series, and c) building comprehensible investigation results. In collaboration with domain experts, we addressed these challenges with RCInvestigator, a novel visual analytics system that establishes a tight collaboration between human and machine and assists experts in investigating the root causes of cloud computing system anomalies. We evaluated the effectiveness of RCInvestigator through two use cases based on real-world data and received positive feedback from experts.