RCInvestigator: Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems (2405.15571v1)

Published 24 May 2024 in cs.HC

Abstract: Finding the root causes of anomalies in cloud computing systems quickly is crucial to ensure availability and efficiency, since accurate root causes can guide engineers to take appropriate actions to address the anomalies and maintain customer satisfaction. However, it is difficult to investigate and identify the root causes based on large-scale and high-dimensional monitoring data collected from complex cloud computing environments. Due to the inherently dynamic characteristics of cloud computing systems, the existing approaches in practice largely rely on manual analyses for flexibility and reliability, but numerous unpredictable factors and high data complexity make the process time-consuming. Despite recent advances in automated detection and investigation approaches, the speed and quality of root cause analyses remain limited by the lack of expert involvement in these approaches. The limitations found in the current solutions motivate us to propose a visual analytics approach that facilitates the interactive investigation of the anomaly root causes in cloud computing systems. We identified three challenges, namely, a) modeling databases for the root cause investigation, b) inferring root causes from large-scale time series, and c) building comprehensible investigation results. In collaboration with domain experts, we addressed these challenges with RCInvestigator, a novel visual analytics system that establishes a tight collaboration between humans and machines and assists experts in investigating the root causes of cloud computing system anomalies. We evaluated the effectiveness of RCInvestigator through two use cases based on real-world data and received positive feedback from experts.
