
mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture (2404.12135v3)

Published 18 Apr 2024 in cs.MA, cs.CR, and cs.DC

Abstract: Root cause analysis (RCA) in micro-services architecture (MSA), whose complexity keeps escalating, faces substantial challenges in maintaining system stability and efficiency due to fault propagation and circular dependencies among nodes. Diverse faults in root cause analysis require multiple agents with diverse expertise. To mitigate the hallucination problem of LLMs, we design blockchain-inspired voting to ensure the reliability of the analysis through a decentralized decision-making process. To avoid non-terminating loops caused by the circular dependencies common in MSA, we objectively limit steps and standardize task processing through Agent Workflow. We propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), where multiple agents based on powerful LLMs follow Agent Workflow and collaborate in blockchain-inspired voting. Specifically, seven specialized agents derived from Agent Workflow each provide valuable insights towards root cause analysis based on their expertise and the intrinsic software knowledge of LLMs, collaborating within a decentralized chain. Our experiments on the AIOps challenge dataset and a newly created Train-Ticket dataset demonstrate superior performance in identifying root causes and generating effective resolutions. The ablation study further highlights that Agent Workflow, multi-agent collaboration, and blockchain-inspired voting are crucial for achieving optimal performance. mABC offers comprehensive automated root cause analysis and resolution in micro-services architecture and significantly benefits the IT operations domain. The code and dataset are available at https://github.com/zwpride/mABC.


Summary

  • The paper introduces mABC, a novel framework integrating multi-agent systems and blockchain-inspired voting to enhance root cause analysis in micro-services.
  • It leverages LLMs to empower specialized agents and uses decentralized voting based on historical accuracy for robust decision-making.
  • Experimental results on AIOps and Train-Ticket datasets demonstrate improved fault detection and propagation tracing compared to traditional methods.

mABC: A Multi-Agent Blockchain-Inspired Approach for Root Cause Analysis

Introduction

The research paper "mABC: Multi-Agent Blockchain-Inspired Collaboration for Root Cause Analysis in Micro-Services Architecture" explores the complexities of identifying root causes within micro-services architectures (MSA). As these distributed systems grow, maintaining their stability and efficiency becomes increasingly challenging. The paper introduces mABC, a framework leveraging multi-agent systems and LLMs, inspired by blockchain governance, to facilitate accurate and efficient root cause analysis (RCA).

Framework Overview

Multi-Agent and Blockchain-Inspired Collaboration

mABC employs a multi-agent approach in which each agent specializes in a particular aspect of RCA. The decentralized decision-making process is reminiscent of blockchain governance: voting exploits the transparency and egalitarian structure inherent in blockchain technology, and each agent's vote is weighted by its contribution and expertise indices.

Figure 1: Example of root cause analysis in micro-services architecture (an alert event arises on node A while the root cause is node I, with fault propagation path I → G → D → A).
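To make the fault-propagation setting concrete, below is a minimal sketch (not the paper's algorithm) of how candidate propagation paths could be enumerated over a service dependency graph. The node names mirror the Figure 1 example; the graph, the `candidate_paths` helper, and the depth bound are illustrative assumptions.

```python
# Hypothetical sketch: enumerate candidate fault-propagation paths in a
# service dependency graph, mirroring the Figure 1 example (alert on A,
# root cause I, propagation path I -> G -> D -> A).
from collections import deque

# Edges point from a service to the services it depends on, so a fault can
# propagate from a dependency back up to the alerting node.
DEPENDS_ON = {
    "A": ["B", "D"],
    "B": ["C"],
    "D": ["E", "G"],
    "G": ["H", "I"],
    "I": [],
}

def candidate_paths(alert_node: str, max_depth: int = 5):
    """Yield dependency chains from potential root causes up to the alerting
    node, bounded by max_depth to cope with circular dependencies."""
    queue = deque([[alert_node]])
    while queue:
        path = queue.popleft()
        deps = DEPENDS_ON.get(path[-1], [])
        if not deps or len(path) >= max_depth:
            yield list(reversed(path))      # root cause first, alert node last
            continue
        for dep in deps:
            if dep not in path:             # skip already-visited (cyclic) nodes
                queue.append(path + [dep])

if __name__ == "__main__":
    for p in candidate_paths("A"):
        print(" -> ".join(p))               # includes I -> G -> D -> A
```

Bounding the search depth and skipping nodes already on the path reflects the paper's emphasis on avoiding non-terminating loops caused by circular dependencies.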

Agent Workflow

The pipeline introduces specialized agents such as the Alert Receiver and Process Scheduler, which manage task elections and scheduling. The agents operate on a standardized workflow, distinguishing between simple direct responses and complex iterative processes for RCA.
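The distinction between the two answer modes can be sketched as follows, assuming a generic `call_llm` completion function and a `run_tool` executor; both are hypothetical placeholders rather than the repository's API, and the prompts and step budget are assumptions.

```python
# Minimal sketch of the two workflow styles: a direct answer for simple
# queries versus a bounded ReAct-style loop (thought -> action -> observation)
# for complex RCA sub-tasks.
from typing import Callable

MAX_STEPS = 8  # hard step limit, mirroring the paper's bounded iteration

def direct_answer(call_llm: Callable[[str], str], task: str) -> str:
    return call_llm(f"Answer directly and concisely:\n{task}")

def react_answer(call_llm: Callable[[str], str],
                 run_tool: Callable[[str], str],
                 task: str) -> str:
    context = f"Task: {task}"
    for _ in range(MAX_STEPS):
        thought = call_llm(f"{context}\nThought: what should be done next?")
        action = call_llm(f"{context}\nThought: {thought}\nAction (tool call):")
        if action.strip().lower().startswith("finish"):
            return call_llm(f"{context}\nGive the final answer.")
        observation = run_tool(action)      # e.g. query metrics, traces, logs
        context += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    # Step budget exhausted: force a best-effort answer instead of looping forever.
    return call_llm(f"{context}\nStep limit reached. Give the best answer so far.")
```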

Technical Workflow

The typical workflow begins with alert detection, followed by prioritization by the Alert Receiver. The Process Scheduler then deconstructs the problem and distributes tasks to specialized agents such as the Data Detective and Dependency Explorer, each following one of two distinct workflows (Figure 2); the overall pipeline is summarized in Figure 3.

Figure 3: Overview of mABC. The overall pipeline encapsulates the flow from alert inception to root cause analysis.
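Under the assumption that each specialized agent exposes a simple `handle(subtask)` method, the dispatch step could look roughly like the sketch below. The class names, the hard-coded decomposition, and the `EchoAgent` stand-in are illustrative; in mABC the decomposition itself is LLM-driven.

```python
# Hedged sketch of the alert -> prioritization -> decomposition -> dispatch flow.
from dataclasses import dataclass, field

@dataclass(order=True)
class Alert:
    priority: int                           # lower number = more urgent
    description: str = field(compare=False)

class EchoAgent:
    """Stand-in for an LLM-backed agent; real agents would call an LLM and tools."""
    def __init__(self, name: str):
        self.name = name
    def handle(self, subtask: str) -> str:
        return f"[{self.name}] handled: {subtask}"

class ProcessScheduler:
    def __init__(self, agents: dict):
        self.agents = agents

    def decompose(self, alert: Alert) -> list:
        # In mABC this decomposition is LLM-driven; here it is hard-coded.
        return [
            ("data_detective", f"Collect metrics and logs around: {alert.description}"),
            ("dependency_explorer", f"Map upstream dependencies for: {alert.description}"),
            ("probability_oracle", f"Rank candidate root causes for: {alert.description}"),
            ("solution_engineer", f"Draft a remediation for: {alert.description}"),
        ]

    def run(self, alerts: list) -> dict:
        results = {}
        for alert in sorted(alerts):        # Alert Receiver-style prioritization
            for agent_name, subtask in self.decompose(alert):
                results[(alert.description, agent_name)] = self.agents[agent_name].handle(subtask)
        return results

if __name__ == "__main__":
    names = ("data_detective", "dependency_explorer", "probability_oracle", "solution_engineer")
    out = ProcessScheduler({n: EchoAgent(n) for n in names}).run(
        [Alert(2, "latency spike on node A"), Alert(1, "error rate surge on node D")])
    for key, value in out.items():
        print(key, "->", value)
```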

Methodology

Agent Capabilities

The agents utilize LLMs to interpret and act on information. For example, the Probability Oracle evaluates failure probabilities, while the Solution Engineer synthesizes resolutions. Such capabilities ensure the framework's thoroughness in addressing complex RCAs.
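As a hedged illustration of two of these roles, the sketch below shows how a Probability Oracle might ask an LLM to score candidate root-cause nodes and how a Solution Engineer might turn the top candidate into a remediation plan. The prompts, the JSON output contract, and the `call_llm` callable are assumptions made for illustration only.

```python
# Illustrative role sketches; not the paper's actual prompts or interfaces.
import json
from typing import Callable

def probability_oracle(call_llm: Callable[[str], str],
                       candidates: list, evidence: str) -> dict:
    prompt = (
        "Given the evidence below, return a JSON object mapping each candidate "
        f"service to its failure probability in [0, 1].\nCandidates: {candidates}\n"
        f"Evidence:\n{evidence}"
    )
    scores = json.loads(call_llm(prompt))           # assumes well-formed JSON output
    total = sum(scores.values()) or 1.0
    return {node: score / total for node, score in scores.items()}  # normalize

def solution_engineer(call_llm: Callable[[str], str],
                      root_cause: str, evidence: str) -> str:
    return call_llm(
        f"The root cause is service '{root_cause}'. Based on the evidence below, "
        f"propose concrete remediation steps.\nEvidence:\n{evidence}"
    )
```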

Blockchain-Inspired Voting

The framework employs a blockchain-inspired voting mechanism to enhance decision accuracy. Each agent's vote carries a cumulative weight derived from its historical accuracy and level of participation, fostering a decentralized evaluation environment (the vote process on the agent chain is shown in Figure 4).

Figure 2: Two distinct agent workflows. A ReAct answer involves an iterative cycle of thought, action, and observation.
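A minimal sketch of such weighted voting follows, assuming each agent's weight combines an expertise signal (historical accuracy) and a contribution signal (participation). The weighting formula below is an illustrative assumption, not the paper's definition of its indices.

```python
# Hedged sketch of blockchain-inspired weighted voting among agents.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Voter:
    name: str
    historical_accuracy: float   # fraction of past votes that proved correct
    participation: int           # number of past voting rounds joined

    @property
    def weight(self) -> float:
        # Illustrative: accuracy dominates, participation adds a mild bonus.
        return self.historical_accuracy * (1.0 + 0.1 * min(self.participation, 10))

def tally(votes: dict, voters: dict) -> str:
    """votes maps agent name -> proposed root-cause node; returns the winner."""
    totals = defaultdict(float)
    for agent_name, candidate in votes.items():
        totals[candidate] += voters[agent_name].weight
    return max(totals, key=totals.get)

if __name__ == "__main__":
    voters = {
        "data_detective": Voter("data_detective", 0.8, 12),
        "dependency_explorer": Voter("dependency_explorer", 0.7, 9),
        "probability_oracle": Voter("probability_oracle", 0.9, 15),
    }
    votes = {"data_detective": "I", "dependency_explorer": "G", "probability_oracle": "I"}
    print(tally(votes, voters))   # -> "I"
```

Weighting by historical accuracy lets agents that have been right in past rounds pull the final decision harder, which is the intuition behind the decentralized, blockchain-inspired scheme.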

Performance Evaluation

In experiments on the AIOps challenge dataset and the Train-Ticket dataset, mABC demonstrates superior precision in RCA tasks, outperforming existing frameworks on metrics such as Root Cause Result Accuracy (RA) and Root Cause Path Accuracy (PA).

Experimental Results

Datasets

The framework was tested on both the AIOps challenge dataset and the Train-Ticket dataset, the latter built on Train-Ticket, a complex micro-service-based train-booking system. Faults were injected using tools such as ChaosBlade to generate diverse anomaly scenarios.

Evaluation Metrics and Outcomes

Outcomes were gauged using RA and PA, reflecting mABC's capacity to accurately pinpoint root causes and trace error-propagation paths; on both metrics it outperforms the baseline models.

Figure 4: Vote process on Agent Chain.
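For concreteness, here is a hedged sketch of how these two metrics might be computed, under the assumption that RA is exact-match accuracy on the predicted root-cause node and PA rewards the longest matching prefix of the predicted propagation path; the paper's exact definitions may differ.

```python
# Assumed metric definitions for illustration only.
def root_cause_accuracy(predictions: list, truths: list) -> float:
    """RA: fraction of cases where the predicted root-cause node is exactly right."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths) if truths else 0.0

def path_accuracy(pred_paths: list, true_paths: list) -> float:
    """PA (assumed): average fraction of the ground-truth path matched as a prefix."""
    scores = []
    for pred, true in zip(pred_paths, true_paths):
        match = 0
        for p, t in zip(pred, true):
            if p != t:
                break
            match += 1
        scores.append(match / len(true) if true else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Example: ground-truth path I -> G -> D -> A versus a prediction I -> G -> A
print(root_cause_accuracy(["I"], ["I"]))                         # 1.0
print(path_accuracy([["I", "G", "A"]], [["I", "G", "D", "A"]]))  # 0.5
```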

Conclusion

mABC introduces a comprehensive way to conduct RCA in micro-services architectures by fusing multi-agent systems with blockchain-inspired decision-making. The combination of LLMs and agent collaboration paves the way for advances in AIOps, addressing emerging challenges with a high degree of accuracy and efficiency. Future work should focus on refining agent collaboration strategies and exploring broader application scenarios, reinforcing the framework's applicability in complex IT environments.
