
mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture (2404.12135v3)

Published 18 Apr 2024 in cs.MA, cs.CR, and cs.DC

Abstract: Root cause analysis (RCA) in micro-services architecture (MSA), whose complexity keeps escalating, faces substantial challenges in maintaining system stability and efficiency due to fault propagation and circular dependencies among nodes. Diverse faults in root cause analysis require multiple agents with diverse expertise. To mitigate the hallucination problem of LLMs, we design blockchain-inspired voting to ensure the reliability of the analysis through a decentralized decision-making process. To avoid non-terminating loops caused by the circular dependencies common in MSA, we objectively limit steps and standardize task processing through Agent Workflow. We propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), where multiple agents based on powerful LLMs follow Agent Workflow and collaborate in blockchain-inspired voting. Specifically, seven specialized agents derived from Agent Workflow each provide valuable insights towards root cause analysis based on their expertise and the intrinsic software knowledge of LLMs, collaborating within a decentralized chain. Our experiments on the AIOps challenge dataset and a newly created Train-Ticket dataset demonstrate superior performance in identifying root causes and generating effective resolutions. The ablation study further highlights that Agent Workflow, multi-agent collaboration, and blockchain-inspired voting are crucial for achieving optimal performance. mABC offers comprehensive automated root cause analysis and resolution in micro-services architecture and significantly benefits the IT operations domain. The code and dataset are available at https://github.com/zwpride/mABC.


Summary

  • The paper introduces mABC, a novel framework integrating multi-agent systems and blockchain-inspired voting to enhance root cause analysis in micro-services.
  • It leverages LLMs to empower specialized agents and uses decentralized voting based on historical accuracy for robust decision-making.
  • Experimental results on AIOps and Train-Ticket datasets demonstrate improved fault detection and propagation tracing compared to traditional methods.

mABC: A Multi-Agent Blockchain-Inspired Approach for Root Cause Analysis

Introduction

The research paper "mABC: Multi-Agent Blockchain-Inspired Collaboration for Root Cause Analysis in Micro-Services Architecture" explores the complexities of identifying root causes within micro-services architectures (MSA). As these distributed systems grow, maintaining their stability and efficiency becomes increasingly challenging. The paper introduces mABC, a framework leveraging multi-agent systems and LLMs, inspired by blockchain governance, to facilitate accurate and efficient root cause analysis (RCA).

Framework Overview

Multi-Agent and Blockchain-Inspired Collaboration

mABC employs a multi-agent approach in which each agent specializes in a particular aspect of RCA. The decentralized decision-making process is reminiscent of blockchain governance: voting exploits the transparency and egalitarian structure inherent in blockchain technology, and each agent's vote is weighted by its contribution and expertise indices.

Figure 1: Example of root cause analysis in micro-services architecture (an alert event arises on node A while the root cause is node I, with fault propagation path I → G → D → A).
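To make the fault-propagation setting concrete, below is a minimal sketch (not the paper's algorithm) of how candidate propagation paths could be enumerated over a service dependency graph. The node names mirror the Figure 1 example; the graph, the `candidate_paths` helper, and the depth bound are illustrative assumptions.

```python
# Hypothetical sketch: enumerate candidate fault-propagation paths in a
# service dependency graph, mirroring the Figure 1 example (alert on A,
# root cause I, propagation path I -> G -> D -> A).
from collections import deque

# Edges point from a service to the services it depends on, so a fault can
# propagate from a dependency back up to the alerting node.
DEPENDS_ON = {
    "A": ["B", "D"],
    "B": ["C"],
    "D": ["E", "G"],
    "G": ["H", "I"],
    "I": [],
}

def candidate_paths(alert_node: str, max_depth: int = 5):
    """Yield dependency chains from potential root causes up to the alerting
    node, bounded by max_depth to cope with circular dependencies."""
    queue = deque([[alert_node]])
    while queue:
        path = queue.popleft()
        deps = DEPENDS_ON.get(path[-1], [])
        if not deps or len(path) >= max_depth:
            yield list(reversed(path))      # root cause first, alert node last
            continue
        for dep in deps:
            if dep not in path:             # skip already-visited (cyclic) nodes
                queue.append(path + [dep])

if __name__ == "__main__":
    for p in candidate_paths("A"):
        print(" -> ".join(p))               # includes I -> G -> D -> A
```

Bounding the search depth and skipping nodes already on the path reflects the paper's emphasis on avoiding non-terminating loops caused by circular dependencies.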

Agent Workflow

The pipeline introduces specialized agents such as the Alert Receiver and Process Scheduler, which manage task elections and scheduling. The agents operate on a standardized workflow, distinguishing between simple direct responses and complex iterative processes for RCA.
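The distinction between the two answer modes can be sketched as follows, assuming a generic `call_llm` completion function and a `run_tool` executor; both are hypothetical placeholders rather than the repository's API, and the prompts and step budget are assumptions.

```python
# Minimal sketch of the two workflow styles: a direct answer for simple
# queries versus a bounded ReAct-style loop (thought -> action -> observation)
# for complex RCA sub-tasks.
from typing import Callable

MAX_STEPS = 8  # hard step limit, mirroring the paper's bounded iteration

def direct_answer(call_llm: Callable[[str], str], task: str) -> str:
    return call_llm(f"Answer directly and concisely:\n{task}")

def react_answer(call_llm: Callable[[str], str],
                 run_tool: Callable[[str], str],
                 task: str) -> str:
    context = f"Task: {task}"
    for _ in range(MAX_STEPS):
        thought = call_llm(f"{context}\nThought: what should be done next?")
        action = call_llm(f"{context}\nThought: {thought}\nAction (tool call):")
        if action.strip().lower().startswith("finish"):
            return call_llm(f"{context}\nGive the final answer.")
        observation = run_tool(action)      # e.g. query metrics, traces, logs
        context += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    # Step budget exhausted: force a best-effort answer instead of looping forever.
    return call_llm(f"{context}\nStep limit reached. Give the best answer so far.")
```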

Technical Workflow

The typical workflow begins with alert detection, followed by prioritization by the Alert Receiver. The Process Scheduler then deconstructs the problem and distributes tasks to specialized agents such as the Data Detective and Dependency Explorer, each following one of two distinct workflows (Figure 2); the overall pipeline is summarized in Figure 3.

Figure 3: Overview of mABC. The overall pipeline encapsulates the flow from alert inception to root cause analysis.
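Under the assumption that each specialized agent exposes a simple `handle(subtask)` method, the dispatch step could look roughly like the sketch below. The class names, the hard-coded decomposition, and the `EchoAgent` stand-in are illustrative; in mABC the decomposition itself is LLM-driven.

```python
# Hedged sketch of the alert -> prioritization -> decomposition -> dispatch flow.
from dataclasses import dataclass, field

@dataclass(order=True)
class Alert:
    priority: int                           # lower number = more urgent
    description: str = field(compare=False)

class EchoAgent:
    """Stand-in for an LLM-backed agent; real agents would call an LLM and tools."""
    def __init__(self, name: str):
        self.name = name
    def handle(self, subtask: str) -> str:
        return f"[{self.name}] handled: {subtask}"

class ProcessScheduler:
    def __init__(self, agents: dict):
        self.agents = agents

    def decompose(self, alert: Alert) -> list:
        # In mABC this decomposition is LLM-driven; here it is hard-coded.
        return [
            ("data_detective", f"Collect metrics and logs around: {alert.description}"),
            ("dependency_explorer", f"Map upstream dependencies for: {alert.description}"),
            ("probability_oracle", f"Rank candidate root causes for: {alert.description}"),
            ("solution_engineer", f"Draft a remediation for: {alert.description}"),
        ]

    def run(self, alerts: list) -> dict:
        results = {}
        for alert in sorted(alerts):        # Alert Receiver-style prioritization
            for agent_name, subtask in self.decompose(alert):
                results[(alert.description, agent_name)] = self.agents[agent_name].handle(subtask)
        return results

if __name__ == "__main__":
    names = ("data_detective", "dependency_explorer", "probability_oracle", "solution_engineer")
    out = ProcessScheduler({n: EchoAgent(n) for n in names}).run(
        [Alert(2, "latency spike on node A"), Alert(1, "error rate surge on node D")])
    for key, value in out.items():
        print(key, "->", value)
```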

Methodology

Agent Capabilities

The agents utilize LLMs to interpret and act on information. For example, the Probability Oracle evaluates failure probabilities, while the Solution Engineer synthesizes resolutions. Such capabilities ensure the framework's thoroughness in addressing complex RCAs.
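As a hedged illustration of two of these roles, the sketch below shows how a Probability Oracle might ask an LLM to score candidate root-cause nodes and how a Solution Engineer might turn the top candidate into a remediation plan. The prompts, the JSON output contract, and the `call_llm` callable are assumptions made for illustration only.

```python
# Illustrative role sketches; not the paper's actual prompts or interfaces.
import json
from typing import Callable

def probability_oracle(call_llm: Callable[[str], str],
                       candidates: list, evidence: str) -> dict:
    prompt = (
        "Given the evidence below, return a JSON object mapping each candidate "
        f"service to its failure probability in [0, 1].\nCandidates: {candidates}\n"
        f"Evidence:\n{evidence}"
    )
    scores = json.loads(call_llm(prompt))           # assumes well-formed JSON output
    total = sum(scores.values()) or 1.0
    return {node: score / total for node, score in scores.items()}  # normalize

def solution_engineer(call_llm: Callable[[str], str],
                      root_cause: str, evidence: str) -> str:
    return call_llm(
        f"The root cause is service '{root_cause}'. Based on the evidence below, "
        f"propose concrete remediation steps.\nEvidence:\n{evidence}"
    )
```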

Blockchain-Inspired Voting

The framework employs a blockchain-inspired voting mechanism to enhance decision accuracy. Each agent's vote carries a cumulative weight derived from its historical accuracy and level of participation, fostering a decentralized evaluation environment (the vote process on the agent chain is shown in Figure 4).

Figure 2: Two distinct agent workflows. A ReAct answer involves an iterative cycle of thought, action, and observation.
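A minimal sketch of such weighted voting follows, assuming each agent's weight combines an expertise signal (historical accuracy) and a contribution signal (participation). The weighting formula below is an illustrative assumption, not the paper's definition of its indices.

```python
# Hedged sketch of blockchain-inspired weighted voting among agents.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Voter:
    name: str
    historical_accuracy: float   # fraction of past votes that proved correct
    participation: int           # number of past voting rounds joined

    @property
    def weight(self) -> float:
        # Illustrative: accuracy dominates, participation adds a mild bonus.
        return self.historical_accuracy * (1.0 + 0.1 * min(self.participation, 10))

def tally(votes: dict, voters: dict) -> str:
    """votes maps agent name -> proposed root-cause node; returns the winner."""
    totals = defaultdict(float)
    for agent_name, candidate in votes.items():
        totals[candidate] += voters[agent_name].weight
    return max(totals, key=totals.get)

if __name__ == "__main__":
    voters = {
        "data_detective": Voter("data_detective", 0.8, 12),
        "dependency_explorer": Voter("dependency_explorer", 0.7, 9),
        "probability_oracle": Voter("probability_oracle", 0.9, 15),
    }
    votes = {"data_detective": "I", "dependency_explorer": "G", "probability_oracle": "I"}
    print(tally(votes, voters))   # -> "I"
```

Weighting by historical accuracy lets agents that have been right in past rounds pull the final decision harder, which is the intuition behind the decentralized, blockchain-inspired scheme.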

Performance Evaluation

In experiments on the AIOps challenge dataset and the Train-Ticket dataset, mABC demonstrates superior precision in RCA tasks, outperforming existing frameworks on metrics such as Root Cause Result Accuracy (RA) and Root Cause Path Accuracy (PA).

Experimental Results

Datasets

The framework was tested on both the AIOps challenge dataset and the Train-Ticket dataset, the latter built on Train-Ticket, a complex micro-service-based train-booking system. Faults were injected using tools such as ChaosBlade to generate diverse anomaly scenarios.

Evaluation Metrics and Outcomes

Outcomes were gauged using RA and PA, reflecting mABC's capacity to accurately pinpoint root causes and trace error-propagation paths; on both metrics it outperforms the baseline models.

Figure 4: Vote process on Agent Chain.
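For concreteness, here is a hedged sketch of how these two metrics might be computed, under the assumption that RA is exact-match accuracy on the predicted root-cause node and PA rewards the longest matching prefix of the predicted propagation path; the paper's exact definitions may differ.

```python
# Assumed metric definitions for illustration only.
def root_cause_accuracy(predictions: list, truths: list) -> float:
    """RA: fraction of cases where the predicted root-cause node is exactly right."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths) if truths else 0.0

def path_accuracy(pred_paths: list, true_paths: list) -> float:
    """PA (assumed): average fraction of the ground-truth path matched as a prefix."""
    scores = []
    for pred, true in zip(pred_paths, true_paths):
        match = 0
        for p, t in zip(pred, true):
            if p != t:
                break
            match += 1
        scores.append(match / len(true) if true else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Example: ground-truth path I -> G -> D -> A versus a prediction I -> G -> A
print(root_cause_accuracy(["I"], ["I"]))                         # 1.0
print(path_accuracy([["I", "G", "A"]], [["I", "G", "D", "A"]]))  # 0.5
```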

Conclusion

mABC introduces a comprehensive way to conduct RCA in micro-services architectures by fusing multi-agent systems with blockchain-inspired decision-making. The combination of LLMs and agent collaboration paves the way for advances in AIOps, addressing emerging challenges with a high degree of accuracy and efficiency. Future work should focus on refining agent collaboration strategies and exploring broader application scenarios, reinforcing the framework's applicability in complex IT environments.
