- The paper presents Flow-of-Action, which integrates SOP summarization with LLM-based multi-agent systems to guide the diagnostic process and reduce hallucinations.
- It employs a comprehensive SOP flow that converts diagnostic procedures into executable code, significantly enhancing root cause analysis precision.
- The system combines an action set mechanism with multi-agent collaboration to optimize fault identification and outperform traditional ReAct models.
Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
Introduction
In microservices architectures, incidents occur frequently, and Root Cause Analysis (RCA) is essential for resolving them quickly. A prominent trend is to use LLMs as automated agents for RCA. Frameworks like ReAct have been adopted because they match the thought-action-observation paradigm of Site Reliability Engineers (SREs), but they suffer from hallucinations that lead to irrelevant actions and degrade RCA outcomes. The complexity and variability of incident clues make these problems worse. The paper introduces Flow-of-Action, a novel Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system (MAS), which aims to improve RCA by constraining LLMs at crucial steps and thereby guiding the diagnostic process.
Figure 1: Illustration example of challenge 2.
Flow-of-Action Framework
Flow-of-Action uses SOPs, which summarize the diagnostic steps of SREs, as a constraint mechanism that keeps the RCA process on an accurate trajectory. To make SOPs usable in practice, the authors design the SOP flow, a set of tools for finding and generating SOPs, converting SOPs into executable code, and managing the RCA process. These tools alleviate ReAct's hallucination issues in RCA tasks and raise RCA accuracy from 35.50% to 64.01%.
Figure 2: Multimodal data collection and analysis.
Flow-of-Action employs a knowledge base that combines SOP knowledge with historical incident details. The SOP knowledge consists of domain expertise and algorithmically extracted SOPs, while the historical incidents record how past faults manifested and so help with RCA. The agents' tools fall into three groups: multimodal data collection, SOP flow-related tools, and other analytical tools. Multimodal data collection is essential for handling the diverse data types that services produce, and structured data is preprocessed so that the LLM can interpret it more easily.
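As a rough illustration of what preprocessing structured data for an LLM could look like, the sketch below condenses a metric time series into one line of text. The function name `summarize_metric`, the threshold parameter, and the output wording are assumptions made for this example; the summary does not describe the paper's actual preprocessing.

```python
from statistics import mean
from typing import Sequence


def summarize_metric(name: str, values: Sequence[float], threshold: float) -> str:
    """Condense a numeric series into a short sentence an LLM can read.

    Hypothetical helper: the statistics and wording are illustrative only.
    """
    if not values:
        return f"{name}: no data"
    peak = max(values)
    # Compare only the peak against the threshold to flag a possible anomaly.
    status = "above" if peak > threshold else "within"
    return (
        f"{name}: mean={mean(values):.2f}, max={peak:.2f}, "
        f"{status} threshold {threshold:.2f}"
    )


if __name__ == "__main__":
    print(summarize_metric("cpu", [0.5, 0.7, 0.9], threshold=0.8))
```

A compact line like this fits in a prompt far more easily than the raw series, at the cost of discarding detail that a later, more targeted tool call would have to recover.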
SOP Flow
The SOP flow establishes a structured logic chain that guides the LLM's actions during RCA. SOPs are first matched to fault types, which supports hierarchical application of SOPs. When an incident matches no existing SOP, the LLM generates a new one. Rather than remaining as text, SOPs are converted into executable code so that they can be run systematically and yield accurate RCA results. The prompts for the SOP flow impose soft constraints, which reduces chaotic tool orchestration while preserving flexibility.
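The match, generate, and convert-to-code steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the `SOP` class, the `SOP_BASE` and `STEP_IMPL` registries, and every function name are hypothetical, and `generate_sop` is a stub standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SOP:
    """A named, ordered list of diagnostic steps for one fault type."""
    name: str
    fault_type: str
    steps: List[str] = field(default_factory=list)


# Hypothetical SOP knowledge base, keyed by fault type.
SOP_BASE: Dict[str, SOP] = {
    "high_latency": SOP("latency_sop", "high_latency", ["check_cpu", "check_network"]),
}

# Hypothetical registry of executable checks; each returns True when anomalous.
STEP_IMPL: Dict[str, Callable[[Dict[str, float]], bool]] = {
    "check_cpu": lambda m: m.get("cpu", 0.0) > 0.9,
    "check_network": lambda m: m.get("packet_loss", 0.0) > 0.05,
}


def match_sop(fault_type: str) -> Optional[SOP]:
    """Look up an existing SOP for the fault type, if any."""
    return SOP_BASE.get(fault_type)


def generate_sop(fault_type: str) -> SOP:
    """Stub for LLM generation: returns a fixed one-step SOP (assumption)."""
    return SOP(f"generated_{fault_type}", fault_type, ["check_cpu"])


def sop_to_code(sop: SOP) -> Callable[[Dict[str, float]], List[str]]:
    """Turn the SOP's step names into one callable that runs them in order."""
    def run(metrics: Dict[str, float]) -> List[str]:
        # Return the names of the steps whose checks flagged an anomaly.
        return [step for step in sop.steps if STEP_IMPL[step](metrics)]
    return run


def run_sop_flow(fault_type: str, metrics: Dict[str, float]) -> List[str]:
    """Match an SOP, fall back to generation, then execute it as code."""
    sop = match_sop(fault_type) or generate_sop(fault_type)
    return sop_to_code(sop)(metrics)


if __name__ == "__main__":
    print(run_sop_flow("high_latency", {"cpu": 0.95, "packet_loss": 0.01}))
```

Keeping the SOP as data and compiling it to a callable only at run time means a newly generated SOP goes through the same execution path as a curated one.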
Figure 3: Comparison of ReAct and Flow-of-Action. RC means root cause. Dashed lines represent paths triggered under specific conditions.
Action Set Mechanism
To handle diverse observations and the large number of feasible actions, the paper introduces an action set mechanism: before selecting an action, the LLM generates a reasoned set of candidate actions, comprising options informed by the SOP flow and options approved by the judge agent. This balances randomness and determinism, prevents excessive trial and error, and yields more concise diagnostic paths.
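One way to picture the action set is as a merge of two candidate sources followed by a selection step. The sketch below is illustrative only: `build_action_set`, the `approve` callback (standing in for a judge agent), and the `max_size` cap are assumptions, not details taken from the paper.

```python
from typing import Callable, List


def build_action_set(
    flow_actions: List[str],
    proposed_actions: List[str],
    approve: Callable[[str], bool],
    max_size: int = 5,
) -> List[str]:
    """Merge flow-derived actions with approved LLM proposals.

    Flow-derived actions come first (the deterministic part); free-form
    proposals are kept only if the approval callback accepts them.
    """
    approved = [action for action in proposed_actions if approve(action)]
    seen = set()
    action_set: List[str] = []
    for action in flow_actions + approved:
        if action not in seen:  # drop duplicates, keep first occurrence
            seen.add(action)
            action_set.append(action)
    return action_set[:max_size]


if __name__ == "__main__":
    candidates = build_action_set(
        flow_actions=["match_sop"],
        proposed_actions=["check_logs", "reboot_cluster", "match_sop"],
        approve=lambda action: action != "reboot_cluster",
    )
    print(candidates)
```

The agent would then pick one action from the returned list, so the flow narrows the choices without fully dictating them.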
Multi-Agent System Design
Flow-of-Action's MAS comprises several agents, including MainAgent, CodeAgent, JudgeAgent, ObAgent, and ActionAgent, each with a specific role in the RCA procedure. Dividing the work this way makes the process more rational, mitigates information overload, and improves fault identification accuracy.
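A toy orchestration loop gives a feel for how such agents might cooperate. Only the agent names come from the text above. The role given to each agent here (ActionAgent proposes actions, ObAgent condenses observations, JudgeAgent decides whether a root cause has been found, MainAgent drives the loop) is an assumption for illustration, as are all method names. Every agent is a rule-based stub rather than an LLM, and CodeAgent is omitted for brevity.

```python
from typing import Callable, Dict, List, Optional


class ActionAgent:
    """Stub: proposes the not-yet-tried actions from a fixed plan."""
    PLAN = ["check_metrics", "check_logs", "check_traces"]

    def propose(self, history: List[str]) -> List[str]:
        return [action for action in self.PLAN if action not in history]


class ObAgent:
    """Stub: condenses a raw tool output to its first line."""
    def summarize(self, raw: str) -> str:
        text = raw.strip()
        return text.splitlines()[0] if text else ""


class JudgeAgent:
    """Stub: declares a root cause when an observation is tagged ANOMALY."""
    PREFIX = "ANOMALY:"

    def root_cause(self, observations: List[str]) -> Optional[str]:
        for observation in observations:
            if observation.startswith(self.PREFIX):
                return observation[len(self.PREFIX):].strip()
        return None


class MainAgent:
    """Drives the loop: propose, act, observe, judge."""
    def __init__(self, tools: Dict[str, Callable[[], str]]) -> None:
        self.tools = tools
        self.action_agent = ActionAgent()
        self.ob_agent = ObAgent()
        self.judge = JudgeAgent()

    def run(self, max_steps: int = 5) -> Optional[str]:
        history: List[str] = []
        observations: List[str] = []
        for _ in range(max_steps):
            actions = self.action_agent.propose(history)
            if not actions:
                break
            action = actions[0]  # toy selection: take the first candidate
            history.append(action)
            observations.append(self.ob_agent.summarize(self.tools[action]()))
            cause = self.judge.root_cause(observations)
            if cause is not None:
                return cause
        return None


if __name__ == "__main__":
    tools = {
        "check_metrics": lambda: "ok\n",
        "check_logs": lambda: "ANOMALY: disk full on node-3\nstack trace follows",
        "check_traces": lambda: "ok",
    }
    print(MainAgent(tools).run())
```

Separating the observation and judging steps keeps each stub's input small, which mirrors the stated goal of limiting information overload.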
Evaluation
Experiments against a range of baseline systems show Flow-of-Action's superiority: it achieves an RCA accuracy of 64.01%, significantly surpassing ReAct. Ablation studies confirm that SOP enhancement, the action set mechanism, and multi-agent collaboration each contribute to RCA efficiency and accuracy.
Figure 4: Example of Flow-of-Action.
Conclusion
The Flow-of-Action framework offers a refined approach to RCA built on an SOP-enhanced MAS. Integrating SOPs with LLMs reduces orchestration inefficiencies and improves diagnostic accuracy, with potentially broad implications for AI-enhanced IT operations. Future research may explore adapting the framework to other domains and scenarios within artificial intelligence applications.
References
For the specific methodologies and experiments, see the detailed citation list provided in the paper: Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis (2502.08224).