Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis

Published 12 Feb 2025 in cs.SE | (2502.08224v1)

Abstract: In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing LLMs as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.

Authors (12)

Summary

The paper presents Flow-of-Action, which integrates SOP summarization with LLM-based multi-agent systems to guide the diagnostic process and reduce hallucinations.
It employs a comprehensive SOP flow that converts diagnostic procedures into executable code, significantly enhancing root cause analysis precision.
The system combines an action set mechanism with multi-agent collaboration to optimize fault identification and outperform traditional ReAct models.

Introduction

In microservices architectures, incidents occur frequently, and Root Cause Analysis (RCA) is needed to resolve them quickly. A prominent trend is to use LLMs as automated agents for RCA. Frameworks such as ReAct fit this task because their thought-action-observation paradigm mirrors how Site Reliability Engineers (SREs) work, but their hallucinations lead to irrelevant actions that directly affect subsequent results. Complex and variable incident clues make the problem worse. The paper introduces Flow-of-Action, a Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system (MAS) that constrains the LLM at crucial junctures and thereby guides the diagnostic process toward the correct trajectory.

Figure 1: Illustration example of challenge 2.

Flow-of-Action Framework

Flow-of-Action explicitly summarizes the diagnostic steps of SREs as SOPs, which constrain the LLM and guide the RCA process toward the correct trajectory. To make SOPs usable in practice, the paper designs SOP flow, an SOP-centric framework with tools for finding SOPs relevant to an incident, generating SOPs for incidents that have none, and converting SOPs into executable code. These tools alleviate ReAct's hallucination issues in RCA tasks; the full Flow-of-Action method reaches 64.01% accuracy, compared with 35.50% for ReAct.

Figure 2: Multimodal data collection and analysis.

Knowledge Base and Tools

Flow-of-Action employs a knowledge base that combines SOP knowledge with historical incident details. SOP knowledge covers both domain expertise and algorithmically extracted SOPs, while historical incidents record how past faults manifested, which aids RCA on new ones. Agents use tools in three groups: multimodal data collection, SOP flow-related tools, and other analytical tools. Multimodal data collection is essential for handling the diverse data types that services produce, and structured data is preprocessed so that the LLM can interpret it.
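The two-part knowledge base described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names `SOP`, `Incident`, and `KnowledgeBase`, the word-overlap (Jaccard) similarity, and the `threshold` value are hypothetical, and the summary does not say how the paper actually retrieves SOPs or incidents.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class SOP:
    name: str          # short description of the fault type the SOP covers
    steps: List[str]   # ordered diagnostic steps


@dataclass
class Incident:
    description: str   # how the past fault manifested
    root_cause: str


@dataclass
class KnowledgeBase:
    """Hypothetical store holding SOP knowledge and historical incidents."""
    sops: List[SOP] = field(default_factory=list)
    history: List[Incident] = field(default_factory=list)

    @staticmethod
    def _tokens(text: str) -> Set[str]:
        return set(text.lower().split())

    def _similarity(self, a: str, b: str) -> float:
        # Jaccard overlap of word sets; a stand-in for a real retriever.
        ta, tb = self._tokens(a), self._tokens(b)
        union = ta | tb
        return len(ta & tb) / len(union) if union else 0.0

    def match_sop(self, query: str, threshold: float = 0.2) -> Optional[SOP]:
        """Return the best-matching SOP, or None if nothing is close enough."""
        best = max(self.sops, key=lambda s: self._similarity(query, s.name),
                   default=None)
        if best is None or self._similarity(query, best.name) < threshold:
            return None
        return best

    def similar_incidents(self, query: str, k: int = 1) -> List[Incident]:
        """Return the k past incidents whose descriptions overlap most."""
        ranked = sorted(self.history,
                        key=lambda i: self._similarity(query, i.description),
                        reverse=True)
        return ranked[:k]
```

Returning `None` from `match_sop` is a deliberate choice in this sketch: it gives the caller an explicit signal that no relevant SOP exists, which is the case where the summary says a new SOP is generated.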

SOP Flow

SOP flow establishes a structured logic chain that guides the LLM's actions during RCA. An incident is first linked to SOPs by fault type, and SOPs can then be applied hierarchically. When an incident matches no existing SOP, the LLM generates a new one. Rather than remaining plain text, SOPs are converted into executable code so that their steps are carried out systematically and accurately. The prompts for SOP flow are written as soft constraints: they reduce chaotic tool orchestration while leaving the agent room to deviate when needed.
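The match, generate, and convert-to-code sequence can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `sop_flow`, `compile_sop`, the dictionary-based SOP library, and the tool registry are hypothetical names, and the LLM call that writes a new SOP is replaced by a plain callable.

```python
from typing import Callable, Dict, List

SOPSteps = List[str]              # an SOP as an ordered list of tool names
Tool = Callable[[], str]          # a tool returns a textual observation


def compile_sop(steps: SOPSteps, tools: Dict[str, Tool]) -> List[Tool]:
    """Turn SOP text steps into callables (assumed 'SOP to code' step).

    Steps that name an unknown tool are rejected up front, so a
    hallucinated step cannot silently enter the executable program.
    """
    unknown = [s for s in steps if s not in tools]
    if unknown:
        raise KeyError(f"SOP references unknown tools: {unknown}")
    return [tools[s] for s in steps]


def sop_flow(fault_type: str,
             sop_library: Dict[str, SOPSteps],
             generate_sop: Callable[[str], SOPSteps],
             tools: Dict[str, Tool]) -> List[str]:
    """Match an SOP, generate one if none exists, then run it as code."""
    steps = sop_library.get(fault_type)
    if steps is None:
        # No relevant SOP: ask the (stubbed) LLM for one and remember it.
        steps = generate_sop(fault_type)
        sop_library[fault_type] = steps
    program = compile_sop(steps, tools)
    return [run_step() for run_step in program]
```

A call such as `sop_flow("memory_leak", library, llm_stub, tools)` returns one observation per SOP step. In this sketch the constraint is hard (an unknown tool raises an error), whereas the summary describes the paper's prompts as soft constraints.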

Figure 3: Comparison of ReAct and Flow-of-Action. RC means root cause. Dashed lines represent paths triggered under specific conditions.

Action Set Mechanism

Observations are diverse and many actions are feasible at each step. To handle this, the paper introduces the action set mechanism: before selecting an action, the LLM generates a reasoned set of candidate actions, made up of options informed by SOP flow and options approved by the judge agent. This balances randomness and determinism, prevents excessive trial and error, and yields shorter diagnostic paths.
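One way to picture the action set is the sketch below. It assumes a simple policy that the summary does not spell out: the next SOP flow step always enters the set, and other proposed actions enter only if a judge callable approves them. The function name `build_action_set`, the `max_size` cap, and the action strings are hypothetical.

```python
from typing import Callable, List, Optional


def build_action_set(proposed: List[str],
                     sop_next_step: Optional[str],
                     judge_approves: Callable[[str], bool],
                     max_size: int = 3) -> List[str]:
    """Assemble a bounded candidate set before the main agent picks one action."""
    action_set: List[str] = []
    if sop_next_step is not None:
        # Flow-informed option: the deterministic part of the set.
        action_set.append(sop_next_step)
    for action in proposed:
        if len(action_set) >= max_size:
            break
        # Judge-approved options: the exploratory part of the set.
        if action not in action_set and judge_approves(action):
            action_set.append(action)
    return action_set
```

Capping the set size is what keeps trial and error bounded in this sketch; the main agent then chooses among a few vetted candidates instead of every feasible action.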

Multi-Agent System Design

Flow-of-Action's MAS comprises multiple agents such as MainAgent, CodeAgent, JudgeAgent, ObAgent, and ActionAgent, each fulfilling specific roles across the RCA procedures. This design enhances process rationality, addresses information overload, and improves fault identification accuracy through cooperative multi-agent orchestration.
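The cooperation between agents can be sketched as a simple loop. The mapping of names to duties is an assumption drawn from the abstract's description of the auxiliary agents (removing noise, narrowing the search space, deciding whether RCA can stop); the `Agents` container, the callable signatures, and `run_rca` are hypothetical, every agent is a stub rather than an LLM call, and CodeAgent is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Agents:
    # Narrows the search space by proposing candidate actions from the history.
    action_agent: Callable[[List[str]], List[str]]
    # Picks one action from the candidates.
    main_agent: Callable[[List[str], List[str]], str]
    # Runs the chosen action (a tool call) and returns the raw output.
    execute: Callable[[str], str]
    # Strips useless noise from the raw output.
    ob_agent: Callable[[str], str]
    # Returns a root cause if RCA can stop, otherwise None.
    judge_agent: Callable[[List[str]], Optional[str]]


def run_rca(agents: Agents, max_steps: int = 5) -> Tuple[Optional[str], List[str]]:
    """Run the propose, pick, execute, filter, judge loop until a root cause is found."""
    history: List[str] = []
    for _ in range(max_steps):
        candidates = agents.action_agent(history)
        action = agents.main_agent(history, candidates)
        observation = agents.ob_agent(agents.execute(action))
        history.append(f"{action} -> {observation}")
        root_cause = agents.judge_agent(history)
        if root_cause is not None:
            return root_cause, history
    return None, history
```

The `max_steps` bound is a design choice of this sketch: it guarantees termination even if the judge never accepts a root cause.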

Evaluation

Experiments show that Flow-of-Action reaches 64.01% accuracy, well above the 35.50% achieved by ReAct. Ablation studies confirm that SOP enhancement, the action set mechanism, and multi-agent collaboration each contribute to the efficiency and accuracy of RCA.

Figure 4: Example of Flow-of-Action.

Conclusion

Flow-of-Action refines RCA by combining SOPs with an LLM-based multi-agent system. Integrating SOPs with LLMs reduces orchestration inefficiencies and improves diagnostic accuracy, which may have broad implications for AI-assisted IT operations. Future research may explore adapting the framework to other domains and scenarios.

References

To reference specific methodologies and experiments, see the detailed citation list provided in the paper [Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis, (2502.08224)].
