RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models (2310.16340v3)

Published 25 Oct 2023 in cs.SE and cs.CL

Abstract: LLM applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage. Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique Self-Consistency for action trajectories, and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show RCAgent's evident and consistent superiority over ReAct across all aspects of RCA -- predicting root causes, solutions, evidence, and responsibilities -- and tasks covered or uncovered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.


Summary

The paper introduces RCAgent, a tool-augmented LLM agent framework for autonomous cloud root cause analysis, prioritizing practicality and data privacy.
RCAgent's architecture incorporates a Controller Agent, domain-specific Expert Agents that use tools, and an Observation Snapshot Key (OBSK) mechanism for efficient context management.
Integrated into Alibaba Cloud's Real-time Compute Platform for Apache Flink, RCAgent outperforms ReAct in experiments on root cause, solution, and evidence prediction.

The paper "RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented LLMs" (2310.16340) introduces RCAgent, an LLM-based framework for cloud root cause analysis (RCA). Unlike existing methods, which depend on manually configured workflows, RCAgent relies on the LLM's own decision-making and environment interaction capabilities. The framework prioritizes practical application and data privacy, running on an internally deployed model rather than GPT models.

RCAgent Framework

RCAgent's architecture includes the following components:

Controller Agent: This is the core component, orchestrating actions within a thought-action-observation loop. It is guided by framework rules, task requirements, and tool documentation and uses JSON for data interchange.
Expert Agents (Analytical Tools): These extend the domain knowledge of the controller agent. Key expert agents include:
- Code Analysis Tool: This tool recursively searches code repositories, analyzes code files, and suggests relevant classes, providing summaries to the controller agent.
- Log Analysis Tool: This uses an in-context RAG approach: it splits logs into lines, constructs a graph representing the relevance between lines, clusters the graph into semantic partitions, and employs an LLM to analyze each partition and extract evidence, which makes the findings more reliable.
Information-Gathering Tools: These tools facilitate data querying within the cloud environment, abstracting unnecessary details to simplify usage for the LLM. Examples include tools that use entity IDs as parameters, rather than requiring knowledge of SQL or Log query APIs.
Observation Snapshot Key (OBSK): This addresses context length limitations by providing the controller agent with only the head of the observation alongside a hash ID (snapshot key). A key-value store then maps the snapshot key to the complete observation data, which can be retrieved when needed.
Key-Value Store: This stores the mapping between the snapshot key and the full observation data.
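The OBSK mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `ObservationStore`, the `head_chars` parameter, the 12-character truncated SHA-256 key, and the in-memory dictionary are all assumptions made for the example.

```python
import hashlib


class ObservationStore:
    """Hypothetical in-memory key-value store for full observations."""

    def __init__(self, head_chars: int = 200):
        # Assumed limit on how much of each observation the controller sees.
        self.head_chars = head_chars
        self._store: dict[str, str] = {}

    def snapshot(self, observation: str) -> dict:
        """Store the full observation and return only its head plus a key."""
        # Hash the full text so identical observations share one key.
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:12]
        self._store[key] = observation
        return {
            "snapshot_key": key,
            "head": observation[: self.head_chars],
            "truncated": len(observation) > self.head_chars,
        }

    def retrieve(self, key: str) -> str:
        """Return the complete observation for a snapshot key."""
        return self._store[key]


# Usage: the controller's context holds `view`; a tool can later
# expand the snapshot key back into the full observation.
store = ObservationStore(head_chars=20)
log = "ERROR TaskManager lost heartbeat\n" + "stack frame\n" * 50
view = store.snapshot(log)
full = store.retrieve(view["snapshot_key"])
```

Keying by content hash keeps the key short and deterministic, so it fits in the prompt at low token cost while the full observation stays outside the context window.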

Enhancements to RCAgent

RCAgent incorporates several enhancements to improve its performance and reliability:

Trajectory-level Self-Consistency (TSC): This applies Self-Consistency (SC) to action trajectories by aggregating the results of multiple sampled trajectories. To reduce computational cost, sampling starts only when the controller agent enters finalization. Two methods aggregate the free-text results: vote with embedding, which uses semantic embeddings to choose the text result closest to the majority, and aggregate with LLMs, which uses an LLM to summarize the candidate results.
JSON Repairing (JsonRegen): This ensures structured inference by repairing malformed JSON output from the LLM. The procedure replaces sensitive characters, instructs the LLM to convert the content to YAML, and then prompts the LLM to regenerate valid JSON with the same structure and content.
Error Handling: Predefined criteria flag problematic actions, such as duplicate tool invocations or trivial input to expert agents, as erroneous. The controller agent then receives an error message and suggestions, which reduces meaningless actions.
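The vote-with-embedding aggregation can be illustrated with a small sketch. It is one plausible reading of "closest to the majority": embed every candidate, average the embeddings, and return the candidate nearest that centroid. The bag-of-words `toy_embed` function stands in for a real semantic embedding model and is purely an assumption, as are all function names.

```python
import math
from collections import Counter


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    # Assumption: word counts stand in for a semantic embedding model.
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vote_with_embedding(candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the centroid."""
    if not candidates:
        raise ValueError("need at least one candidate")
    vocab = sorted({w for c in candidates for w in c.lower().split()})
    vectors = [toy_embed(c, vocab) for c in candidates]
    # The centroid approximates the "majority" position in embedding space.
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = [cosine(v, centroid) for v in vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Usage: final answers from three sampled trajectories (invented examples).
answers = [
    "checkpoint timeout caused by slow state backend",
    "checkpoint timeout due to slow state backend",
    "network partition between task managers",
]
winner = vote_with_embedding(answers)
```

Here the two checkpoint-related answers pull the centroid toward themselves, so the outlier about a network partition is not selected.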

Integration with Alibaba Cloud

RCAgent has been integrated into the diagnosis and issue discovery workflow of Alibaba Cloud's Real-time Compute Platform for Apache Flink. It is used to diagnose anomalous stream processing jobs not detected by existing methods. A feedback mechanism identifies issues in the PaaS and IaaS layers of the cloud system, providing insights for development teams.

Experimental Results

Experiments conducted on a dataset of anomalous jobs in Alibaba Cloud demonstrate RCAgent's superiority over the original ReAct framework:

Root Cause, Solution, and Evidence Prediction: On all three tasks, RCAgent shows significant improvement over ReAct in METEOR score and other metrics.

Ablation studies highlight the contribution of each component of RCAgent, demonstrating the importance of the LLM expert agents, JsonRegen, and OBSK. Self-Consistency with LLM summarization further improves performance, particularly in solution prediction. RCAgent's effectiveness in real-world scenarios is supported by results on Out-of-Domain (OoD) jobs in the deployed system, where human evaluators rated it as providing moderate support for RCA. It also determines the responsibility for the root cause with higher precision than ReAct and other non-agent-based RCA solutions.
