Adaptability of LLM-Based Agents to Root Cause Analysis

Determine whether large language model (LLM)-based agents can be effectively adapted to perform root cause analysis (RCA) of production incidents in cloud software systems.

Background

The paper argues that prior LLM-based RCA approaches cannot dynamically collect diagnostic information (e.g., logs, metrics, and database queries), which limits how accurately they can diagnose root causes. LLM-based agents promise improvements because they can reason, plan, and use tools to interact with external systems. RCA, however, requires specialized queries, domain knowledge, and access to confidential, out-of-distribution incident data.

Because agents typically need sophisticated prompting and sometimes fine-tuning or in-context examples to adapt to a domain, the authors highlight uncertainty about whether such agents can be adapted effectively to the RCA task. The study evaluates a ReAct agent in zero-shot settings and in a case study with specialized tools to shed light on this question.
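
To make the idea of dynamically collecting diagnostic information concrete, the following is a minimal sketch of a ReAct-style loop (thought, action, observation) in Python. It is an illustration only and not the authors' implementation: the tools `search_logs` and `get_metric`, their canned data, and `scripted_policy` (which stands in for the LLM so the example runs without a model) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical diagnostic tools with canned data. Real tools would query
# logging and monitoring backends; these names are assumptions.
def search_logs(service: str) -> str:
    fake_logs = {"checkout-service": "ERROR: connection pool exhausted"}
    return fake_logs.get(service, "no matching log lines")

def get_metric(name: str) -> str:
    fake_metrics = {"db_connections": "db_connections at 100% of limit"}
    return fake_metrics.get(name, "metric not found")

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_logs": search_logs,
    "get_metric": get_metric,
}

Step = Tuple[str, str, str]  # (thought, action, argument)

def scripted_policy(history: List[str]) -> Step:
    """Stand-in for the LLM: picks the next step from a fixed script,
    indexed by how many observations have been collected so far."""
    script: List[Step] = [
        ("Check the service logs first.", "search_logs", "checkout-service"),
        ("Logs mention the connection pool; check the metric.", "get_metric", "db_connections"),
        ("Both signals agree.", "finish", "database connection pool exhausted"),
    ]
    seen = sum(1 for line in history if line.startswith("Observation:"))
    return script[min(seen, len(script) - 1)]

def run_agent(incident: str,
              policy: Callable[[List[str]], Step],
              max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate thought, action, and observation until the policy finishes
    or the step budget runs out. Returns (root_cause, trace)."""
    history = [f"Incident: {incident}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg, history
        tool = TOOLS.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {observation}")
    return "no root cause identified within step budget", history

if __name__ == "__main__":
    root_cause, trace = run_agent("checkout failures", scripted_policy)
    print("\n".join(trace))
    print("Root cause:", root_cause)
```

In a zero-shot setting, `scripted_policy` would be replaced by a call to an LLM prompted with the incident description, the tool descriptions, and the trace so far; the loop structure itself stays the same.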

References

Therefore, while LLM agents offer exceptional abilities that go far beyond prior approaches, it is unclear whether they can be effectively adapted to the RCA task.

Exploring LLM-based Agents for Root Cause Analysis (2403.04123 - Roy et al., 2024) in Section 1, Introduction