RCAgent Framework: Autonomous Cloud RCA

Updated 7 November 2025

RCAgent Framework is a tool-augmented LLM-based autonomous agent system that integrates controller and expert agents for recursive cloud root cause analysis.
It employs advanced context observation management (OBSK) and output stabilization (JsonRegen) to ensure reliable, privacy-aware diagnostics in complex production environments.
Experimental results demonstrate RCAgent’s superior performance with lower hallucination and error rates compared to ReAct, validated through Alibaba Cloud deployments.

The RCAgent Framework (Wang et al., 2023) is a tool-augmented LLM autonomous agent architecture designed for practical, privacy-aware, and highly autonomous cloud Root Cause Analysis (RCA) in complex production environments. RCAgent combines internal, enterprise-hosted LLMs, domain-specific tooling, context and trajectory management, and stability mechanisms to address core challenges in AI-driven RCA, and it reports better performance and reliability than earlier multi-step prompting and agent architectures such as ReAct.

1. Architectural Design and Agentic Workflow

RCAgent employs a modular, two-tier agent structure. The core agent ("controller agent") implements a thought-action-observation loop, recursively planning diagnostic steps, invoking tools, and processing results in a sequential trajectory reminiscent of the ReAct paradigm. To address domain complexity, RCAgent integrates "expert agents"—LLM-driven tool modules specialized for analytics such as log or code analysis. All agents interact via structured JSON calls, with tight environmental integration to cloud observability APIs.

The control loop proceeds as:

The controller agent generates a "thought" (a reasoning step).
The agent issues an "action" by invoking a tool (e.g., log or code analytics, data fetch) through a structured call.
The environment returns an "observation": a structured result or data snippet.
The cycle repeats, with each (thought, action, observation) tuple appended to the prompt context.
Once it has gathered sufficient evidence, the agent finalizes with a structured RCA report.
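
The loop above can be illustrated with a minimal Python sketch. This is not RCAgent's implementation: the `run_controller` function, the `policy` callable standing in for the LLM, the `"tool"`/`"args"` JSON fields, and the `fetch_logs` tool are all hypothetical names chosen for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# One trajectory step: (thought, action as JSON text, observation).
Step = Tuple[str, str, str]
# Hypothetical stand-in for the LLM: maps the trajectory so far to (thought, action JSON).
Policy = Callable[[List[Step]], Tuple[str, str]]


def run_controller(policy: Policy,
                   tools: Dict[str, Callable[..., str]],
                   max_steps: int = 10) -> Dict[str, Any]:
    """Run a thought-action-observation loop until the policy finalizes."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, action_json = policy(trajectory)
        action = json.loads(action_json)
        name, args = action["tool"], action.get("args", {})
        if name == "finalize":
            # The arguments of the finalize action serve as the RCA report.
            return {"report": args, "trajectory": trajectory}
        if name in tools:
            observation = tools[name](**args)
        else:
            observation = f"error: unknown tool {name!r}"
        # Append the full tuple so the next policy call sees it as context.
        trajectory.append((thought, action_json, observation))
    return {"report": None, "trajectory": trajectory}


def scripted_policy(trajectory: List[Step]) -> Tuple[str, str]:
    """Scripted policy for the demo: fetch logs once, then finalize."""
    if not trajectory:
        action = {"tool": "fetch_logs", "args": {"job_id": "job-1"}}
        return "Check the job logs first.", json.dumps(action)
    action = {"tool": "finalize", "args": {"root_cause": "OutOfMemoryError"}}
    return "Enough evidence gathered.", json.dumps(action)


demo_tools = {"fetch_logs": lambda job_id: f"logs for {job_id}: OutOfMemoryError"}
result = run_controller(scripted_policy, demo_tools)
```

In a real deployment the policy would be an LLM call that receives the serialized trajectory as its prompt context; the scripted policy here only makes the control flow visible.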

Tools are abstracted so that agents never compose raw SQL, query logs directly, or need low-level API knowledge. Parameters are semantic, such as a job ID or an error class. This tool abstraction is critical for reliability and context control.
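
As a rough illustration of semantic tool parameters, the sketch below exposes a tool that takes only a job ID and an error class and hides the filtering logic. The `LogRecord` type, the in-memory `_LOGS` list (a stand-in for a real log backend), and the `get_error_logs` function are assumptions made for this example, not part of RCAgent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogRecord:
    job_id: str
    error_class: str
    message: str


# Hypothetical in-memory stand-in for a log backend.
_LOGS = [
    LogRecord("job-1", "OutOfMemoryError", "TaskManager heap exhausted"),
    LogRecord("job-1", "TimeoutException", "Checkpoint timed out"),
    LogRecord("job-2", "OutOfMemoryError", "Metaspace exhausted"),
]


def get_error_logs(job_id: str, error_class: str, limit: int = 5) -> List[str]:
    """Semantic tool: the caller names a job and an error class only.

    The query against the backend is built here, so the agent never
    writes SQL or touches the storage API.
    """
    matches = [r.message for r in _LOGS
               if r.job_id == job_id and r.error_class == error_class]
    return matches[:limit]


oom_messages = get_error_logs("job-1", "OutOfMemoryError")
```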

2. Key Technical Innovations

Tool-Augmentation and Domain Knowledge Integration

RCAgent's tool suite comprises:

Information-gathering tools for filtered, high-saliency evidence extraction.
LLM Expert Agents for recursive, domain-specialized analysis (e.g., multi-hop code analytics, log event clustering).

Each expert tool encapsulates embedded domain and enterprise knowledge—supporting recursive reasoning (e.g., code dependency tracing), robust log parsing (embedding-based clustering, RAG over log partitions), and evidence summarization.

Context and Observation Management (OBSK)

Context control is realized via the "Observation Snapshot Key" (OBSK): only the leading part of each tool observation is injected into the agent's prompt, while the remainder is replaced by a hashed snapshot key and stored in a key-value backend. When the agent re-invokes a related tool or analysis that references such a key, the environment retrieves the full content. This lets the framework handle cloud-scale logs and telemetry without exceeding LLM context limits or pushing large volumes of redundant data into the prompt history.
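
A minimal sketch of the snapshot-key idea follows, assuming a character-based cutoff, a truncated SHA-256 digest as the key, and a plain dictionary as the key-value backend. The `ObservationStore` class, the `[OBSK:...]` marker format, and the 12-character key length are illustrative choices, not details taken from RCAgent.

```python
import hashlib
from typing import Dict


class ObservationStore:
    """Keeps full observations out of the prompt, addressable by a hash key."""

    def __init__(self, head_chars: int = 200) -> None:
        self.head_chars = head_chars
        self._store: Dict[str, str] = {}  # stand-in for a key-value backend

    def snapshot(self, observation: str) -> str:
        """Return the text to put in the prompt for this observation."""
        if len(observation) <= self.head_chars:
            return observation  # short enough to include in full
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:12]
        self._store[key] = observation
        head = observation[: self.head_chars]
        return f"{head}... [OBSK:{key}]"

    def resolve(self, key: str) -> str:
        """Fetch the full observation when a later tool call references the key."""
        return self._store[key]


store = ObservationStore(head_chars=10)
full_log = "ERROR line " * 20
prompt_text = store.snapshot(full_log)
snapshot_key = prompt_text.split("[OBSK:")[1].rstrip("]")
```

Because the key is derived from the content, storing the same observation twice yields the same key, which keeps the backend free of duplicates in this sketch.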

Output Stabilization (JsonRegen and Error Correction)

To counter unreliable outputs from privacy-preserving, locally-hosted LLMs:

JsonRegen ensures that every tool invocation and RCA report is formatted as valid JSON—invalid outputs are remediated by round-trip conversion through YAML with strict error masking and automatic retries.
Action error handling adds explicit error signals to the environment feedback for problems such as duplicate calls to stateless tools, insufficient parameterization, and accidental premature termination.
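
The two mechanisms can be sketched roughly as below. This sketch does not reproduce the YAML round trip; it shows only a validate-and-retry loop around a hypothetical `regenerate` callable (standing in for an LLM repair call), plus an action check covering duplicate calls and missing parameters. All function names and message formats are assumptions.

```python
import json
from typing import Callable, Dict, List, Optional, Set


def parse_with_retries(raw: str,
                       regenerate: Callable[[str, str], str],
                       max_retries: int = 3) -> Optional[dict]:
    """Parse raw as a JSON object, asking regenerate to repair it on failure."""
    text = raw
    for attempt in range(max_retries + 1):
        try:
            parsed = json.loads(text)
            if isinstance(parsed, dict):
                return parsed
            error = "top-level value must be a JSON object"
        except json.JSONDecodeError as exc:
            error = str(exc)
        if attempt < max_retries:
            text = regenerate(text, error)  # hypothetical repair step
    return None


def validate_action(action: dict,
                    seen_calls: Set[str],
                    required: Dict[str, List[str]]) -> Optional[str]:
    """Return an error message for the environment feedback, or None if valid."""
    name = action.get("tool")
    args = action.get("args", {})
    missing = [p for p in required.get(name, []) if p not in args]
    if missing:
        return f"error: missing parameters {missing} for tool {name!r}"
    signature = json.dumps({"tool": name, "args": args}, sort_keys=True)
    if signature in seen_calls:
        return f"error: duplicate call to stateless tool {name!r}"
    seen_calls.add(signature)
    return None


def fix_quotes(text: str, error: str) -> str:
    """Toy repair function for the demo: swap single quotes for double quotes."""
    return text.replace("'", '"')


repaired = parse_with_retries("{'tool': 'fetch_logs'}", fix_quotes)
```

Returning the error as a string rather than raising lets the environment append it to the trajectory as an observation, so the agent can correct itself on the next step.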

Self-Consistency for RCA Trajectories

RCAgent introduces trajectory-level self-consistency (TSC): instead of stochastically sampling entire agent trajectories (which is computationally costly and error-prone), the main diagnosis is run with deterministic (greedy) decoding for maximal stability. Multiple outputs are sampled only at the "finalize" step and then aggregated (e.g., via embedding vote or LLM-based summarization). The embedding-vote aggregation selects the answer $\arg\max_i \operatorname{Sim}\left(\mathbf{a}_i, \frac{1}{K}\sum_{j=1}^K \mathbf{a}_j\right)$, where $\mathbf{a}_i$ is the embedding of candidate $i$, $K$ is the number of sampled candidates, and $\operatorname{Sim}$ is a similarity function. This mechanism improves output robustness with little additional computation.
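
The embedding-vote selection can be sketched in a few lines of Python. The sketch assumes that $\operatorname{Sim}$ is cosine similarity and that candidate embeddings are already available as equal-length lists of floats; both are assumptions, since the text does not specify the similarity function or the embedding model.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_by_embedding_vote(embeddings: List[Sequence[float]]) -> int:
    """Return the index of the candidate closest to the mean embedding."""
    if not embeddings:
        raise ValueError("at least one candidate embedding is required")
    k = len(embeddings)
    dim = len(embeddings[0])
    # Mean of the K candidate embeddings, i.e. (1/K) * sum_j a_j.
    centroid = [sum(e[d] for e in embeddings) / k for d in range(dim)]
    scores = [cosine(e, centroid) for e in embeddings]
    # argmax_i Sim(a_i, centroid); ties resolve to the lowest index.
    return max(range(k), key=scores.__getitem__)


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
best_index = select_by_embedding_vote(candidates)
```

In this toy pool the second candidate is selected because it lies closest to the mean direction of the three embeddings, while the outlier `[0.0, 1.0]` scores lowest.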

3. Comparison to ReAct and Other Agentic Frameworks

RCAgent extends the ReAct paradigm by:

Decoupling action planning from heavy in-context learning, instead leveraging domain-specific tools and expert agents.
Providing resilient context and observation management (OBSK), which ReAct lacks and which is handled less systematically in agent frameworks built around flat action-observation chains.
Achieving substantially lower hallucination and invalid action rates: under identical cloud RCA scenarios, RCAgent's pass rate is 99.4% (vs. 86.3% for ReAct) and its invalid action rate is 7.9% (vs. 22.8%).

Unlike frameworks such as PentestMCP (Ezetta et al., 4 Oct 2025), which abstracts tools as MCP-backed RPC endpoints for agentic security workflows, RCAgent focuses on end-to-end RCA, emphasizing privacy-preserving LLM deployment, tool-augmented decision making, and cloud observability integration for industrial RCA use cases.

4. Experimental Results and Practical Integration

RCAgent demonstrates strong predictive performance and system stability:

On 161 real-world cloud RCA tasks, RCAgent with TSC achieves higher METEOR (16.49) and BLEURT (34.43) scores and higher human helpfulness ratings (2.92/5) than ReAct (METEOR 6.44, BLEURT 25.17, human 1.36/5).
RCAgent reports consistent superiority on root cause, solution, evidence, and responsibility prediction, as validated by both LLM-based and human evaluation.
In ablation studies, removing expert agents or context management features such as OBSK results in catastrophic accuracy drops and increased instability.
The system is deployed in Alibaba Cloud’s Flink platform, analyzing all previously intractable jobs (i.e., those not covered by existing SRE rules), with RCA outputs actively used for automated diagnosis and responsibility flagging.

Resource consumption scales linearly with data size, with performance remaining robust at production cloud scale.

5. Privacy, Scalability, and Deployment Considerations

The architecture is explicitly privacy-aware: RCAgent is designed around locally hosted models (e.g., Vicuna-13B-v1.5-16K, served via vLLM on a single GPU), with no external LLM API calls. All data, tools, and observation keys remain within enterprise boundaries.

Scalability is addressed through tool abstraction, horizontal agent design, and context key management. Experiments and production deployment indicate that performance and reliability hold at cloud production scale.

6. Limitations and Future Directions

While RCAgent brings measurable advances in agentic RCA, its reliance on high-quality domain-specific tools and expert agent engineering means that porting it to domains lacking such infrastructure requires substantial upfront investment. The deterministic focus for stability may limit coverage of alternative diagnosis paths. More nuanced reflection, long-term memory across jobs and incidents, and advanced interactive exploration techniques (e.g., human-in-the-loop guidance) are identified as promising avenues for future development of the framework.

7. Summary Table: RCAgent Core Components and Features

Component/Mechanism	Description	Impact
Controller Agent	Autonomous, tool-augmented planner/actor	Orchestrates diagnosis loop
Expert Agents	Domain-specific analytic LLM modules	Evidence analysis & summarization
OBSK	Snapshot-keyed context store	Scalable prompt & memory handling
JsonRegen	Output stabilization, error correction	High semantic & syntactic validity
Trajectory SC	Output robustness via sample aggregation	Boosts stability, reduces errors
Deployment Integration	Alibaba Cloud Flink RCA workflow	Real-world impact validation

RCAgent thus represents a technically mature, fully integrated autonomous agent RCA framework focused on reliability, privacy, and large-scale operability for production cloud environments (Wang et al., 2023).


References

Wang et al. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models (2023)

Ezetta et al. PentestMCP: A Toolkit for Agentic Penetration Testing (2025)
