
MicroRCA-Agent: Root Cause Analysis

Updated 8 December 2025
  • MicroRCA-Agent is an advanced root cause analysis system that integrates LLM agents with multimodal data processing for pinpointing faults in microservice architectures.
  • It processes logs, traces, and metrics through modular pipelines, employing techniques like dual anomaly detection and statistical filtering for effective RCA evidence synthesis.
  • Key methodologies include reinforcement learning-based pruning and blockchain-inspired agent voting, enhancing processing accuracy and interpretability in fault diagnosis.

MicroRCA-Agent denotes a class of advanced root cause analysis (RCA) systems designed for microservice architectures, integrating LLM agents, multimodal event fusion, and agentic or collaborative reasoning paradigms. These systems address the fault localization problem in complex, heterogeneous distributed systems with a focus on automation, interpretability, robust data integration, and the capacity to process logs, traces, and metrics in concert. This entry synthesizes methodologies and architectures derived from the MicroRCA-Agent framework (Tang et al., 19 Sep 2025), multi-agent recursion-of-thought approaches (Zhang et al., 28 Aug 2025), reinforcement learning and causal inference as in TraceDiag (Ding et al., 2023), and blockchain-inspired voting designs such as mABC (Zhang et al., 18 Apr 2024).

1. System Architecture and Multimodal Data Processing

MicroRCA-Agent architectures characteristically comprise modular pipelines with the explicit goal of transforming voluminous, disparate microservice observability signals into structured, actionable RCA outputs. In the MicroRCA-Agent pipeline, five primary modules are responsible for progressive data processing:

  1. Data Preprocessing Module: Aligns start/end times (nanosecond precision) from the incident description, localizes relevant files, and normalizes timestamps across log, trace, and metric streams (Tang et al., 19 Sep 2025).
  2. Log Fault Extraction Module: Utilizes a pre-trained Drain parser—optimized via hyperparameters and filtering pipeline—to compress logs into high-quality templates, followed by deduplication and service mapping.
  3. Trace Fault Detection Module: Implements dual anomaly detection (Isolation Forest on sliding-window span durations and status code validation) to generate structured anomaly tables.
  4. Metric Fault Summarization Module: Applies statistical symmetry ratio filtering to select unstable metrics (where $R_{\mathrm{sym}} < 0.05$ triggers exclusion) and executes a two-stage LLM summarization: first at the service/pod level, then across the full stack, including infrastructural topology.
  5. Multimodal RCA Module: Fuses log- and trace-derived structural evidence with LLM-generated full-stack metric summaries using cross-modal prompts, outputting a JSON object with “component,” “reason,” and “reasoning_trace” fields.
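
The constrained JSON contract of the Multimodal RCA Module lends itself to simple output enforcement. The sketch below illustrates the regex-extraction-and-retry pattern the paper describes; the `query_llm` callable and the exact validation logic are assumptions for illustration, not the authors' implementation:

```python
import json
import re

REQUIRED_FIELDS = {"component", "reason", "reasoning_trace"}

def extract_rca_json(llm_output: str):
    """Pull the first JSON object out of free-form LLM text and validate its fields."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if not match:
        return None
    try:
        candidate = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(candidate, dict) and REQUIRED_FIELDS.issubset(candidate):
        return candidate
    return None

def run_rca(prompt: str, query_llm, max_retries: int = 3) -> dict:
    """Query the LLM and retry until a well-formed RCA object is produced."""
    for _ in range(max_retries):
        result = extract_rca_json(query_llm(prompt))  # query_llm: placeholder LLM call
        if result is not None:
            return result
    raise RuntimeError("LLM did not return a valid RCA JSON object")
```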

Complementary agentic variants (e.g., RCLAgent, mABC) encapsulate these modules within separate agent roles—Data Agents, Thought Agents, and consensus orchestration entities—that iteratively reason, query, and validate candidate root causes across data modalities, with strict agent workflow control and loop-prevention (Zhang et al., 28 Aug 2025, Zhang et al., 18 Apr 2024).

2. Log and Trace Analysis Mechanisms

Log analysis is grounded in template-based structural compression:

  • Drain Parsers: Pre-trained on phase-specific error logs, Drain learns 156 distinct templates. During inference, new log events are tokenized, stripped of variable substrings, mapped to their best-matching template, and output as compacted “fault features” for downstream fusion (Tang et al., 19 Sep 2025).
  • Multistage Filters: The extraction pipeline sequentially applies file localization, windowing, error keyword matching, field projection, template matching, deduplication, frequency counting, and pod-to-service projection, as formalized in provided pseudocode.
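
A condensed Python rendering of this multistage filter is given below. The record schema (`timestamp`, `pod`, `message` fields), the keyword list, and the helper callables are illustrative assumptions rather than the formalized pseudocode from the paper:

```python
from collections import Counter

ERROR_KEYWORDS = ("error", "exception", "timeout", "failed")  # illustrative keyword set

def extract_log_fault_features(log_records, t_start, t_end, match_template, pod_to_service):
    """Condense raw log records into deduplicated, frequency-ranked fault features.

    log_records: iterable of dicts with 'timestamp', 'pod', 'message' fields (assumed schema).
    match_template: callable mapping a raw message to its closest Drain template.
    pod_to_service: dict projecting pod names onto their owning service.
    """
    # 1) Windowing: keep only records inside the incident window.
    windowed = (r for r in log_records if t_start <= r["timestamp"] <= t_end)

    # 2) Error keyword matching: cheap lexical pre-filter before template matching.
    erroneous = (r for r in windowed
                 if any(k in r["message"].lower() for k in ERROR_KEYWORDS))

    # 3) Field projection + template matching + pod-to-service projection.
    features = Counter()
    for r in erroneous:
        template = match_template(r["message"])            # Drain template lookup
        service = pod_to_service.get(r["pod"], r["pod"])   # fall back to pod name
        features[(service, template)] += 1                 # deduplication + frequency counting

    # 4) Return features ranked by frequency for downstream fusion.
    return features.most_common()
```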

Trace analysis employs a dual-channel anomaly identification:

  • Isolation Forest: For parent_pod–child_pod–operation trios, spans flagged by the decision-function threshold are labeled anomalous ($y = -1$), and anomaly counts are accumulated over each 30 s window.
  • Status-Code Validation: Identifies outlier traces where the status code deviates from the expected (status.code ≠ 0), and groups anomalies for synthesis in subsequent multimodal LLM prompts.

Both anomaly streams independently contribute top-N ranked anomalies to the RCA evidence corpus and remain separate until LLM-based inference (Tang et al., 19 Sep 2025).
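
The following sketch shows one way to realize the dual-channel detection with scikit-learn's IsolationForest. The span schema, contamination rate, and minimum group size are assumptions, and the paper's exact decision-function thresholding is not reproduced:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_trace_anomalies(spans, window_ns=30_000_000_000, contamination=0.05):
    """Dual-channel trace anomaly detection over (parent_pod, child_pod, operation) groups.

    spans: list of dicts with 'parent_pod', 'child_pod', 'operation',
           'start_time' (ns), 'duration' (ns), 'status_code' (assumed schema).
    Returns per-group counts of duration anomalies and status-code violations.
    """
    groups = {}
    for s in spans:
        key = (s["parent_pod"], s["child_pod"], s["operation"])
        groups.setdefault(key, []).append(s)

    results = {}
    for key, group in groups.items():
        durations = np.array([[s["duration"]] for s in group])
        # Channel 1: Isolation Forest on span durations; fit_predict() returns -1 for anomalies.
        if len(group) >= 10:  # need enough samples for a meaningful fit
            labels = IsolationForest(contamination=contamination,
                                     random_state=0).fit_predict(durations)
        else:
            labels = np.ones(len(group))
        # Accumulate anomaly counts per 30 s window.
        window_counts = {}
        for s, y in zip(group, labels):
            if y == -1:
                w = s["start_time"] // window_ns
                window_counts[w] = window_counts.get(w, 0) + 1
        # Channel 2: status-code validation (status.code != 0).
        status_violations = sum(1 for s in group if s["status_code"] != 0)
        results[key] = {"duration_anomalies": window_counts,
                        "status_violations": status_violations}
    return results
```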

3. Statistical Filtering and Metric Summarization

To filter stable metrics and reduce inference overhead, the agent computes the symmetry ratio:

$$R_{\mathrm{sym}} = \frac{\lvert M_{\mathrm{fault}} - M_{\mathrm{normal}}\rvert}{\frac{M_{\mathrm{fault}} + M_{\mathrm{normal}}}{2} + \varepsilon}$$

Metrics with $R_{\mathrm{sym}} < 0.05$ are dropped as stable, focusing LLM attention on impactful signals (a filtering sketch appears after the list below). Summarization proceeds in two passes:

  1. Service/Pod-Level: Aggregates APM and TiDB metrics across the time window and feeds JSON summaries to the LLM for phenomenon description.
  2. Full-Stack Cascade: Adds infrastructural and pod-node topology metrics, prompting LLMs to deliver a ∼2k-word phenomenon summary, as required for downstream cross-modal reasoning (Tang et al., 19 Sep 2025).
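
Returning to the symmetry-ratio filter above, a minimal sketch follows; treating $M_{\mathrm{fault}}$ and $M_{\mathrm{normal}}$ as window means is an assumption about the aggregation step:

```python
EPSILON = 1e-9  # small constant to avoid division by zero

def symmetry_ratio(m_fault: float, m_normal: float, eps: float = EPSILON) -> float:
    """R_sym = |M_fault - M_normal| / ((M_fault + M_normal)/2 + eps)."""
    return abs(m_fault - m_normal) / ((m_fault + m_normal) / 2.0 + eps)

def filter_unstable_metrics(metrics, threshold=0.05):
    """Keep only metrics whose fault-window mean diverges from the normal-window mean.

    metrics: dict mapping metric name -> (mean over fault window, mean over normal window).
    """
    return {name: (mf, mn) for name, (mf, mn) in metrics.items()
            if symmetry_ratio(mf, mn) >= threshold}

# Example: only the CPU metric survives the R_sym >= 0.05 cut.
sample = {"cpu_usage": (0.92, 0.41), "memory_usage": (0.55, 0.548)}
print(filter_unstable_metrics(sample))  # {'cpu_usage': (0.92, 0.41)}
```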

Agent-based systems such as RCLAgent extend this approach to dynamic, agent-coordinated metric querying, where an Intermodal Agent infers which metrics (by entity and window) are relevant, applies n-sigma anomaly filtering, and avoids metric overload in LLM context windows (Zhang et al., 28 Aug 2025).
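
A compact illustration of n-sigma filtering is shown below; computing the baseline from a normal window and flagging fault-window points against it is one plausible reading, not necessarily RCLAgent's exact procedure:

```python
import numpy as np

def n_sigma_anomalies(fault_values, normal_values, n: float = 3.0):
    """Flag fault-window points deviating from the normal-window baseline by > n sigma."""
    fault = np.asarray(fault_values, dtype=float)
    baseline = np.asarray(normal_values, dtype=float)
    mu, sigma = baseline.mean(), baseline.std()
    if sigma == 0:
        sigma = 1e-9  # guard against a perfectly constant baseline
    return np.abs(fault - mu) > n * sigma

# Example: baseline latencies vs. a fault window containing a spike.
normal = [12, 13, 11, 12, 14, 13, 12, 11]
fault = [13, 12, 95, 14]
print(n_sigma_anomalies(fault, normal))  # [False False  True False]
```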

4. Reasoning, Consensus, and Root Cause Inference

The inference core of a MicroRCA-Agent system takes one of several forms:

  • LLM-Driven Structured Prompting: A one-shot prompt aggregates all extracted features, instructing the LLM to issue a constrained JSON result. Output enforcement includes regular expression post-processing and retries on failure. The reasoning trace is explicitly requested, supporting interpretability for SREs (Tang et al., 19 Sep 2025).
  • Multi-Agent Recursion-of-Thought: For RCLAgent, a central coordinator manages initial reasoning, critical reflection (recursively drilling into deeper trace/metric evidence), and a final consolidation phase. Agents issue reasoning instructions at each recursion depth, with filtered data responses guiding candidate expansion or backtracking. Outputs are finally formatted into standardized root-cause fields (Zhang et al., 28 Aug 2025).
  • Blockchain-Inspired Agent Voting (mABC): Seven specialized agents collaborate via a standardized workflow and decentralized voting. Agents accumulate contribution and expertise weights, with proposals accepted or re-queried subject to weighted consensus. Voting parameters ($\alpha = 0.5$, $\beta = 0.5$) ensure participation and support thresholds are met before advancing, mitigating hallucinations and ensuring transparent consensus (see the table below; a voting sketch follows it) (Zhang et al., 18 Apr 2024):
| Agent | Function | Key Responsibilities |
|---|---|---|
| 𝒜₁: Alert | Receive/Prioritize | Select high-urgency alert |
| 𝒜₂: Scheduler | Decompose/Orchestrate | Subtask generation, control |
| 𝒜₃: Detective | Data Ingestion | Metric/log fetching |
| 𝒜₄: Dependency | Topology Inference | Call-graph analysis |
| 𝒜₅: Probability | Fault Localization | Failure probability scoring |
| 𝒜₆: Fault Map | Graph Update | Fault web construction |
| 𝒜₇: Solution | Plan Synthesis | RCA + resolution generation |
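
The voting sketch below illustrates one plausible interpretation of the weighted-consensus step, reading $\alpha$ as a participation threshold and $\beta$ as a support threshold over agent weights; the actual mABC semantics may differ:

```python
def weighted_consensus(votes, weights, alpha=0.5, beta=0.5):
    """Decide whether a proposed root cause passes blockchain-inspired weighted voting.

    votes:   dict agent_id -> True (support) / False (oppose) / None (abstain).
    weights: dict agent_id -> accumulated contribution/expertise weight.
    alpha:   minimum fraction of total weight that must participate.
    beta:    minimum fraction of participating weight that must support.
    This is an illustrative reading of the mABC thresholds, not the authors' code.
    """
    total_weight = sum(weights.values())
    participating = {a: v for a, v in votes.items() if v is not None}
    participating_weight = sum(weights[a] for a in participating)
    if total_weight == 0 or participating_weight / total_weight < alpha:
        return "re-query"  # insufficient participation: ask the agents again
    support_weight = sum(weights[a] for a, v in participating.items() if v)
    return "accept" if support_weight / participating_weight >= beta else "reject"

# Example: four agents with unequal expertise weights vote on a candidate root cause.
votes = {"detective": True, "dependency": True, "probability": False, "solution": None}
weights = {"detective": 2.0, "dependency": 1.0, "probability": 1.0, "solution": 1.0}
print(weighted_consensus(votes, weights))  # 'accept' (participation 0.8, support 0.75)
```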

5. Reinforcement Learning, Causal Analysis, and Explainability

Alternative MicroRCA-Agent realizations use automated graph pruning and causal inference:

  • Reinforcement Learning Pruning (TraceDiag): Uses a filtering tree, optimized via Proximal Policy Optimization, to iteratively prune irrelevant nodes from the dependency graph based on interpretable feature tests (e.g., latency percentiles, call statistics). MDP constraints and policy regularization encourage compact, interpretable pruning strategies (Ding et al., 2023).
  • Causal-Based RCA: After pruning, a structural causal model (SCM) is built for the subgraph; interventions (via do-calculus) estimate the counterfactual impact on the frontend node's performance. Node contributions are ranked by Shapley value or Average Treatment Effect (ATE), identifying root causes most responsible for anomalous system-wide behavior.
  • Explainability: Filtering trees and causal attributions are exportable, and can be visualized in dashboard or tabular form, supporting operational transparency (Ding et al., 2023).
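
As a toy illustration of the intervention-and-ranking step, the sketch below fits a linear surrogate for the frontend node's structural equation and ranks candidates by the estimated effect of resetting each to its normal-period mean; TraceDiag's actual SCM construction and Shapley/ATE estimation are considerably more involved:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rank_by_ate(candidate_metrics, frontend_metric, normal_means):
    """Rank candidate nodes by the estimated effect of 'repairing' them on the frontend metric.

    candidate_metrics: dict node -> metric values during the fault window.
    frontend_metric:   frontend latency values over the same window.
    normal_means:      dict node -> metric mean during the normal (pre-fault) window.
    A crude linear surrogate for the SCM; do-interventions set one node to its normal mean.
    """
    nodes = list(candidate_metrics)
    X = np.column_stack([candidate_metrics[n] for n in nodes])
    y = np.asarray(frontend_metric, dtype=float)
    model = LinearRegression().fit(X, y)  # surrogate structural equation for the frontend node

    baseline = model.predict(X)
    scores = {}
    for i, node in enumerate(nodes):
        X_do = X.copy()
        X_do[:, i] = normal_means[node]                                 # do(node := normal level)
        scores[node] = float(np.mean(baseline - model.predict(X_do)))   # estimated ATE
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: two candidate pods; pod-a's elevated CPU explains most of the frontend latency.
rng = np.random.default_rng(0)
cpu_a = 0.8 + 0.1 * rng.random(50)   # anomalous during the fault window
cpu_b = 0.3 + 0.05 * rng.random(50)  # roughly at its normal level
frontend = 100 + 400 * cpu_a + 20 * cpu_b + rng.normal(0, 1, 50)
print(rank_by_ate({"pod-a": cpu_a, "pod-b": cpu_b}, frontend,
                  {"pod-a": 0.3, "pod-b": 0.3}))
```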

6. Evaluation, Ablation Results, and Benchmarking

MicroRCA-Agent systems are evaluated on real and synthetic challenge benchmarks.

  • MicroRCA-Agent (Tang et al., 19 Sep 2025): On AIOps challenge data (phaseone/phasetwo split), the three-modal configuration (logs, traces, metrics) achieves a final score of 50.71. Ablation shows metrics yield the strongest single-modal performance, while log+metric fusion delivers the highest dual-modal score (51.27). Empirically, trace data adds value primarily in complex call-chain fault scenarios.
  • RCLAgent (Zhang et al., 28 Aug 2025): On AIOPS 2022, Recall@1 spans 64.34–90.24% across six subsets, outperforming the mABC baseline (≈62.5%) by +15.6%. Ablation shows the “Critical Reflection” phase offers a +13.3% recall boost.
  • mABC (Zhang et al., 18 Apr 2024): On Train-Ticket and AIOps, Root Cause Accuracy (RA) and Path Accuracy (PA) average 69.3 and 60.4, respectively; ablation confirms that the Agent Workflow, multi-agent collaboration, and blockchain-inspired voting are all critical for peak accuracy.
  • TraceDiag (Ding et al., 2023): RL pruning achieves a mean 98% graph reduction with a 93% root-cause hit rate. The full pipeline reaches PR@Avg = 0.834 and RankScore = 0.818 on Microsoft Exchange; end-to-end latency per incident is typically under one minute.

7. Practical Considerations and Integration

MicroRCA-Agent approaches are engineered for scalability and operational deployment:

  • Data Preprocessing: Timestamp normalization and windowed filtering facilitate robust file and event localization, supporting high-volume ingestion (see the sketch after this list).
  • LLM and Agent Integration: LLM APIs are wrapped with prompt templates and failure/retry logic. Blockchain-inspired voting or recursive agent workflows orchestrate evidence curation, workflow idempotency, and hallucination mitigation.
  • Adaptivity: Reinforcement learning policies and anomaly detection thresholds are periodically fine-tuned on new incident data to account for system drift.
  • Extensibility: Modular agent composition enables extension, e.g., plugging in log/topology/metric anomaly detectors, expanding agent roles, or integrating with SRE dashboards via API.
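
A minimal example of the timestamp normalization and windowed filtering mentioned above, assuming epoch-nanosecond timestamps in a pandas DataFrame (the column names are illustrative):

```python
import pandas as pd

def localize_window(df, start_ns, end_ns):
    """Normalize nanosecond epoch timestamps and keep only events inside the incident window."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ns", utc=True)
    start = pd.to_datetime(start_ns, unit="ns", utc=True)
    end = pd.to_datetime(end_ns, unit="ns", utc=True)
    return df[(df["timestamp"] >= start) & (df["timestamp"] <= end)]

# Example: two events, only the first falls inside the window.
events = pd.DataFrame({"timestamp": [1_700_000_000_000_000_000, 1_700_000_900_000_000_000],
                       "message": ["connection refused", "healthy heartbeat"]})
print(localize_window(events, 1_699_999_999_000_000_000, 1_700_000_500_000_000_000))
```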

Limitations are primarily tied to LLM reliability, case-base quality, and the rigidity of a fixed agent set. A plausible implication is that continual learning and dynamic agent injection are promising directions for future research in this field (Zhang et al., 18 Apr 2024).


Key references:

  • “MicroRCA-Agent: Microservice Root Cause Analysis Method Based on LLM Agents” (Tang et al., 19 Sep 2025)
  • “Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought” (Zhang et al., 28 Aug 2025)
  • “TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems” (Ding et al., 2023)
  • “mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture” (Zhang et al., 18 Apr 2024)
