
Root Cause Analysis (RCA)

Updated 7 February 2026
  • Root Cause Analysis (RCA) is a systematic approach that identifies underlying causes of system failures in complex, interdependent infrastructures using data-driven techniques.
  • Recent advancements in RCA employ automated causal graph learning and multi-modal data integration to improve fault localization and reduce mean time to repair.
  • Benchmarking tools like LEMMA-RCA provide realistic, multi-domain datasets that validate RCA methodologies and drive improvements in reliability and performance.

Root Cause Analysis (RCA) is the process of identifying the fundamental drivers of system failures or performance degradations within complex, interdependent infrastructures. In contemporary environments—ranging from microservice-based IT stacks to operational technology (OT) systems—RCA is indispensable for minimizing repair times, controlling risk, enabling high reliability, and supporting safety-critical operations. Data-driven RCA, leveraging time-series metrics, logs, and automated causal graph learning, has superseded manual or ad hoc investigation, particularly given the massive data volumes, intricate causal dependencies, and the challenge of fault propagation across system layers. Innovation in RCA is closely tied to the availability of realistic, multi-modal, multi-domain benchmarks, with recent efforts such as LEMMA-RCA specifically addressing these requirements by providing a large-scale, open-source dataset covering both IT and OT domains (Zheng et al., 2024).

1. Conceptual Foundations and Challenges

Data-driven RCA observes entity-specific metrics (e.g., CPU, memory), logs, and system-level KPIs over time. When an anomaly is detected in the KPI stream, the task is to localize the top-K entities (machines, containers, sensors, etc.) whose behavior most plausibly caused the observed failure. Because it directly reduces mean time to repair and limits impact, RCA supports service reliability, particularly in domains where rapid and accurate fault localization is vital (e.g., e-commerce, water treatment).
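The detect-then-localize workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not one of the benchmarked methods: the anomaly detector is a plain z-score against a warm-up baseline, and entities are ranked by absolute Pearson correlation with the KPI, which is only a naive stand-in for the causal graph learning that actual RCA methods perform. All names (`first_anomaly`, `rank_entities`, the `pod-*` entities) and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def first_anomaly(kpi, warmup=5, z_thresh=3.0):
    """Return the first index where the KPI deviates from the warm-up
    baseline by more than z_thresh standard deviations, else None."""
    baseline = kpi[:warmup]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1e-9  # guard against a flat baseline
    for t in range(warmup, len(kpi)):
        if abs(kpi[t] - mu) / sigma > z_thresh:
            return t
    return None

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def rank_entities(metrics, kpi, k=2):
    """Rank entities by |correlation| with the KPI and keep the top k.
    Correlation is NOT causation; this is a placeholder scoring rule."""
    scores = {name: abs(pearson(series, kpi)) for name, series in metrics.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (invented): a latency KPI and one CPU metric per entity.
kpi = [100, 101, 99, 100, 102, 101, 100, 180, 240, 260]
metrics = {
    "pod-a": [0.30, 0.31, 0.29, 0.30, 0.32, 0.31, 0.30, 0.85, 0.95, 0.97],
    "pod-b": [0.50, 0.52, 0.49, 0.51, 0.50, 0.48, 0.51, 0.50, 0.49, 0.52],
    "pod-c": [0.20] * 10,
}

onset = first_anomaly(kpi)
top = rank_entities(metrics, kpi, k=2) if onset is not None else []
print(onset, top)
```

In this toy setup, `pod-a` ranks first because its CPU series rises together with the latency KPI. A real method would replace the correlation score with causal structure learned from the data.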

The primary challenges in RCA include:

  • Scale and Heterogeneity: Modern systems produce vast quantities of diverse data (e.g., 10^8+ log events per fault), often spanning numerous modalities (metrics, free-text logs, topology) and domains (IT, OT).
  • Causal Complexity: Faults may propagate non-obviously across layers, creating intertwined causal pathways resistant to naïve analysis.
  • Evaluation and Benchmarking: The paucity of large, publicly available, multi-modal datasets with real ground-truth (as opposed to synthetic faults) has stymied the scientific development of advanced RCA methods.

2. Data Modalities and Domains in RCA Benchmarks

The LEMMA-RCA benchmark embodies a multi-domain, multi-modal approach to dataset design (Zheng et al., 2024). Its salient features include:

  • Multi-domain scope:
    • IT (Information Technology): Product Review platform (6 OpenShift nodes, 216 pods, 4 fault types), Cloud Computing platform (11 EC2 nodes, 6 fault types).
    • OT (Operational Technology): SWaT (51 sensors, 16 attacks), WADI (123 sensors/actuators, 9 attacks).
  • Data Modalities:
    • Time-series metrics sampled per entity at one-second granularity, including fundamental resource and application-specific measurements.
    • Raw event logs, preprocessed via template extraction (Drain) and transformed into high-resolution template frequencies, golden-signal error counts, and compressed TF–IDF embeddings.
    • System-level KPI streams: For OT, anomaly scores from SVDD or Isolation Forest; for IT, measured latency or error rate.
  • Fault Injection and Annotation:
    • All 35 scenarios involve realistic, induced faults (e.g., DDoS, cryptojacking, storage full, sensor attacks), each annotated with onset timestamp and ground-truth root cause entity or entities.
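The log preprocessing step listed above (template extraction followed by template frequencies) can be illustrated with a small sketch. It assumes a crude regex mask in place of Drain, which is a far more capable parser, and a 60-second bucket size chosen purely for illustration. The function names and the sample log lines are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Mask standalone numeric tokens, including dotted ones such as IP addresses.
_VARIABLE = re.compile(r"\b\d+(?:\.\d+)*\b")

def to_template(message):
    """Crude stand-in for a log parser such as Drain: replace numeric
    tokens with a wildcard so messages with the same shape collapse."""
    return _VARIABLE.sub("<*>", message)

def template_frequencies(events, bucket_seconds=60):
    """events: iterable of (unix_timestamp, message).
    Returns {template: Counter({bucket_start: count})}, i.e. one
    frequency time series per log template."""
    series = defaultdict(Counter)
    for ts, message in events:
        bucket = int(ts) // bucket_seconds * bucket_seconds
        series[to_template(message)][bucket] += 1
    return series

# Invented sample events for illustration only.
events = [
    (0, "Connection to 10.0.0.5 timed out after 30 s"),
    (12, "Connection to 10.0.0.7 timed out after 30 s"),
    (61, "Connection to 10.0.0.5 timed out after 45 s"),
    (65, "Disk usage at 91 percent"),
]

freqs = template_frequencies(events)
for template, counts in freqs.items():
    print(template, dict(counts))
```

Each resulting series can then be treated like any other per-entity metric, which is what lets log data feed into methods designed for time series.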

This structure enables comprehensive cross-domain and cross-modality benchmarking and ensures that methods validated on LEMMA-RCA are exposed to both informational heterogeneity and realistic failure processes absent from narrower, synthetic, or single-modality datasets.

3. Methodological Landscape: Algorithms and Evaluation Protocols

State-of-the-art RCA methods span both classic causal-graph approaches and multi-modal machine learning frameworks. The methods evaluated on LEMMA-RCA (Zheng et al., 2024) include:

  • Single-modal, offline causal graph learning:
    • PC, Dynotears, C-LSTM, GOLEM, REASON (entity-level causal structure inference using only metrics or logs).
  • Multi-modal extensions:
    • MULAN, NeZha (integrating metrics and logs).
  • Online learning methods:
    • NOTEARS*, GOLEM*, CORAL (incremental or streaming causal graph adaptation).

RCA evaluation protocols are based on top-K root-cause identification, quantified using:

| Metric | Definition | Significance |
|---|---|---|
| PR@K | Precision@K | Fraction of true causes in top-K predictions |
| MAP@K | Mean Average Precision@K | Average precision at ranks up to K |
| MRR | Mean Reciprocal Rank | Inverse rank of the first correct prediction (averaged) |
| P, R, F₁ | Precision, Recall, F₁-score | Classical IR metrics computed for top-K identified entities |

These metrics enable fine-grained comparison of methods in both absolute (precision-oriented) and rank-sensitive regimes.
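A short sketch of how the rank-based metrics might be computed follows. It assumes one convention common in the RCA literature: PR@K counts true causes among the top K and normalizes by min(K, number of true causes), MAP@K averages PR@j for j = 1..K, and MRR uses the rank of the first correct entity. The exact normalization used by a given benchmark may differ, so treat these definitions as assumptions. The fault cases and entity names are invented.

```python
def pr_at_k(ranked, truth, k):
    """Hits among the top-k predictions, normalized by min(k, |truth|)."""
    hits = sum(1 for entity in ranked[:k] if entity in truth)
    return hits / min(k, len(truth))

def map_at_k(ranked, truth, k):
    """Average of PR@j over j = 1..k for a single fault case."""
    return sum(pr_at_k(ranked, truth, j) for j in range(1, k + 1)) / k

def reciprocal_rank(ranked, truth):
    """1 / rank of the first correct entity, or 0.0 if none is found."""
    for rank, entity in enumerate(ranked, start=1):
        if entity in truth:
            return 1.0 / rank
    return 0.0

def evaluate(cases, k):
    """cases: list of (ranked_entities, set_of_true_causes).
    Returns each metric averaged over all fault cases."""
    n = len(cases)
    return {
        "PR@K": sum(pr_at_k(r, t, k) for r, t in cases) / n,
        "MAP@K": sum(map_at_k(r, t, k) for r, t in cases) / n,
        "MRR": sum(reciprocal_rank(r, t) for r, t in cases) / n,
    }

# Four invented fault cases: the true cause is ranked first in three
# of them and second in the remaining one.
cases = [
    (["a", "b", "c"], {"a"}),
    (["d", "e", "f"], {"d"}),
    (["g", "h", "i"], {"g"}),
    (["k", "j", "l"], {"j"}),
]

at_1 = evaluate(cases, k=1)
at_2 = evaluate(cases, k=2)
print(at_1)
print(at_2)
```

With these toy rankings, PR@1 is 0.75 and MRR is 0.875, while PR@2 reaches 1.0 because every true cause appears within the top two. This shows how MRR rewards a near miss that PR@1 scores as a complete failure.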

4. Benchmark Results and Quantitative Insights

Empirical evaluation on LEMMA-RCA demonstrates several key phenomena (Zheng et al., 2024):

  • Multi-Modality: Multi-modal models (e.g., MULAN) achieve perfect PR@1 on the Product Review platform, indicating synergistic gains over single-modal methods (e.g., REASON, which remains competitive using only metrics).
  • Metrics versus Logs: Metrics convey more diagnostic signal than logs, but log integration still improves overall performance.
  • Causal-graph Advantages: Methods modeling explicit interdependent causal networks (e.g., REASON) reliably outperform classical statistical-graph learners (PC, Dynotears, GOLEM), particularly in settings where edge dependencies are complex or rapidly shifting.
  • Online Methods: CORAL, which uses incremental disentangled causal graph learning, matches or surpasses the best offline methods when run in online mode, while delivering lower response times and adaptive filtering of noise.

Selected results (reformatted from Zheng et al., 2024):

| Model | PR@1 (Product Review) | PR@5 | PR@10 | MRR |
|---|---|---|---|---|
| Metric-REASON | 75.0% | 100% | 100% | 0.875 |
| Log-REASON | 0% | 50% | 75% | 0.216 |
| MulMOD-MULAN | 100.0% | 100% | 100% | 1.00 |
| MulMOD-C-LSTM | 50.0% | 75% | 75% | 0.592 |

Similar superiority of multi-modal and advanced causal methods was observed in OT tasks (e.g., SWaT, WADI), although the overall scores decrease due to the brevity and variability of faults in industrial settings—underscoring the heightened complexity and realism (and thus the discriminatory power) of the benchmark.

5. Open Challenges, Future Directions, and Implications

Analysis of failure cases and performance plateaus surfaces open research challenges and directs future RCA efforts (Zheng et al., 2024):

  • Online Multi-Modal Fusion: Current online approaches, even when high-performing, often handle only single modalities or lack end-to-end architectures that can jointly process streaming logs and metrics. Future methods should implement real-time, multi-modal inference frameworks capable of robust streaming data assimilation.
  • Robustness to Missing or Non-Stationary Data: Non-stationary time series prompt the exclusion of some entities, reducing real-world applicability. Techniques for integrating partially observed, evolving, or dynamically missing modalities are necessary.
  • Broader Domain and Modal Extensions: Extending benchmarks (and, by implication, algorithms) to new domains—such as cybersecurity or healthcare—and to additional modalities (network traces, configurations) is essential for generality.
  • Transfer Learning and Domain Generalization: The presence of both IT and OT data in LEMMA-RCA invites models which leverage domain-invariant representations to enable more robust root cause localization even as systems evolve.
  • Empirical Demands: High overall performance (e.g., MULAN’s perfect PR@1) on some IT cases suggests that LEMMA-RCA’s ground-truth labels are well aligned with the observable diagnostics. Nevertheless, the low scores on industrial scenarios show the value of realistic signal-to-noise characteristics and motivate continued improvement.

6. Benchmark-Driven Scientific Progress

The release of multi-modal, multi-domain, and publicly available datasets such as LEMMA-RCA is catalyzing rapid advances in RCA methodology, algorithmic benchmarking, and real-world deployment (Zheng et al., 2024). The dataset's scale, complexity, and diversity enforce rigorous validation and foster the development of methods that are not only academically novel but also empirically robust. The presence of strong, observable signals alongside challenging, low-visibility settings ensures that future RCA research can meaningfully benchmark progress, innovate toward online and cross-domain fusion, and advance the science of automated failure diagnosis by grounding claims in realistic data.


References:

  • LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis (Zheng et al., 2024)