Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment (1911.01225v2)

Published 1 Nov 2019 in cs.DC, cs.DB, cs.LG, and stat.ML

Abstract: Root cause analysis in a large-scale production environment is challenging due to the complexity of services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for understanding production issues. Another challenge in reviewing the logs for identifying issues is the scale - there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e. combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes in a large-scale infrastructure. We also present the setup and results from multiple production use cases in this paper.



Summary

The paper introduces a modified FP-Growth algorithm that uses divide-and-conquer strategies to efficiently mine significant item-sets for root cause analysis.
It employs a multi-step methodology combining real-time log de-duplication, one-hot encoding, and support-lift filtering to quickly identify actionable patterns.
The framework effectively detects hardware and software configuration failures, enabling near real-time RCA in complex service environments.

Overview of the Framework

A robust root cause analysis (RCA) tool is needed in complex, global-scale service environments, where hardware, software, and tooling logs are maintained separately and are hard to review jointly. With millions of entities, each described by hundreds of features, engineers struggle to identify the patterns associated with system failures. Traditional methods are often slow, hard to interpret, and do not scale to the data volumes such investigations involve.

Advancement in Pattern Analysis

Researchers from a notable technology company adapt frequent item-set mining techniques to root cause analysis on large structured logs. They start from Apriori, a classic algorithm for frequent item-set mining and association rule learning, which suffers from long runtimes and a burdensome candidate generation step. The paper then presents an adaptation of FP-Growth, a well-known and more efficient alternative to Apriori. The customized FP-Growth uses a divide-and-conquer strategy to speed up the identification of frequent item-sets: combinations of feature values that characterize significant groups of samples, such as hardware or software configurations associated with system failures.
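
To make the candidate generation cost concrete, here is a minimal sketch of level-wise (Apriori-style) frequent item-set mining in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `frequent_itemsets`, the `feature=value` item strings, and the toy log rows are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) mining; returns {frozenset: support}.

    transactions: list of sets of items (hypothetical 'feature=value' strings).
    min_support: minimum fraction of transactions that must contain an item-set.
    """
    n = len(transactions)
    # Level 1 candidates: every distinct item seen in the data.
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    result = {}
    k = 1
    while candidates:
        # Each level needs a full pass over the data per candidate,
        # which is the cost that FP-Growth is designed to avoid.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        result.update(level)
        # Join frequent k-sets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result


# Toy data (made up for illustration): one set of items per log row.
rows = [
    {"fw=1.2", "kernel=4.11", "status=fail"},
    {"fw=1.2", "kernel=4.16", "status=fail"},
    {"fw=1.2", "kernel=4.11", "status=fail"},
    {"fw=1.0", "kernel=4.11", "status=ok"},
    {"fw=1.0", "kernel=4.16", "status=ok"},
    {"fw=1.0", "kernel=4.11", "status=ok"},
]
found = frequent_itemsets(rows, min_support=0.5)
```

On this toy data the only frequent pairs are firmware 1.2 with the failure status and firmware 1.0 with the ok status. The number of candidates grows quickly with the number of distinct feature values, which is why the summary above describes candidate generation as burdensome.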

Scalable Root Cause Investigation

For effective RCA in large-scale systems, the researchers use a multi-step pipeline. First, they query and de-duplicate logs using Scuba, a real-time, in-memory database. The data is then one-hot encoded and passed to either Apriori or an optimized version of FP-Growth for pattern mining. Once the frequent patterns are identified, post-processing filters them by support and lift so that only significant, actionable patterns are reported. Support measures how many samples an item-set covers; lift measures how much more often the target failure occurs together with the item-set than would be expected if the two were independent. Lift therefore homes in on patterns that are specific to the failures rather than values that are simply common across the whole dataset.
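
The pipeline described above can be sketched end to end in a few lines of Python. This is a rough illustration, not the paper's code: it assumes the standard association-rule definitions, where support of X with failure Y is the fraction of rows containing both and lift is supp(X and Y) / (supp(X) * supp(Y)). It also enumerates item-sets by brute force instead of using Apriori or FP-Growth. The helper names (`dedupe`, `one_hot`, `rank_itemsets`), the thresholds, and the toy log rows are hypothetical.

```python
from itertools import combinations


def dedupe(rows):
    """Drop exact duplicate log rows, preserving first-seen order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def one_hot(row, features):
    """Encode a row as the set of 'feature=value' items that are switched on."""
    return frozenset(f"{f}={row[f]}" for f in features)


def rank_itemsets(rows, features, is_failure,
                  min_support=0.2, min_lift=1.5, max_len=2):
    """Return (items, support, lift) tuples passing both thresholds."""
    rows = dedupe(rows)
    n = len(rows)
    encoded = [(one_hot(r, features), is_failure(r)) for r in rows]
    p_fail = sum(1 for _, failed in encoded if failed) / n if n else 0.0
    if p_fail == 0.0:
        return []  # no failures, so nothing to explain
    all_items = sorted({item for items, _ in encoded for item in items})
    results = []
    # Brute-force enumeration, for illustration only.
    for k in range(1, max_len + 1):
        for combo in combinations(all_items, k):
            x = frozenset(combo)
            n_x = sum(1 for items, _ in encoded if x <= items)
            if n_x == 0:
                continue
            n_xy = sum(1 for items, failed in encoded if failed and x <= items)
            support = n_xy / n
            lift = support / ((n_x / n) * p_fail)
            if support >= min_support and lift >= min_lift:
                results.append((sorted(x), support, lift))
    # Highest lift first; break ties in favor of higher support.
    return sorted(results, key=lambda r: (-r[2], -r[1]))


# Toy logs (made up for illustration); the second row is an exact duplicate.
logs = [
    {"host": "h1", "fw": "1.2", "kernel": "4.11", "status": "fail"},
    {"host": "h1", "fw": "1.2", "kernel": "4.11", "status": "fail"},
    {"host": "h2", "fw": "1.2", "kernel": "4.16", "status": "fail"},
    {"host": "h3", "fw": "1.0", "kernel": "4.11", "status": "ok"},
    {"host": "h4", "fw": "1.0", "kernel": "4.16", "status": "ok"},
]
ranked = rank_itemsets(logs, ["fw", "kernel"],
                       is_failure=lambda r: r["status"] == "fail")
```

In this toy example the kernel versions appear equally often in failing and healthy rows, so their lift is 1.0 and they are filtered out, while firmware 1.2 ranks first. A lift threshold above 1 is what separates values that are merely common from values that are concentrated in the failing group.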

Results and Applications

The framework's deployment in multiple production use cases demonstrates its efficacy. For instance, it can pinpoint server reboot failures associated with specific firmware versions, and it can surface anomalous hardware and software configurations in near real time. It has also identified problematic patterns in systems that manage SSH connections and service-to-service communication within large-scale infrastructure.

Conclusion and Future Directions

This paper establishes a fast, scalable approach to root cause analysis. It stands out for its near real-time investigation capabilities, interpretability, and ability to handle very large datasets, offering a pragmatic solution to the data analysis challenges of massive service environments. Looking ahead, the approach could be extended to text-based reports and temporal data trends. Continued iteration and optimization, guided by the nature of the data and the needs of each analysis, can help such frameworks support more resilient systems.
