- The paper presents CEDLog, a framework that integrates distributed processing and continual learning to dynamically detect anomalies in system logs.
- It employs dual-model detection using MLP and GCN, with decision fusion enhancing precision and reducing false positives.
- The system incorporates human-in-the-loop validation and Elastic Weight Consolidation to maintain high accuracy over evolving log data.
This paper (2504.02322) presents CEDLog, a practical framework for distributed log-driven anomaly detection designed to address challenges in post-detection validation, scalability, and maintenance. CEDLog integrates distributed computing with continual learning to provide an efficient and evolving system for identifying security threats from system logs.
The core architecture of CEDLog involves several key components orchestrated to process logs from ingestion to anomaly alerting:
- Log Integration and Transformation: The system uses the ELK stack (Elasticsearch, Logstash, Kibana) to collect, transform, and initially process logs from various sources. This standardizes logs into a structured format, typically JSON.
- Log Parsing: A dedicated log parser converts the transformed, semi-structured logs into a tabular format. The paper utilizes the Drain algorithm [He2017DrainAO] for its efficiency (O(D) complexity, where D is the tree depth) and accuracy in extracting log components such as Datetime, Context, EventTemplate, RecordID, Log_Level, and ParameterList. The Drain algorithm works by clustering log messages based on matching tokens and wildcards (<*>) using a similarity score:
$$\text{Similarity}(l_i, T_k) = \frac{1}{m}\sum_{j=1}^{m} \mathbb{I}\left(t_j = w_j \ \text{or}\ t_j = \langle * \rangle\right)$$
where $l_i$ is a log message, $T_k$ is a template, $m$ is the number of tokens, and $\mathbb{I}$ is the indicator function.
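The token-matching step is simple to illustrate. Below is a minimal sketch of just the similarity computation from the formula above, not the full Drain parse tree; the function name and the example log line are illustrative assumptions.

```python
# Minimal sketch of Drain's template-similarity score (illustrative only;
# the real Drain parser also maintains a fixed-depth prefix tree).
WILDCARD = "<*>"

def template_similarity(log_tokens, template_tokens):
    """Fraction of token positions that match, treating a wildcard
    on either side as a match."""
    if len(log_tokens) != len(template_tokens):
        return 0.0  # Drain only compares messages of equal token length
    matches = sum(
        1 for t, w in zip(log_tokens, template_tokens)
        if t == w or w == WILDCARD or t == WILDCARD
    )
    return matches / len(log_tokens)

# Example: a new log line scored against an existing template
log = "Receiving block blk_3587 src: /10.0.0.1".split()
template = "Receiving block <*> src: <*>".split()
print(template_similarity(log, template))  # 1.0 -> assign to this template
```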
- Feature Engineering: Relevant features are extracted from the parsed tabular data. The paper uses Random Forest [8074494] to determine feature importance scores $I$ for anomaly detection, selecting the set of features $C$ whose importance exceeds a threshold $\tau$:
$$C = \{\, c \mid c \in \text{columns}(L),\ I(c) > \tau \,\}$$
A weight dictionary $W$ stores these importance scores for later use in decision fusion (a small sketch of this selection step follows the list below). Feature engineering creates two distinct input types for the detection models:
- A feature matrix X containing all selected features except ParameterList.
- A graph representation where EventId is the root node and variables from ParameterList are leaf nodes, connected by edges. spaCy and pre-trained word embeddings (like GloVe) are used to embed token values into numeric vectors for the graph nodes, enhancing semantic understanding.
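As noted above, a small sketch of the importance-based feature selection is shown here. It uses scikit-learn's RandomForestClassifier; the column names, the threshold value, and the weight-dictionary layout are assumptions for illustration, not details fixed by the paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_features(parsed_logs: pd.DataFrame, labels, tau: float = 0.05):
    """Keep columns whose Random Forest importance exceeds tau and
    return them together with a weight dictionary W for decision fusion."""
    # Assumes categorical fields have already been encoded numerically.
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(parsed_logs, labels)

    importance = dict(zip(parsed_logs.columns, rf.feature_importances_))
    selected = [c for c, score in importance.items() if score > tau]
    weights = {c: importance[c] for c in selected}  # the paper's W
    return selected, weights

# Hypothetical usage on encoded log columns such as
# EventId, Log_Level, Context, ParameterList, ...
# selected, W = select_features(encoded_logs, y)
```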
- Scalable Processing: To handle large volumes of logs, feature generation is parallelized using Dask's map_partitions function [Rocklin2015DaskPC]. This distributes parsing and graph construction tasks across multiple CPU cores.
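A minimal sketch of this parallelized feature generation with Dask is shown below; the input path, partition count, and the per-partition feature function are illustrative assumptions.

```python
import dask.dataframe as dd
import pandas as pd

def build_features(partition: pd.DataFrame) -> pd.DataFrame:
    """Per-partition feature generation (parsing already done upstream).
    Placeholder logic; the real pipeline also builds per-event graphs."""
    out = partition.copy()
    out["param_count"] = out["ParameterList"].str.len()  # example feature
    return out

# Load parsed logs and spread the work across CPU cores.
logs = dd.read_parquet("parsed_logs.parquet")      # path is hypothetical
logs = logs.repartition(npartitions=8)             # roughly one task per core
features = logs.map_partitions(build_features)
result = features.compute(scheduler="processes")   # run partitions in parallel
```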
- Dual-Model Detection: CEDLog employs two distinct models:
- Multi-layer Perceptron (MLP): This model processes the feature matrix X (excluding ParameterList). It is designed to capture anomalies related to specific event templates, log levels, and other core components. The MLP architecture includes two hidden layers with batch normalization.
- Graph Convolutional Network (GCN): This model operates on the graph representation derived from EventId and ParameterList. It focuses on detecting anomalies within variable values (e.g., extreme numeric values, unusual IP addresses). The GCN has two graph convolutional layers, a mean pooling layer, and two fully connected layers.
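A compact sketch of the two detectors is given below, written with PyTorch and PyTorch Geometric as an assumed implementation choice; hidden sizes and activations are illustrative, and only the stated layer structure (two hidden layers with batch normalization for the MLP; two graph convolutions, mean pooling, and two fully connected layers for the GCN) follows the paper.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class MLPDetector(nn.Module):
    """Two hidden layers with batch normalization, binary output."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x):
        return self.net(x)

class GCNDetector(nn.Module):
    """Two graph convolutions, mean pooling, two fully connected layers."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        h = global_mean_pool(h, batch)  # one vector per log-event graph
        return self.fc2(torch.relu(self.fc1(h)))
```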
- Decision Fusion: The predictions from the MLP and GCN are combined using a weighted fusion approach. The fusion weights ($s_0$, $s_1$) are derived from the feature importance scores $W$ calculated earlier, reflecting the relative importance of the features used by each model:
$$s_0 = \frac{W[C \setminus \{\text{ParaList}\}]}{W_{\text{sum}}}, \qquad s_1 = \frac{W[\text{ParaList}]}{W_{\text{sum}}}$$
The final anomaly score $F$ is calculated by weighting the probability estimates ($P(p_1 = 0)$ for the MLP, $P(p_2 = 0)$ for the GCN, where 0 indicates 'normal'):
$$F = P(p_1 = 0) \cdot s_0 + P(p_2 = 0) \cdot s_1$$
A threshold (0.5) is applied to F to make the final binary anomaly decision. This fusion enhances robustness by considering different aspects of the log data.
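The fusion step itself reduces to a few lines. The sketch below assumes both models expose softmax probabilities for the 'normal' class and that the weight dictionary W has been built as above; the key name "ParameterList" and the example values are assumptions.

```python
import numpy as np

def fuse(p_mlp_normal, p_gcn_normal, weights, threshold=0.5):
    """Weighted fusion of the two detectors' 'normal'-class probabilities.
    weights is the importance dictionary W from feature selection."""
    w_param = weights.get("ParameterList", 0.0)   # weight of the GCN's input
    w_rest = sum(v for k, v in weights.items() if k != "ParameterList")
    w_sum = w_param + w_rest

    s0, s1 = w_rest / w_sum, w_param / w_sum
    score = p_mlp_normal * s0 + p_gcn_normal * s1  # F in the paper
    return np.where(score >= threshold, 0, 1)      # 0 = normal, 1 = anomaly

# Example with hypothetical weights and per-sample probabilities
W = {"EventTemplate": 0.5, "Log_Level": 0.2, "ParameterList": 0.3}
print(fuse(np.array([0.9, 0.2]), np.array([0.8, 0.4]), W))  # -> [0 1]
```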
- Human-in-the-Loop (HITL) Continual Learning: Anomaly predictions are sent to an analyst for validation. If a False Positive (FP) is identified, this feedback is used to update the models. To prevent catastrophic forgetting (degradation of performance on previously learned data when training on new data), Elastic Weight Consolidation (EWC) [Kutalev2021StabilizingEW] is integrated into the update pipeline. EWC adds a penalty term to the standard loss function $\mathcal{L}(\theta)$ based on the Fisher Information Matrix $F_i$, which measures parameter importance for previous tasks:
$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\sum_i F_i (\theta_i - \theta_i^*)^2$$
where $\theta$ are the current parameters, $\theta^*$ are the parameters from the previous task, and $\lambda$ controls the regularization strength.
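A minimal sketch of the EWC penalty in PyTorch is shown below, using a diagonal Fisher estimate computed from the previous task's data; the data-loader handling and the λ value are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader):
    """Estimate the diagonal Fisher information on the previous task's data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_loss(model, base_loss, fisher, old_params, lam=100.0):
    """Standard loss plus the EWC quadratic penalty around old parameters."""
    penalty = sum(
        (fisher[n] * (p - old_params[n]) ** 2).sum()
        for n, p in model.named_parameters()
    )
    return base_loss + (lam / 2.0) * penalty
```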
- Scalable Deployment: The entire workflow is orchestrated using Apache Airflow, defined as a Directed Acyclic Graph (DAG) of tasks (operators). For scalability and availability, Airflow is deployed in a distributed manner using the Celery Executor. This allows tasks to be queued and executed by multiple Celery workers across different nodes (Figure 1). The system uses Docker for packaging and deploying the offline training and online inference components.
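For orientation, a skeletal Airflow DAG for the online inference path is sketched below; the task names, schedule, and callables are hypothetical, and the Celery Executor itself is configured in airflow.cfg (executor = CeleryExecutor) rather than in the DAG file.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical placeholders for the pipeline stages described above.
from cedlog_pipeline import parse_logs, build_features, run_detection, send_alerts

with DAG(
    dag_id="cedlog_online_inference",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",          # assumed cadence
    catchup=False,
) as dag:
    parse = PythonOperator(task_id="parse_logs", python_callable=parse_logs)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    detect = PythonOperator(task_id="run_detection", python_callable=run_detection)
    alert = PythonOperator(task_id="send_alerts", python_callable=send_alerts)

    # Task dependencies form the DAG; Celery workers pick up each queued task.
    parse >> features >> detect >> alert
```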
- Alerting: Detected anomalies, after potential HITL validation, are forwarded to ElasticAlert for generating notifications to clients.
The evaluation on BGL and HDFS datasets demonstrates the practical benefits of CEDLog. The decision fusion mechanism shows improved precision and significantly lower False Positive Rates (FPR) compared to using MLP or GCN alone (Table 1). The integration of EWC in continual learning helps maintain high accuracy and precision across different tasks, although the paper notes a slight increase in FPR on a new task compared to the initial task (Table 2), suggesting a potential trade-off in balancing knowledge retention.
Implementation Considerations:
- Data Format: CEDLog relies on structured/semi-structured logs processed by ELK. Real-world deployment requires setting up robust ELK pipelines tailored to specific log sources.
- Log Parsing Quality: The accuracy of anomaly detection heavily depends on the quality of the log parsing (Drain). Configuring and maintaining the Drain parser for diverse and evolving log formats is crucial.
- Feature Engineering: The choice of features and the Random Forest importance calculation are critical. Adapting this process to different log sources and potential threats is necessary. Handling ParameterList variations and defining semantically meaningful nodes for the GCN requires domain expertise.
- Model Selection and Training: While the paper uses MLP and GCN, tuning hyperparameters and potentially exploring other models might be needed for specific datasets. Training requires labeled data, which can be a challenge in real-world anomaly detection.
- Distributed Infrastructure: Deploying and managing a distributed Airflow setup with Celery workers and Dask requires expertise in these technologies. Monitoring and scaling these components based on log volume and processing load is essential.
- Human-in-the-Loop Integration: Designing a user-friendly interface for analysts to validate anomalies and provide feedback is vital for the continual learning loop. The process for incorporating this feedback and triggering retraining needs to be automated.
- EWC Configuration: Selecting the appropriate λ hyperparameter for EWC requires experimentation to balance stability on old tasks with learning on new tasks. Computing and storing the Fisher Information Matrix can also have computational costs.
- Resource Requirements: The paper emphasizes CPU efficiency, but parallel processing of large datasets still requires substantial CPU cores and memory, depending on scale. Disk space is needed for storing logs, parsed data, and model checkpoints.
- Security: As a cybersecurity tool, the CEDLog framework itself needs to be deployed securely, including access control to the Airflow environment, ELK stack, and model artifacts.
CEDLog provides a blueprint for building a practical, scalable, and adaptive log anomaly detection system using a combination of proven distributed computing tools and modern machine learning techniques like continual learning and graph neural networks. The modular design, leveraging Airflow DAGs and Docker, facilitates deployment and maintenance. Future work mentioned includes integrating Kubernetes for better multi-client support and conducting comprehensive attack simulations for rigorous evaluation.