
Reframing Threat Detection: Inside esINSIDER (1904.03584v1)

Published 7 Apr 2019 in cs.CR and cs.LG

Abstract: We describe the motivation and design for esINSIDER, an automated tool that detects potential persistent and insider threats in a network. esINSIDER aggregates clues from log data, over extended time periods, and proposes a small number of cases for human experts to review. The proposed cases package together related information so the analyst can see a bigger picture of what is happening, and their evidence includes internal network activity resembling reconnaissance and data collection. The core ideas are to 1) detect fundamental campaign behaviors by following data movements over extended time periods, 2) link together behaviors associated with different meta-goals, and 3) use machine learning to understand what activities are expected and consistent for each individual network. We call this approach campaign analytics because it focuses on the threat actor's campaign goals and the intrinsic steps to achieve them. Linking different campaign behaviors (internal reconnaissance, collection, exfiltration) reduces false positives from business-as-usual activities and creates opportunities to detect threats before a large exfiltration occurs. Machine learning makes it practical to deploy this approach by reducing the amount of tuning needed.

Citations (3)


Summary

  • The paper introduces campaign analytics as its main contribution, shifting focus from individual alerts to adversary lifecycle behaviors.
  • It details an automated framework using Apache Spark and ML models to analyze multi-source log data and reveal low-and-slow threat activities.
  • The approach enhances interpretability and scalability by integrating reason codes and association graphs to prioritize high-risk hosts across multiple campaign stages.

This paper, "Reframing Threat Detection: Inside esINSIDER" (1904.03584), argues that traditional security approaches focusing on initial access, signatures, and individual alerts are insufficient against sophisticated persistent and insider threats. These threats operate covertly within a network for extended periods, and their specific tools and techniques change rapidly. The paper proposes a new approach called campaign analytics, which focuses on detecting the fundamental, unavoidable behaviors threat actors must perform to achieve their goals: internal reconnaissance, data collection/staging, and data exfiltration.

esINSIDER is presented as an automated tool implementing this campaign analytics approach. Its core design principles are:

  1. Focus on Campaign Goals: Instead of discrete alerts, it looks for patterns of activity aligning with the adversary campaign lifecycle (stages 3-5: internal reconnaissance, collection, and exfiltration).
  2. Automation, ML, and Interpretability: It uses automated processes and ML to handle the scale of data and adapt to environments, while providing transparency (reason codes) for analysts.
  3. Adapt to Changing Environments: It's designed to incorporate new data sources and continuously learn what's normal for a specific network, adapting to changes over time.
  4. Hard for Threat Actor to Avoid: It uses multiple data sources, focuses on aggregating activities over time (making "low-and-slow" harder to hide), and designs ML models to resist manipulation or grandfathering malicious behavior as normal.

How esINSIDER Works (Implementation Details):

esINSIDER is a software-only product that processes log data from a data lake (like HDFS or S3) using Apache Spark on a distributed compute cluster, enabling horizontal scalability for large organizations (handling terabytes of logs). The process runs daily and involves several key steps:

  1. Log Data to Monitoring Targets:
    • Raw log data (flow, DNS, proxy logs, etc.) are ingested and processed through a multi-step pipeline: parsing, standardization (crucially mapping dynamic IPs to stable machine names), transformation (labeling records, e.g., internal/external traffic, semantic port groups), and aggregation.
    • Aggregation occurs first daily, then combines daily aggregates into statistics over longer windows (weeks/months). This reduces data volume and helps detect cumulative "low-and-slow" activities.
    • Monitoring Targets: These are quantifiable statistics derived from aggregated logs that represent fundamental campaign activities (e.g., number of distinct internal IPs contacted, total bytes collected from internal sources, total bytes published externally). The paper highlights using a small number (fewer than 25) of these targets, focused on behaviors.
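As a rough illustration of the aggregation step, the sketch below rolls per-day, per-host aggregates into longer-window monitoring targets. The record fields and target names are hypothetical; the paper does not publish its schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily aggregates: (day, host, bytes_from_internal, distinct_internal_ips).
daily = [
    (date(2019, 4, 1), "host-a", 1_200_000, 14),
    (date(2019, 4, 2), "host-a", 900_000, 11),
    (date(2019, 4, 1), "host-b", 50_000, 2),
]

def window_targets(daily_rows):
    """Roll daily aggregates into per-host monitoring targets over a window.

    Summing over weeks or months is what surfaces cumulative "low-and-slow"
    activity that no single day's numbers would flag.
    """
    targets = defaultdict(lambda: {"bytes_collected": 0, "internal_ips_contacted": 0})
    for _day, host, nbytes, nips in daily_rows:
        targets[host]["bytes_collected"] += nbytes
        targets[host]["internal_ips_contacted"] += nips
    return dict(targets)

print(window_targets(daily)["host-a"])
# {'bytes_collected': 2100000, 'internal_ips_contacted': 25}
```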
  2. Using Machine Learning to Understand Normal:
    • For each monitoring target, an ML model (a "monitoring model") is trained daily on historical data to predict the expected value for each host.
    • By comparing the actual measured value to the model's prediction and its associated probability distribution, esINSIDER determines how "surprising" or anomalous an activity is.
    • Handling Dirty Data & Context: The ML is designed to avoid grandfathering existing malicious activity or benign outliers as normal. It uses shared models (not one per host), avoids unique host identifiers as inputs, requires minimum sample sizes for features, and specifically ignores irregular historical activity when using history as context.
    • Models use contextual inputs to make predictions more nuanced and accurate, reducing false positives. Context includes host attributes (CIDR block, security groups for peer grouping), symmetric byte flow volumes, historical activity patterns (if regular), and communication patterns with common destinations.
    • esINSIDER uses a custom non-parametric regression learning algorithm (detailed in the Appendix) that combines automated feature engineering with generalized linear models.
    • Interpretability: The models are transparent. They include "reason codes" that explain which contextual inputs most strongly influenced a prediction, helping analysts understand why an activity is considered anomalous.
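The paper does not publish the monitoring models' predictive distribution, but the "how surprising is this value" comparison can be sketched with a normal tail probability as a stand-in assumption:

```python
import math

def surprise(actual, predicted, sigma):
    """Score how surprising an observed value is under a normal predictive
    distribution N(predicted, sigma^2) -- an assumed stand-in for the
    paper's unspecified per-target distribution.

    Returns -log10 of the upper-tail probability, so larger means more anomalous.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (actual - predicted) / sigma
    tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(X >= actual)
    return -math.log10(max(tail, 1e-300))     # clamp to avoid log(0)

# A host publishing ~5 GB externally when ~0.1 GB was predicted scores far
# higher than a host within the expected range.
assert surprise(5e9, predicted=1e8, sigma=5e8) > surprise(2e8, predicted=1e8, sigma=5e8)
```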
  3. From Evidence to Cases:
    • Stage Risk Scores: Outputs from monitoring models are combined using "ComboModels" (hierarchical expert ensembles) to compute aggregate risk scores for each host for each campaign stage (Recon, Collection, Exfiltration).
    • Linking Across Stages: Devices are ranked based on a single score that integrates their stage risk scores. The geometric mean of the stage ranks, $\sqrt[3]{r_3 \, r_4 \, r_5}$, is used to prioritize devices showing high risk across multiple stages, as this is far less likely for legitimate activity than high risk in just one stage.
    • Linking Across Machines: Starting with high-ranked hosts (seeds), esINSIDER builds candidate cases by examining related machines in an "association graph." This graph links machines showing interesting relationships, primarily surprising data movements.
    • Case Generation: Candidate cases are filtered to keep only those exhibiting high risk activities for at least two campaign stages. Each final "case" packages the suspicious hosts, their risky activities, involved computers, and details, providing a narrative for human analysts to review.
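The cross-stage prioritization can be sketched directly from the geometric-mean formula (the hosts and ranks below are hypothetical; lower rank means higher per-stage risk):

```python
def combined_rank(r3, r4, r5):
    """Geometric mean of per-stage ranks for recon (r3), collection (r4),
    and exfiltration (r5). Because the ranks multiply, a host must look
    risky in every stage to earn a small (high-priority) combined rank.
    """
    return (r3 * r4 * r5) ** (1.0 / 3.0)

# Extreme in one stage vs. moderately risky in all three:
backup_server = combined_rank(1, 500, 500)   # top recon rank only -> ~63.0
stealthy_host = combined_rank(10, 12, 15)    # elevated everywhere  -> ~12.2
assert stealthy_host < backup_server  # the multi-stage host is prioritized
```

This is why a backup server that merely moves a lot of data (high risk in one stage) ranks below a host showing moderate recon, collection, and exfiltration signals together.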

Automated Learning Details (Appendix A):

The eSentire Learning Library (esLL) automates the ML model building. It takes labeled tabular data and uses a pipeline combining sophisticated data preprocessing ("feature engineering") with generalized linear models.

  • Preprocessing Pipeline: Data goes through dense table transforms (e.g., extracting features from timestamps, calculating rolling windows), encoding into a sparse feature matrix (using various strategies like Quantile Binning, Tree Encoding, One-Hot Encoding, Cluster String Encoding for different data types), and optional matrix transforms (like Interaction transforms for polynomial and interaction features).
  • Model Fitting: A linear model is fitted to the feature matrix using standard practices like ridge or LASSO regularization (tuned automatically) and standardization, typically using the L-BFGS optimization algorithm.
  • Automated Exploration: The complexity of choosing the best combination of transforms and encoders is managed by an automated exploration process. An "experiment planner" generates variations of the learning pipeline based on a user-provided blueprint (which can leave many details unspecified), and executors run these pipelines, evaluating the resulting models on tuning data. The planner iteratively refines experiments to find optimal configurations, automating what would traditionally be manual trial-and-error.
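A minimal sketch of the "feature engineering + generalized linear model" idea, using a single encoder (quantile binning). The function names and the per-bin closed form are illustrative, not esLL's actual API:

```python
import bisect

def quantile_bins(values, n_bins=4):
    """Cut points at empirical quantiles -- one flavor of quantile binning."""
    s = sorted(values)
    return [s[len(s) * k // n_bins] for k in range(1, n_bins)]

def one_hot_bin(value, cuts):
    """Sparse one-hot encoding of a numeric feature's bin index."""
    vec = [0.0] * (len(cuts) + 1)
    vec[bisect.bisect_right(cuts, value)] = 1.0
    return vec

def fit_bin_means(xs, ys, cuts, ridge=1e-6):
    """A ridge-regularized linear model over disjoint one-hot bin features
    reduces to per-bin means with a damped denominator -- a tiny
    'encode, then fit a GLM' pipeline.
    """
    sums = [0.0] * (len(cuts) + 1)
    counts = [0] * (len(cuts) + 1)
    for x, y in zip(xs, ys):
        i = bisect.bisect_right(cuts, x)
        sums[i] += y
        counts[i] += 1
    return [s / (c + ridge) for s, c in zip(sums, counts)]

xs = list(range(100))
ys = [x * x for x in xs]       # nonlinear target
cuts = quantile_bins(xs)       # [25, 50, 75]
weights = fit_bin_means(xs, ys, cuts)
assert weights[0] < weights[1] < weights[2] < weights[3]  # captures the trend
```

The point of the sketch: even a linear model captures nonlinear structure once the encoder has binned the inputs, which is why esLL can pair automated feature engineering with simple, interpretable GLMs.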

In summary, esINSIDER shifts the focus of threat detection from individual alerts and initial compromises to the observable, required steps of an adversary campaign. It leverages automated data processing and machine learning, trained specifically on the target network's data and incorporating context, to identify hosts exhibiting multiple campaign behaviors, packaging these correlated anomalies into high-value cases for analysts. This approach aims to be adaptable, transparent, and harder for sophisticated threat actors to evade compared to traditional methods.
