- The paper identifies significant flaws in the widely used KDD CUP 99 dataset, arguing it is unsuitable for modern anomaly-based network intrusion detection training.
- Using common machine learning classifiers and balancing techniques like SMOTE, the study shows that rebalancing and the more modern UNSW-NB15 improve performance on minority attack classes relative to KDD-99 and NSL-KDD.
- The authors recommend UNSW-NB15 as a superior alternative to KDD CUP 99 and NSL-KDD for training machine learning anomaly-based NIDS.
The paper "Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives" addresses the issue of using the KDD CUP 99 dataset as a benchmark for training machine learning models in Anomaly-based Network Intrusion Detection Systems (A-NIDS). The authors argue that KDD-99 has several weaknesses, including skewed response distribution, non-stationarity, and the failure to incorporate modern attacks, making it unsuitable for contemporary NIDS. They compare the performance of KDD-99 alternatives using common classification models: Neural Network, Support Vector Machine (SVM), Decision Tree, Random Forest, Naive Bayes, and K-Means. The paper explores the use of the Synthetic Minority Oversampling Technique (SMOTE) and random undersampling to create a balanced version of NSL-KDD, demonstrating that skewed target classes in KDD-99 and NSL-KDD impair the efficacy of classifiers on minority classes like U2R (User to Root) and R2L (Remote to Local). The authors also benchmark UNSW-NB15, a modern substitute for KDD-99, before and after SMOTE oversampling.
The paper begins by discussing the limitations of misuse-based NIDS (M-NIDS) and of A-NIDS, highlighting the advantage of machine learning-based A-NIDS in adapting to dynamic traffic patterns. The selection of a training dataset is identified as critical, with KDD-99 being a popular but flawed benchmark. The authors point out KDD-99's weaknesses, including its age, skewed targets, non-stationarity, pattern redundancy, and irrelevant features. They then introduce NSL-KDD as an improvement offering a more balanced resampling of KDD-99, but note that it still fails to eliminate the poor performance of classifiers on minority classes. The authors empirically demonstrate that NSL-SMOTE, a balanced dataset created through oversampling and undersampling, significantly improves performance over NSL-KDD on minority classes. They also benchmark UNSW-NB15, a modern and more balanced dataset, on standard machine learning classifiers and compare its performance to NB15-SMOTE, an oversampled version of UNSW-NB15.
The paper describes the characteristics of KDD-99, NSL-KDD, and UNSW-NB15.
Key aspects of KDD-99 include:
- It is derived from the 1998 DARPA intrusion detection program.
- It contains five classes of patterns: Normal, DoS (Denial of Service), U2R, R2L, and Probe.
- It suffers from high pattern redundancy: only 1,074,992 of the 4,898,431 records in the training dataset are unique.
- Each pattern has 41 features, but only 24 are relevant based on Mean Decrease Impurity analysis.
- The dataset is highly skewed, with 98.61% of the data belonging to either the Normal or DoS categories.
- It exhibits non-stationarity between the training and test datasets.
NSL-KDD improves upon KDD-99 by:
- Having fewer data points, all of which are unique.
- Including an undersampling of the Normal, DoS, and Probe classes.
- Having a more stationary sampling of KDD CUP 99.
UNSW-NB15 is presented as a modern alternative with:
- Traffic generated with the IXIA PerfectStorm tool in a simulation at the Australian Centre for Cyber Security (ACCS).
- Ten target classes: Normal, Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
- A train set containing 175,341 data points and a test set containing 82,332 data points, with no redundant data points.
- 49 features, reduced to 30 using Mean Decrease Impurity (see the sketch after this list).
- Lower skewness compared to KDD-99.
- Data stationarity is maintained between the training and test sets.
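Both feature reductions rely on Mean Decrease Impurity (MDI), the impurity-based importance score computed by tree ensembles. A minimal sketch of how such a ranking might be produced with scikit-learn follows; the helper name `select_by_mdi`, the forest hyperparameters, and the use of a top-`n_keep` cutoff are illustrative assumptions, not the authors' code.

```python
# Sketch of MDI-based feature selection (hypothetical helper, not the paper's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_by_mdi(X_train, y_train, n_keep):
    """Rank features by Mean Decrease Impurity and keep the top n_keep."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    # feature_importances_ holds the normalized MDI score of each feature
    ranked = np.argsort(forest.feature_importances_)[::-1]
    keep = ranked[:n_keep]
    return X_train[:, keep], keep

# e.g., reduce UNSW-NB15's 49 features to its 30 most important ones:
# X_reduced, kept_idx = select_by_mdi(X_train, y_train, n_keep=30)
```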
The authors outline their machine learning pipeline, comprising preprocessing, feature selection, training, and prediction stages. SMOTE and random undersampling, implemented with the Python library imbalanced-learn, are used to create the NSL-SMOTE and NB15-SMOTE datasets: SMOTE oversamples the minority classes of NSL-KDD and UNSW-NB15, and, for NSL-SMOTE, random undersampling then balances each class to 995 examples. The trained model is assessed on the feature-reduced test dataset.
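A minimal sketch of this resampling step, assuming `X_train` and `y_train` hold the encoded NSL-KDD features and labels, might look like the following; the target of 995 examples per class follows the paper's description of NSL-SMOTE.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority classes (e.g., U2R, R2L) with SMOTE ...
X_over, y_over = SMOTE(random_state=0).fit_resample(X_train, y_train)

# ... then randomly undersample every class down to 995 examples each.
target = {label: 995 for label in set(y_over)}
under = RandomUnderSampler(sampling_strategy=target, random_state=0)
X_bal, y_bal = under.fit_resample(X_over, y_over)
```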
The paper details the machine learning models implemented, including Naive Bayes, SVM with a Radial Basis Function kernel, Decision Tree, Random Forest, Neural Network, and K-Means using Majority Vote. Hyperparameters are tuned using K-fold Cross Validation with K=5.
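As an illustration, hyperparameter tuning with 5-fold cross-validation can be expressed with scikit-learn's `GridSearchCV`; the parameter grid below is an assumed example, not the authors' actual search space.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF-kernel SVM, one of the paper's six models; grid values are illustrative.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
    cv=5,                     # K-fold cross validation with K=5
    scoring="f1_weighted",    # weighted F1, the paper's headline metric
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
```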
Analysis metrics include Precision, Recall, F1-Score, Weighted F1-Score, and Null Error Rate (NER). The authors emphasize the importance of F1-score and Weighted F1-Score over accuracy due to the imbalanced nature of the datasets. The F1-score is calculated as:
$$\text{F1-Score} = \frac{2PR}{P + R}$$
- $P$ is Precision
- $R$ is Recall
The Weighted F1-Score is calculated as:
$$\text{Weighted F1-Score} = \frac{\sum_{i=1}^{K} \text{Support}_i \cdot \text{F1}_i}{\text{Total}}$$
- $\text{F1}_i$ is the F1-Score predicted for the $i$th target class
- $\text{Support}_i$ is the number of examples belonging to the $i$th class
- $K$ is the number of target classes and $\text{Total}$ is the total number of examples
The NER is calculated as:
$$\text{NER} = 1 - \frac{\text{Support}_m}{\text{Total}}$$
- $\text{Support}_m$ is the number of examples in the majority class
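A short sketch of how these metrics could be computed with scikit-learn and NumPy follows, assuming `y_test` and `y_pred` are the true and predicted labels on the feature-reduced test set:

```python
import numpy as np
from sklearn.metrics import f1_score

per_class_f1 = f1_score(y_test, y_pred, average=None)        # F1_i for each target class
weighted_f1 = f1_score(y_test, y_pred, average="weighted")   # support-weighted mean of F1_i

# Null Error Rate: the error of always predicting the majority class.
_, counts = np.unique(y_test, return_counts=True)
ner = 1 - counts.max() / counts.sum()
```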
The results section compares the performance of the classifiers on the different datasets. NSL-SMOTE shows significant improvements in F1-Score for the R2L and U2R classes compared to KDD-99 and NSL-KDD. The authors note a decrease in F1-Score for the DoS and Probe classes from KDD-99 to NSL-SMOTE, which they attribute to the inflated counts of DoS patterns in KDD-99.
For UNSW-NB15, the authors observe high performance on the Exploits, Generic, and Normal classes, which constitute over 73% of the training data. The Analysis, Backdoor, Shell Code, and Worms classes, which make up only 2.857% of the data, exhibit poor performance. The application of SMOTE improves the performance of these minority classes. When comparing binary versions of KDD-99, NSL-KDD, and UNSW-NB15, UNSW-NB15 equals or betters NSL-KDD on almost all learning models.
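The binary comparison collapses every attack class into a single positive label. A one-line sketch of such binarization, assuming string labels in which "normal" marks benign traffic (the exact label names differ across the three datasets), is:

```python
import numpy as np

# 0 = normal traffic, 1 = any attack class; the "normal" label name is assumed.
y_binary = np.where(y_train == "normal", 0, 1)
```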
In the discussion section, the authors question the effectiveness of techniques optimized on KDD-99, given its highly skewed distribution of patterns and its non-stationarity. They highlight the unsatisfactory minority-class F1-Scores and the way binarization obscures them. They verify that dataset imbalance contributes to the poor performance on KDD-99's minority classes and suggest that additional legitimate attack patterns might improve detection. Benchmarking UNSW-NB15, they find that SMOTE oversampling improves minority-class performance, especially for Analysis and Backdoor. Random Forest performs best, while Gaussian Naive Bayes is ineffectual. Binarizing the classes eliminates the problem of imbalance, and UNSW-NB15 equals or betters the Weighted F1-Score of NSL-KDD.
In conclusion, the authors advocate for the use of UNSW-NB15 as a satisfactory substitute for KDD CUP 99 and NSL-KDD for training machine learning anomaly-based NIDS. They suggest that future research could optimize performance by using alternative techniques across the machine learning pipeline, such as SMOTE variants, undersampling methods like EasyEnsemble and BalanceCascade, ensemble methods other than RandomForest, clustering techniques, unsupervised learning, and hybrid approaches.