Towards the Development of Realistic Botnet Dataset in the Internet of Things for Network Forensic Analytics: Bot-IoT Dataset (1811.00701v1)

Published 2 Nov 2018 in cs.CR

Abstract: The proliferation of IoT systems has seen them targeted by malicious third parties. To address this, realistic protection and investigation countermeasures need to be developed. Such countermeasures include network intrusion detection and network forensic systems. For that purpose, a well-structured and representative dataset is paramount for training and validating the credibility of the systems. Although there are several network datasets, in most cases not much information is given about the botnet scenarios that were used. This paper proposes a new dataset, Bot-IoT, which incorporates legitimate and simulated IoT network traffic, along with various types of attacks. We also present a realistic testbed environment for addressing the existing dataset drawbacks of capturing complete network information, accurate labeling, as well as recent and complex attack diversity. Finally, we evaluate the reliability of the Bot-IoT dataset using different statistical and machine learning methods for forensics purposes compared with the existing datasets. This work provides the baseline for allowing botnet identification across IoT-specific networks. The Bot-IoT dataset can be accessed at [1].

Authors (4)

Nickolaos Koroniotis (4 papers)
Nour Moustafa (23 papers)
Elena Sitnikova (5 papers)
Benjamin Turnbull (2 papers)

Citations (1,077)

Summary

The paper "Towards the Development of Realistic Botnet Dataset in the Internet of Things for Network Forensic Analytics: Bot-IoT Dataset," authored by Nickolaos Koroniotis, Nour Moustafa, Elena Sitnikova, and Benjamin Turnbull, presents the development, implementation, and evaluation of a novel dataset called Bot-IoT. This dataset aims to address multiple challenges associated with existing network datasets by integrating both legitimate and simulated IoT network traffic alongside various types of botnet-based attacks.

Dataset Structure and Features

The Bot-IoT dataset was created using a realistic testbed environment configured within the Research Cyber Range lab at the University of New South Wales Canberra. The testbed has three components: network platforms, simulated IoT services, and feature extraction and forensic analytics tools. The network setup combines virtual machines (Kali Linux, Ubuntu, Windows) and normal network services (DNS, email, FTP, HTTP, SSH) with simulated IoT traffic generated using the Node-RED tool. The simulated IoT devices communicate over the Message Queuing Telemetry Transport (MQTT) protocol.

Data was captured using the Argus tool, which logged network flows into MySQL tables. The primary features extracted from the network traffic consist of packet counts, byte counts, and various connection state metrics. New features were generated based on a sliding window of 100 connections, enhancing the capability to identify and separate normal and malicious traffic.
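The windowed features described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `windowed_bytes_per_src` and the `(src_ip, n_bytes)` record layout are assumptions made for the example. Only the window size of 100 connections comes from the text.

```python
from collections import Counter, deque


def windowed_bytes_per_src(flows, window=100):
    """Return, for each flow, the total bytes sent by the same source IP
    within the most recent `window` flows (current flow included).

    `flows` is an iterable of (src_ip, n_bytes) tuples in capture order.
    The record layout is hypothetical, chosen only for this sketch.
    """
    recent = deque()    # flows currently inside the window
    totals = Counter()  # running byte total per source IP in the window
    result = []
    for src_ip, n_bytes in flows:
        recent.append((src_ip, n_bytes))
        totals[src_ip] += n_bytes
        # Evict the oldest flow once the window is exceeded.
        if len(recent) > window:
            old_ip, old_bytes = recent.popleft()
            totals[old_ip] -= old_bytes
        result.append(totals[src_ip])
    return result


# Small example with a window of 2 instead of 100.
example = [("10.0.0.1", 10), ("10.0.0.2", 5), ("10.0.0.1", 20), ("10.0.0.1", 1)]
print(windowed_bytes_per_src(example, window=2))
```

A source that floods the network dominates the window, so its per-source total grows far beyond that of a device sending periodic telemetry, which is the kind of separation such features are meant to expose.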

Statistical Analysis Techniques

To evaluate and refine the features, statistical measures including Pearson Correlation Coefficient and Shannon Joint Entropy were employed. These measures aid in identifying redundant features and ensuring a robust data subset for training machine learning models. The results indicate that features with high entropy and low average correlation are optimal, leading to the extraction of a subset comprising the 10-best features from the original dataset.
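As a rough sketch of the two measures, the functions below compute the Pearson correlation coefficient and the Shannon joint entropy for a pair of feature columns. The function names are illustrative, and the entropy function assumes the features are already discrete; continuous features would need to be binned first, a step this sketch does not cover.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant feature has no defined correlation; report 0
    return cov / (dev_x * dev_y)


def joint_entropy(x, y):
    """Shannon joint entropy H(X, Y) in bits for two discrete sequences."""
    n = len(x)
    counts = Counter(zip(x, y))  # frequency of each observed (x, y) pair
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(pearson([1, 2, 3], [2, 4, 6]))
print(joint_entropy([0, 0, 1, 1], [0, 1, 0, 1]))
```

Averaging each feature's absolute correlation and joint entropy against all other features gives one score per feature for each measure, which can then be ranked to pick a subset.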

Machine Learning Evaluation

The paper employs machine and deep learning techniques to validate the dataset, using classifiers such as Support Vector Machine (SVM), Recurrent Neural Network (RNN), and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). The models were trained and evaluated using Accuracy, Precision, Recall, and Fall-out (the false positive rate) across various subsets of the dataset.
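The four metrics follow directly from the binary confusion matrix. The sketch below uses the standard definitions, with attack traffic labeled 1 and normal traffic labeled 0; the function name `binary_metrics` and this label encoding are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and fall-out for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fallout": ratio(fp, fp + tn),  # false positive rate
    }


print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Because attack records heavily outnumber normal records in this kind of dataset, accuracy alone can look high even when normal traffic is misclassified, which is why fall-out is worth reporting alongside it.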

SVM

The SVM model trained on the full-featured dataset achieved an accuracy of 0.99988742 and a recall of 1, indicating that it separates normal from attack traffic reliably. The 10-best feature version reached a precision of 1 but slightly lower accuracy, possibly because of overfitting.

LSTM and RNN

LSTM and RNN models, which are suited to the temporal nature of network traffic, classified the traffic effectively. Specifically, the LSTM model trained on the 10-best features reached an accuracy of 0.9974194 and a precision of 0.99991036, albeit with slightly longer training times than the other models. The RNN model performed similarly, with a marginally lower fall-out than the LSTM.

The discriminative power of the LSTM and RNN models was further scrutinized through detailed subcategory evaluations on different attack types like DDoS, DoS, and information theft attacks. While most attack types exhibited high classification metrics, data exfiltration scenarios posed challenges, revealing areas for further dataset refinement and model optimization.

Implications and Future Work

The Bot-IoT dataset holds substantial promise for advancing network forensic analytics, specifically within the context of IoT environments. Practically, this dataset addresses key limitations in existing datasets such as incomplete network information, inaccurate labeling, and limited attack diversity. Theoretically, it provides a foundation for developing sophisticated forensic models and intrusion detection systems capable of mitigating evolving cyber threats.

Future advancements could explore deeper optimization of machine and deep learning models to further enhance specificity and reduce false positive rates. Additionally, developing techniques that incorporate real-time streaming data from live IoT networks could provide more dynamic and adaptive intrusion detection capabilities.

Overall, the Bot-IoT dataset is a significant contribution to the network security domain, offering both a comprehensive benchmarking tool and a step towards more secure and resilient IoT ecosystems.
