Towards the Development of Realistic Botnet Dataset in the Internet of Things for Network Forensic Analytics: Bot-IoT Dataset
The paper "Towards the Development of Realistic Botnet Dataset in the Internet of Things for Network Forensic Analytics: Bot-IoT Dataset," authored by Nickolaos Koroniotis, Nour Moustafa, Elena Sitnikova, and Benjamin Turnbull, presents the development, implementation, and evaluation of a novel dataset called Bot-IoT. This dataset aims to address multiple challenges associated with existing network datasets by integrating both legitimate and simulated IoT network traffic alongside various types of botnet-based attacks.
Dataset Structure and Features
The Bot-IoT dataset was created using a realistic testbed environment configured within the Research Cyber Range lab at the University of New South Wales Canberra. The testbed encompasses multiple components: network platforms, simulated IoT services, and feature extraction and forensic analytics tools. The network setup combines virtual machines (Kali Linux, Ubuntu, Windows) and normal network services (DNS, email, FTP, HTTP, SSH) with simulated IoT traffic generated using the Node-RED tool. Importantly, the dataset uses the Message Queuing Telemetry Transport (MQTT) protocol to emulate IoT communications.
Network flows were extracted with the Argus tool and stored in MySQL tables. The primary features extracted from the network traffic consist of packet counts, byte counts, and various connection state metrics. Additional features were then generated over a sliding window of 100 connections to help separate normal from malicious traffic.
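The windowed features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names (`src_ip`, `bytes`, `pkts`), the output names, and the choice of a trailing window that includes the current flow are all assumptions made for the example.

```python
from collections import defaultdict, deque


def window_features(flows, window=100):
    """Annotate each flow with per-source totals over the trailing window.

    `flows` is an iterable of dicts with the hypothetical keys
    'src_ip', 'bytes', and 'pkts'. The current flow counts toward
    its own window.
    """
    recent = deque()                  # flows currently inside the window
    bytes_by_src = defaultdict(int)   # running byte total per source IP
    pkts_by_src = defaultdict(int)    # running packet total per source IP
    out = []
    for flow in flows:
        recent.append(flow)
        bytes_by_src[flow["src_ip"]] += flow["bytes"]
        pkts_by_src[flow["src_ip"]] += flow["pkts"]
        if len(recent) > window:
            # Drop the oldest flow so the window never exceeds `window` flows.
            old = recent.popleft()
            bytes_by_src[old["src_ip"]] -= old["bytes"]
            pkts_by_src[old["src_ip"]] -= old["pkts"]
        out.append({
            **flow,
            "win_bytes_src": bytes_by_src[flow["src_ip"]],
            "win_pkts_src": pkts_by_src[flow["src_ip"]],
        })
    return out
```

Keeping running totals makes each update constant-time, so the cost stays linear in the number of flows regardless of the window size.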
Statistical Analysis Techniques
To evaluate and refine the features, two statistical measures were employed: the Pearson Correlation Coefficient and Shannon Joint Entropy. These measures help identify redundant features and produce a robust data subset for training machine learning models. Features with high entropy and low average correlation were preferred, leading to a subset comprising the 10-best features from the original dataset.
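The two measures named above can be illustrated with a short sketch. The function names, the use of the absolute correlation when averaging, and the requirement that values be discrete (continuous features would first need binning) are assumptions of this example rather than details taken from the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def joint_entropy(x, y):
    """Shannon joint entropy H(X, Y) in bits for discrete-valued sequences."""
    n = len(x)
    counts = Counter(zip(x, y))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def average_scores(features):
    """Map each feature name to (average |r|, average joint entropy).

    `features` is a dict of name -> list of values; each feature is
    scored against every other feature.
    """
    names = list(features)
    scores = {}
    for a in names:
        others = [b for b in names if b != a]
        avg_corr = sum(abs(pearson(features[a], features[b])) for b in others) / len(others)
        avg_ent = sum(joint_entropy(features[a], features[b]) for b in others) / len(others)
        scores[a] = (avg_corr, avg_ent)
    return scores
```

Under this sketch, a candidate subset would favour features whose first score is low and whose second score is high, mirroring the selection criterion described in the text.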
Machine Learning Evaluation
The paper uses machine and deep learning techniques to validate the dataset, with classifiers such as Support Vector Machine (SVM), Recurrent Neural Network (RNN), and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). The models were trained and evaluated across various subsets of the dataset using Accuracy, Precision, Recall, and Fall-out (the false positive rate).
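For reference, the four metrics can be computed from the confusion-matrix counts as in the sketch below. The function name and the convention that label 1 marks attack traffic are assumptions of this example; ratios with an empty denominator are reported as 0.0 here by choice.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and fall-out for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # attack correctly flagged
            else:
                fp += 1   # normal flow flagged as attack
        else:
            if truth == positive:
                fn += 1   # attack missed
            else:
                tn += 1   # normal flow correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fall_out": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Fall-out is worth reading alongside recall: a classifier that labels every flow as an attack reaches a recall of 1 but also a fall-out of 1.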
SVM
The SVM model trained on the full-feature dataset achieved high accuracy (0.99988742) and recall (1), indicating that it rarely missed attack traffic. Notably, the 10-best feature version reached perfect precision (1) but slightly lower accuracy, which may point to overfitting.
LSTM and RNN
LSTM and RNN models, which are suited to the temporal nature of network traffic, classified traffic effectively. Specifically, the LSTM model trained on the 10-best features reached an accuracy of 0.9974194 and a precision of 0.99991036, at the cost of slightly longer training times than the other models. The RNN model performed similarly, with a marginally lower fall-out than the LSTM.
The discriminative power of the LSTM and RNN models was further scrutinized through detailed subcategory evaluations on different attack types like DDoS, DoS, and information theft attacks. While most attack types exhibited high classification metrics, data exfiltration scenarios posed challenges, revealing areas for further dataset refinement and model optimization.
Implications and Future Work
The Bot-IoT dataset holds substantial promise for advancing network forensic analytics, specifically within the context of IoT environments. Practically, this dataset addresses key limitations in existing datasets such as incomplete network information, inaccurate labeling, and limited attack diversity. Theoretically, it provides a foundation for developing sophisticated forensic models and intrusion detection systems capable of mitigating evolving cyber threats.
Future work could explore deeper optimization of machine and deep learning models to improve specificity and reduce false positive rates. Additionally, techniques that incorporate real-time streaming data from live IoT networks could provide more dynamic and adaptive intrusion detection capabilities.
Overall, the Bot-IoT dataset is a significant contribution to the network security domain, offering both a comprehensive benchmarking tool and a step towards more secure and resilient IoT ecosystems.