
Bot-IoT Dataset for IoT Intrusion Detection

Updated 20 November 2025
  • Bot-IoT dataset is a large-scale, labeled collection of IoT network flows used to benchmark intrusion detection techniques.
  • It incorporates time-synchronized ground truth and a detailed label taxonomy covering diverse attack scenarios such as DDoS, DoS, reconnaissance, and more.
  • It supports comprehensive evaluation pipelines with advanced feature engineering, SMOTE-based imbalance mitigation, and both classical ML and deep learning models.

The Bot-IoT dataset is a large-scale, labeled collection of network flow records explicitly designed to benchmark intrusion detection and forensic analytics for IoT networks. Created by Koroniotis et al. at UNSW Canberra Cyber, it simulates a diverse IoT environment incorporating both benign and a wide array of malicious activities, with realistic protocol mixes and attack methodologies. The dataset addresses major gaps in prior corpora: providing complete flow capture, time-synchronized ground truth, and a detailed label taxonomy spanning multiple botnet scenarios. It offers more than 72 million records annotated at the flow level, with per-record features supporting statistical, behavioral, and deep learning-based detection paradigms (Koroniotis et al., 2018).

1. Dataset Generation and Experimentation Strategy

The physical testbed was constructed using VMware ESXi infrastructure, a pfSense virtual firewall, and a tap host for packet mirroring. Node-RED simulated periodic benign IoT traffic for devices such as weather stations, thermostats, smart lights, and refrigerators. Service VMs provided common protocols (DNS, MQTT, HTTP, FTP, SSH, mail), and four parallel attacker VMs (Kali Linux) used tools including Nmap, Hping3, Xprobe2, Metasploit, and GoldenEye to orchestrate attacks (Koroniotis et al., 2018). Packet captures were converted to flow records using Argus, yielding 34–46 core statistical features per conversation, with subsequent aggregation for window-based statistics.

Attack coverage is deliberately broad: DoS and DDoS (TCP, UDP, HTTP), reconnaissance (OS fingerprinting, service scanning), information theft (keylogging, data exfiltration), and spam/AUB. Attack traffic dominates by four orders of magnitude over benign flows, ensuring that any detection method must explicitly manage severe class skew (Koroniotis et al., 2018, Abushwereb et al., 2022, Atuhurra et al., 27 Mar 2024). Automated labeling is enforced via synchronized firewall ACL schedules, attacker/victim IP tracking, and pcap time-window analysis, providing high-fidelity ground truth.

2. Feature Space, Label Taxonomy, and Record Structure

Each record is a flow summarized over bidirectional packet sequences, with features that include source/destination IP and port, protocol details, temporal (start/end time, duration), volumetric (total packets, total bytes, directional byte and packet counts), rates (packets/bytes per second), TCP flags, and state variables (Koroniotis et al., 2018, Atuhurra et al., 27 Mar 2024). Aggregated window-based measures (e.g., total bytes per source IP over 100 flows, connection rates by protocol and IP) augment fine-grained statistical modeling.
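The per-record structure described above can be sketched as a small data type. The field names here (saddr, dur, pkts, sbytes, ...) follow common Bot-IoT CSV conventions but are assumptions that should be verified against the actual release:

```python
from dataclasses import dataclass

# Illustrative per-flow record; field names follow common Bot-IoT CSV
# conventions and should be checked against the actual release.
@dataclass
class Flow:
    saddr: str          # source IP
    sport: int          # source port
    daddr: str          # destination IP
    dport: int          # destination port
    proto: str          # protocol, e.g. "tcp", "udp"
    dur: float          # flow duration (seconds)
    pkts: int           # total packets (both directions)
    bytes: int          # total bytes (both directions)
    spkts: int          # packets, source -> destination
    dpkts: int          # packets, destination -> source
    sbytes: int         # bytes, source -> destination
    dbytes: int         # bytes, destination -> source
    rate: float         # packets per second
    srate: float        # source-side packet rate
    drate: float        # destination-side packet rate
    attack: int         # binary label: 0 = normal, 1 = attack
    category: str       # main category, e.g. "DDoS"
    subcategory: str    # e.g. "DDoS_UDP"

    def bytes_per_packet(self) -> float:
        # Example of a simple derived feature of the kind used in
        # window-based aggregation.
        return self.bytes / self.pkts if self.pkts else 0.0
```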

The label taxonomy uses three granularity levels:

  • Binary: "Normal" vs "Attack"
  • Main category: Normal, DoS, DDoS, Reconnaissance, Theft
  • Sub-category: e.g., DoS_TCP, DoS_HTTP, DDoS_UDP, Reconnaissance_OS, Theft_Keylogging
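The three granularity levels map deterministically from sub-category up to main category and binary flag. A minimal lookup sketch (the exact sub-category strings are illustrative and should be checked against the CSV release):

```python
# Illustrative sub-category -> main-category mapping; verify the exact
# label strings against the dataset release before use.
SUB_TO_MAIN = {
    "Normal": "Normal",
    "DoS_TCP": "DoS", "DoS_UDP": "DoS", "DoS_HTTP": "DoS",
    "DDoS_TCP": "DDoS", "DDoS_UDP": "DDoS", "DDoS_HTTP": "DDoS",
    "Reconnaissance_OS": "Reconnaissance",
    "Reconnaissance_Service": "Reconnaissance",
    "Theft_Keylogging": "Theft",
    "Theft_Data_Exfiltration": "Theft",
}

def labels(subcategory: str) -> tuple[int, str, str]:
    """Return the (attack, category, subcategory) triple for one flow."""
    main = SUB_TO_MAIN[subcategory]
    return (0 if main == "Normal" else 1, main, subcategory)
```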

To support both binary and multiclass regimes, the dataset provides "attack" (0/1), "category", and "subcategory" fields. Table 1 summarizes the approximate class distribution as measured on the full dataset and common experimental subsets.

Class                 Count (full)    Count (5% sample)
Normal                9,543           477
DDoS_TCP              19,547,603      —
DDoS_UDP              18,965,106      —
DDoS_HTTP             19,771          —
DoS_TCP               12,315,997      —
DoS_UDP               20,659,491      —
DoS_HTTP              29,706          —
Scan/Reconnaissance   1,821,639       —
Keylogging            1,469           —
Data Theft            118             —
Total                 73,370,440      ~3,668,522

(— indicates per-class counts not reported for the 5% subset.)

(Koroniotis et al., 2018, Atuhurra et al., 27 Mar 2024)

The extremely low baseline prevalence of "Normal" traffic (0.012–0.5%, depending on the subset) persists across common experimental configurations and necessitates specialized preprocessing.

3. Preprocessing, Feature Engineering, and Class Imbalance Mitigation

Preprocessing steps are systematically described (Atuhurra et al., 27 Mar 2024, Pokhrel et al., 2021, Abushwereb et al., 2022):

  1. Data cleansing: Removal of records with missing or NaN values; string features (protocol, state, flags) mapped numerically.
  2. Normalization: Min–max scaling of all real-valued features to [0,1]; alternative pipelines apply z-score transformations.
  3. Feature selection: Common approaches include χ² tests, random forest/mutual information ranking, and evolutionary search (e.g., PSO) to reduce dimensionality; typical pipelines use only high-discriminative features such as bytes, sbytes, dbytes, rate, pkts, srate, drate, spkts, dpkts (Pokhrel et al., 2021, Atuhurra et al., 27 Mar 2024, Al-Othman et al., 2020).
  4. SMOTE oversampling: The Synthetic Minority Over-sampling Technique interpolates new normal-class samples by selecting the k = 5 nearest neighbors of each minority point x_i and generating synthetic flows via x' = x_i + γ · (x_l − x_i), where x_l is a randomly chosen neighbor and γ ~ Uniform(0, 1). Only the training set is oversampled; testing is always on the original class proportions (Pokhrel et al., 2021, Atuhurra et al., 27 Mar 2024).
  5. Train/test split: 70/30 or 80/20 random splits are customary, often accompanied by 5- or 10-fold cross-validation to estimate generalization in highly imbalanced data (Injadat et al., 2020, Abushwereb et al., 2022).
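The SMOTE interpolation in step 4 follows directly from its formula x' = x_i + γ(x_l − x_i). A minimal NumPy illustration (not the imbalanced-learn implementation):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples via SMOTE-style
    interpolation: x' = x_i + gamma * (x_l - x_i), gamma ~ U(0, 1).
    Minimal sketch for illustration only."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    out = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))                  # pick a minority point x_i
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all points
        neighbours = np.argsort(d)[1:k + 1]           # k nearest, excluding self
        l = rng.choice(neighbours)                    # random neighbour x_l
        gamma = rng.uniform(0.0, 1.0)
        out[j] = X_min[i] + gamma * (X_min[l] - X_min[i])
    return out
```

Only the training partition would be passed through this step; the test partition keeps its original class proportions.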

4. Experimental Methodologies and Detection Pipelines

A spectrum of machine learning and deep learning models has been evaluated on Bot-IoT flows.

Pipelines universally start with normalization and (optionally) feature selection, then address imbalance by SMOTE or undersampling, followed by model fitting and held-out evaluation. Principal metrics are accuracy, recall, precision, F₁-score, AUC, fall-out (FPR), and inference time. Confusion matrices with true/false positives/negatives guide per-class performance reporting, though published works sometimes omit minority-class statistics (Zamani et al., 2022).
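The principal metrics above all derive from the four confusion-matrix counts; a minimal sketch:

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary detection metrics from confusion-matrix counts
    (positive class = attack). Minimal sketch for illustration."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy":  (tp + tn) / total,
        "precision": precision,
        "recall":    recall,
        "f1":        f1,
        "fpr":       fp / (fp + tn) if fp + tn else 0.0,  # fall-out
    }
```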

5. Class Imbalance: Causes, Effects, and Remediation

The dataset's most salient statistical feature is its extreme class imbalance, with attack flows vastly outnumbering normal ones (7,687:1 in the public CSV subset; up to 99.99:0.01 in full volume) (Atuhurra et al., 27 Mar 2024, Koroniotis et al., 2018, Abushwereb et al., 2022). This biases trivial classifiers toward the majority class, inflates overall accuracy, and renders F₁ and ROC-AUC essential for meaningful comparison. For instance, training on the imbalanced set can yield FPRs exceeding 30–50% for minority (normal) flows even when recall approaches 100%, but SMOTE balancing reduces FPRs to 1.3% or lower, raises AUC to 99% or above, and achieves >99.97% accuracy with negligible inference-time overhead (Atuhurra et al., 27 Mar 2024).
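A back-of-envelope computation shows why accuracy alone is misleading at a 7,687:1 ratio: a classifier that labels every flow as an attack scores near-perfect accuracy while never detecting normal traffic.

```python
# Majority-class baseline on a set with ~7,687 attack flows per normal
# flow (the public CSV subset ratio cited above).
attack_flows, normal_flows = 7687, 1

# "Everything is an attack" classifier:
accuracy = attack_flows / (attack_flows + normal_flows)  # ~0.99987
recall_attack = 1.0   # every attack flow is caught
fpr = 1.0             # but every normal flow is also flagged
recall_normal = 0.0   # the normal class is never detected

assert accuracy > 0.9998  # near-perfect accuracy, useless detector
```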

SMOTE, under-sampling, and cost-sensitive learning are the dominant mechanisms for remediation. No work reports success with simple random over-/under-sampling alone: synthetic interpolation or ensemble methods are crucial (Injadat et al., 2020, Pokhrel et al., 2021). Alternative approaches—GAN-based augmentation, cost-sensitive boosting—are suggested as future extensions (Zamani et al., 2022), but benchmarks are overwhelmingly based on SMOTE or class balancing at the training stage.

6. Empirical Results and Comparative Analysis

Aggregated published results demonstrate high attainable performance across multiple classifiers when proper balancing and feature engineering are applied:

  • DRL–XGBoost: 99.994% accuracy, MSE 0.1315 (no ROC AUC or F₁ reported) (Zamani et al., 2022).
  • Random Forest, XGBoost, SVM (RBF/linear), Logistic Regression, MLP: >99.97% accuracy, FPR <1.3%, AUC ≈ 99% (SMOTE-balanced), as low as FPR ≈ 0.03% for MLP (Atuhurra et al., 27 Mar 2024).
  • KNN: On balanced sets, ~92% accuracy, AUC = 92.2%; was most robust to class imbalance among simple MLAs (Pokhrel et al., 2021).
  • SVM, RNN, LSTM: Binary accuracy >99.9% (full features), with F₁ and fall-out varying depending on feature selection strategies and class labeling (Koroniotis et al., 2018, Al-Othman et al., 2020).
  • Big Data platforms: Apache Spark MLlib with Random Forest attains F₁=99.7% on binary partial sets; Decision Tree achieves F₁=97.9% on full 74M records (Abushwereb et al., 2022).

Ensemble/hybrid models and deep learning approaches typically outperform single-method solutions, but precise gains are confounded by class imbalance effects and lack of universal reporting on per-class metrics. Intrusion detection performance for rare subcategories (e.g., keylogging, data theft) is consistently lower, with class-support limiting both multiclass F₁ and recall (Abushwereb et al., 2022, Al-Othman et al., 2020).

7. Best Practices and Limitations

Analyses consistently recommend:

  • Full documentation: Consult the original Bot-IoT-2018 release for exhaustive feature definitions, per-feature statistics, and label taxonomy (Koroniotis et al., 2018, Zamani et al., 2022).
  • Comprehensive evaluation: Always report F₁, precision/recall, and ROC AUC, not just accuracy, due to extreme imbalance (Zamani et al., 2022, Atuhurra et al., 27 Mar 2024).
  • Explicit data splits: Use and document fixed train/test splits or stratified k-fold CV for reproducibility. Hyperparameter choices (tree depth, learning rates, regularization) should be reported and tuned by grid or Bayesian optimization (Zamani et al., 2022).
  • Imbalance remediation: Adopt SMOTE or cost-sensitive approaches and validate on original-class-proportion test sets to avoid optimistic estimates.
  • Feature selection validation: Ablation experiments to confirm continued informativeness post-selection are recommended, especially when mixing tree-based and deep learning stages (Zamani et al., 2022).
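The stratified-split recommendation above can be sketched in a few lines; this is an illustrative NumPy version, not a replacement for a library splitter:

```python
import numpy as np

def stratified_split(y, test_frac=0.3, seed=0):
    """Per-class random split that preserves class proportions.
    Minimal sketch; in practice a library routine (e.g. a stratified
    k-fold splitter) would be used and documented for reproducibility."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # all indices of this class
        rng.shuffle(idx)
        n_test = max(1, int(round(test_frac * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)
```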

Limitations cited include the severe imbalance, under-representation of specific attack types, absence of device metadata, and inconsistencies in reporting per-class statistics. Realistic deployment demands further contextual features (e.g., device type), protocol expansion (e.g., MQTT, CoAP), and additional datasets for transferability assessment (Al-Othman et al., 2020, Koroniotis et al., 2018).


References:

  • Koroniotis et al., 2018
  • Al-Othman et al., 2020
  • Injadat et al., 2020
  • Pokhrel et al., 2021
  • Abushwereb et al., 2022
  • Zamani et al., 2022
  • Atuhurra et al., 27 Mar 2024
