InSDN Dataset for SDN Intrusion Detection
- InSDN is a benchmark dataset for SDN intrusion detection, offering flow-level time series data collected under normal and adversarial conditions.
- Associated research pairs it with adaptive Markov Transition Field (MTF) encoding and a Transformer-based model pipeline that captures temporal transitions and spatial protocol interactions for precise anomaly classification.
- The dataset facilitates robust evaluation under data-loss scenarios, providing a comprehensive testbed for machine learning models in identifying various network attacks.
The InSDN dataset is a benchmark dataset designed for research and evaluation in the field of Software-Defined Networking (SDN) intrusion detection. Developed by Elsayed et al. in 2020, InSDN provides comprehensive, realistic flow-level time series data collected from OpenFlow switches under both normal and various attack conditions, enabling advanced research in temporal pattern recognition, anomaly detection, and intrusion classification within SDN environments.
1. Dataset Origin and Purpose
The InSDN dataset was created to address the need for realistic, diverse, and temporally rich network flow data for SDN-specific intrusion detection research. Unlike traditional network intrusion datasets, InSDN emphasizes SDN architectures, with flow statistics gathered from OpenFlow switches operating under typical and adversarial workloads. The dataset encompasses normal (benign) network traffic, alongside a wide spectrum of attack patterns, including Denial of Service (DoS), Distributed DoS (DDoS), web exploits, password brute-force, probing activities, and protocol exploitation. Its primary use case is the supervised benchmarking of machine learning models—especially those capable of time series analysis—for SDN intrusion detection, as well as providing a testbed for research on resilience to data sparsity and incomplete observations (Joshi et al., 22 Aug 2025).
2. Traffic Classes, Attack Coverage, and Sampling Protocol
InSDN consists of 16 traffic classes: one normal class and 15 attack classes that span several common attack vectors and tools.
| Attack Family | Tools/Variants | # Classes |
|---|---|---|
| Denial-of-Service (DoS) | LOIC, Slowhttptest, HULK, Torshammer, Nping, Metasploit | 6 |
| Distributed DoS (DDoS) | Hping3 | 1 |
| Web Attacks | Metasploit, SQLmap | 2 |
| Brute-force | Burp Suite, Hydra, Metasploit | 3 |
| Probe | Nmap, Metasploit | 2 |
| Exploitation | Metasploit | 1 |
| Normal | - | 1 |
Each sample corresponds to a time slot of duration $\tau$, during which $N$ time series are recorded, one per switch link, each giving that link's packet-flow count over the slot. The total number of time slots and the exact feature cardinality are determined experimentally, but each slot forms an independent example for classification purposes. An 80/20 train/test split is standard for experiments; moreover, data-loss conditions are systematically simulated by randomly removing either 20% or 40% of the flow records per slot, representing typical SDN data-sparsity scenarios (resulting in the 80% and 60% data-accessibility benchmarks).
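The record-drop procedure above can be sketched in a few lines (a minimal illustration, assuming flows are held as a per-link list; the function name and the None-placeholder convention are ours, not from the dataset tooling):

```python
import random

def simulate_data_loss(flows, keep_fraction, seed=0):
    """Randomly drop flow records from one time slot, keeping `keep_fraction`.

    The paper's 80% and 60% accessibility settings correspond to
    keep_fraction = 0.8 and 0.6. Dropped entries are replaced by None
    so that link indices stay aligned across slots.
    """
    rng = random.Random(seed)
    n_keep = round(len(flows) * keep_fraction)
    kept = set(rng.sample(range(len(flows)), n_keep))
    return [f if i in kept else None for i, f in enumerate(flows)]
```

Keeping dropped positions as None (rather than deleting them) preserves the fixed per-link feature layout that downstream encoders expect.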
3. Feature Representation and Preprocessing
Flow-level features at each time slot $t$ are given by $\mathbf{x}_t = (x_{t,1}, \dots, x_{t,N})$, where $x_{t,i}$ is the packet count for link $i$. In addition, a spatial/protocol indicator matrix $\mathbf{M}$ is provided, statically annotating the existence of links between IP pairs and their protocol (TCP/UDP) via one-hot encoding. Critically, prior to encoding, the data are left unscaled, relying on raw counts and minimal preprocessing.
The principal transformation step is the Markov Transition Field (MTF) encoding:
- Adaptive Quantization: Data is quantized into $Q$ bins, where the bin boundaries are treated as learnable parameters and optimized during training via backpropagation.
- Transition Probabilities: For each quantized time series, the first-order transition probabilities $w_{ij} = P(q_{t+1} = j \mid q_t = i)$ are estimated from the normalized counts of bin-to-bin transitions across consecutive time steps.
- MTF Matrix Construction: The MTF entry at index $(k, l)$ encodes $M_{kl} = w_{q_k, q_l}$, the transition likelihood between the binned values at time steps $k$ and $l$. For multivariate flows, one MTF is constructed per link's series.
- Dimensionality Reduction: Gaussian blurring is applied to each MTF to smooth and reduce its spatial size without sacrificing transition structure.
The resulting blurred, quantized MTF matrices serve as the principal feature set for downstream classification.
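The quantization-and-transition steps above can be sketched in plain Python (a minimal, illustrative implementation: `mtf_encode` and its fixed quantile binning are our assumptions, standing in for the paper's learned, backprop-optimized bin boundaries, and the Gaussian-blur reduction step is omitted):

```python
def mtf_encode(series, n_bins=8):
    """Markov Transition Field for one univariate series (minimal sketch)."""
    ranked = sorted(series)
    n = len(series)

    def bin_of(v):
        # Quantile bin index of value v, clamped to [0, n_bins - 1].
        rank = sum(1 for r in ranked if r <= v) - 1
        return min(n_bins - 1, rank * n_bins // n)

    q = [bin_of(v) for v in series]

    # First-order transition counts between consecutive bins.
    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(q, q[1:]):
        counts[a][b] += 1.0

    # Row-normalize to transition probabilities w[i][j] = P(j | i).
    w = []
    for row in counts:
        s = sum(row)
        w.append([c / s if s else 0.0 for c in row])

    # MTF: M[k][l] = w[q_k][q_l], the transition likelihood between the
    # bins occupied at time steps k and l.
    return [[w[q[k]][q[l]] for l in range(n)] for k in range(n)]
```

In the paper's pipeline, each such $n \times n$ matrix would then be Gaussian-blurred and downsampled before being fed to the embedding stage.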
4. Model Input Encoding and Transformer Architecture
For SDN intrusion classification, MTF representations are mapped through a two-stage embedding and Transformer pipeline:
- Each Gaussian-blurred MTF (one per link $i$) is flattened into a 1D sequence whose length is the square of the blurred MTF's side length, then projected into a $d$-dimensional embedding space.
- Sinusoidal positional encodings, specified as $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$, are added for temporal position-awareness.
- The model consists of two stacked Multi-Head Self-Attention Transformer modules:
- The first (feature-wise Transformer) independently processes each embedded MTF sequence with 8 attention heads, producing per-link context vectors $h_i$.
- Feature vectors are concatenated as $H = [h_1; h_2; \dots; h_N]$.
- The second Transformer (combined stage) fuses the protocol/topology encoding $\mathbf{M}$ with $H$, enabling the network to learn interactions in both temporal transitions and spatial/protocol structure.
- Standard training hyperparameters include the AdamW optimizer, batch size 128, hidden size 512, 4 Transformer-MLP layers, dropout 0.2, and two fully connected output layers.
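The sinusoidal positional encodings used in the embedding stage follow the standard Transformer formulation and can be generated as below (a minimal sketch; `sinusoidal_pe` is an illustrative name, not from the paper's code):

```python
import math

def sinusoidal_pe(seq_len, d_model):
    """Standard sinusoidal positional encodings:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle).
    Returns a seq_len x d_model table to add element-wise to the embeddings.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Because the encodings are deterministic in the position index, the model remains position-aware without adding trainable parameters.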
5. Performance Benchmarks, Robustness, and Ablation
Experiments employing the InSDN dataset have thoroughly compared the proposed MTF-aided Transformer approach with baseline models (KNN, Random Forest, LSTM, Donut), using multi-class averaged Precision, Recall, and F1-score, as well as resource consumption.
| Data Accessibility | Model | Precision (%) | Recall (%) | F1-score (%) | Training Time (s) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| 100% | Proposed | 99.8 | 99.7 | 99.6 | 1200 | 8 |
| 100% | KNN | 92.1 | 91.5 | 90.8 | 900 | 15 |
| 100% | RF | 94.5 | 93.9 | 93.2 | 1100 | 12 |
| 100% | LSTM | 95.2 | 94.6 | 94.0 | 2500 | 20 |
| 100% | Donut | 96.3 | 95.7 | 95.0 | 2800 | 18 |
| 80% | Proposed | 98.5 | 98.3 | 98.2 | | |
| 60% | Proposed | 98.3 | 98.1 | 98.0 | | |
Ablation studies underscore the importance of the MTF transformation and the Transformer layers: omitting the MTF drops the F1-score to 94.9%, and removing the Transformer lowers it to 93.2%. The model's resilience to missing data is evidenced by a negligible F1-score reduction (from 99.6% at full data to 98.0% at 60% data accessibility), while the baseline models degrade substantially.
6. Data Access, Structure, and Usage
The InSDN dataset is distributed with the following structure for each experiment:
- Samples indexed by time slot $t$, each containing the $N$ per-link time series and the associated protocol/topology matrix $\mathbf{M}$.
- Labels for 16 classes at the slot level (benign, attack types).
- Data accessibility variants (100%, 80%, 60%), produced via random record-drop simulation.
- Standard train/test splits.
- Access is granted for research use with comprehensive metadata and class labels. This suggests that users can extend temporal slicing, integrate with external SDN controllers, or simulate different network sizes by varying the slot duration $\tau$ and the number of monitored links $N$.
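A per-slot sample as described above might be modeled like this (field names and types are illustrative assumptions, not the dataset's actual on-disk schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InSDNSlot:
    """One time-slot sample; None in `flows` marks a dropped flow record
    under the simulated data-loss conditions."""
    slot_id: int
    flows: List[Optional[List[int]]]  # N per-link packet-count series
    topology: List[List[int]]         # one-hot link/protocol indicator matrix
    label: int                        # 0 = normal, 1..15 = attack classes
```

Keeping the label at the slot level mirrors the slot-as-example convention used for the 80/20 train/test split.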
The dataset has been used to develop and evaluate temporal models for SDN anomaly and attack detection, facilitating research on the intersection of time series analytics and SDN security.
7. Significance and Future Directions
InSDN’s principal contribution lies in its realistic modeling of SDN operations, coverage of diverse attack classes, and explicit design for evaluating temporal sequence models under sparse data regimes. The empirical results highlight not only the utility of time series representations (specifically, adaptive MTF encoding), but also the effectiveness of Transformer architectures for intrusion detection in SDNs—especially for low-data scenarios typical in operational settings (Joshi et al., 22 Aug 2025). A plausible implication is that future research using InSDN can further investigate unsupervised anomaly detection, continual learning under topology changes, and real-time attack mitigation strategies tailored to SDN architectures.