
InSDN Dataset for SDN Intrusion Detection

Updated 14 November 2025
  • InSDN is a benchmark dataset for SDN intrusion detection, offering flow-level time series data collected under normal and adversarial conditions.
  • Recent work pairs the dataset with adaptive MTF encoding and a Transformer-based model pipeline that captures temporal transitions and spatial protocol interactions for precise anomaly classification.
  • The dataset facilitates robust evaluation under data-loss scenarios, providing a comprehensive testbed for machine learning models in identifying various network attacks.

The InSDN dataset is a benchmark for research and evaluation in Software-Defined Networking (SDN) intrusion detection. Developed by Elsayed et al. in 2020, InSDN provides comprehensive, realistic flow-level time series data collected from OpenFlow switches under both normal and attack conditions, enabling advanced research in temporal pattern recognition, anomaly detection, and intrusion classification within SDN environments.

1. Dataset Origin and Purpose

The InSDN dataset was created to address the need for realistic, diverse, and temporally rich network flow data for SDN-specific intrusion detection research. Unlike traditional network intrusion datasets, InSDN emphasizes SDN architectures, with flow statistics gathered from OpenFlow switches operating under typical and adversarial workloads. The dataset encompasses normal (benign) network traffic, alongside a wide spectrum of attack patterns, including Denial of Service (DoS), Distributed DoS (DDoS), web exploits, password brute-force, probing activities, and protocol exploitation. Its primary use case is the supervised benchmarking of machine learning models—especially those capable of time series analysis—for SDN intrusion detection, as well as providing a testbed for research on resilience to data sparsity and incomplete observations (Joshi et al., 22 Aug 2025).

2. Traffic Classes, Attack Coverage, and Sampling Protocol

InSDN consists of 16 traffic classes: one normal class and 15 attack classes that span several common attack vectors and tools.

| Attack Family | Tools/Variants | # Classes |
|---|---|---|
| Denial-of-Service (DoS) | LOIC, Slowhttptest, HULK, Torshammer, Nping, Metasploit | 6 |
| Distributed DoS (DDoS) | Hping3 | 1 |
| Web Attacks | Metasploit, SQLmap | 2 |
| Brute-force | Burp Suite, Hydra, Metasploit | 3 |
| Probe | Nmap, Metasploit | 2 |
| Exploitation | Metasploit | 1 |
| Normal | – | 1 |

Each sample corresponds to a time slot $t$ of duration $\tau$, during which $L = N^2$ multivariate time series are recorded (one per switch link, for a network of $N$ nodes), each giving the packet-flow count on that link. The total number of time slots and the exact feature cardinality are determined experimentally, but each slot forms an independent example for classification purposes. An 80/20 train/test split is standard for experiments; moreover, data-loss conditions are systematically simulated by randomly removing either 20% or 40% of the flow records per slot, representing typical SDN data-sparsity scenarios (yielding the 80% and 60% data-accessibility benchmarks). A rough simulation of this protocol is sketched below.
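The following minimal sketch builds one synthetic slot and applies the data-loss simulation described above; the dimensions (`N`, `TAU`), the Poisson traffic stand-in, and the NaN-masking convention are illustrative assumptions, not details of the dataset release.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper determines these experimentally.
N = 4            # number of nodes, so L = N**2 links
TAU = 64         # measurements per time slot
L = N * N

# One slot: L univariate series of packet-flow counts (synthetic stand-in).
X_t = rng.poisson(lam=20.0, size=(L, TAU)).astype(float)

def simulate_data_loss(X, drop_frac, rng):
    """Randomly drop a fraction of flow records in a slot (set to NaN),
    mimicking the 20%/40% removal behind the 80%/60% accessibility
    benchmarks."""
    mask = rng.random(X.shape) < drop_frac
    X_sparse = X.copy()
    X_sparse[mask] = np.nan
    return X_sparse

X_80 = simulate_data_loss(X_t, 0.20, rng)   # 80% accessibility variant
X_60 = simulate_data_loss(X_t, 0.40, rng)   # 60% accessibility variant
```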

3. Feature Representation and Preprocessing

Flow-level features at each time slot $t$ are given by $X_t = \{x_{t,1}, \ldots, x_{t,L}\}$, where $x_{t,l}$ is the packet count for link $l$. In addition, a spatial/protocol indicator matrix $S_t \in \mathbb{R}^{N \times N \times 2}$ is provided, statically annotating the existence of links between IP pairs and the protocol (TCP/UDP) via one-hot encoding. Critically, prior to encoding, the data is left unscaled, using raw counts with minimal preprocessing. A rough construction of the indicator matrix is sketched below.
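This sketch builds such an indicator matrix; the node indexing, the link-list format, and the TCP/UDP channel convention are assumptions made for illustration.

```python
import numpy as np

def build_indicator(n_nodes, links):
    """Static S_t in R^{N x N x 2}: channel 0 marks a TCP link between a
    node (IP) pair, channel 1 a UDP link. Indexing and one-hot layout are
    illustrative assumptions."""
    S = np.zeros((n_nodes, n_nodes, 2), dtype=np.int8)
    proto_channel = {"TCP": 0, "UDP": 1}
    for src, dst, proto in links:
        S[src, dst, proto_channel[proto]] = 1
    return S

# Example: four nodes, one TCP link and one UDP link.
S_t = build_indicator(4, [(0, 1, "TCP"), (2, 3, "UDP")])
print(S_t.shape)   # (4, 4, 2)
```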

The principal transformation step is the Markov Transition Field (MTF) encoding:

  • Adaptive Quantization: Data is quantized into $Q$ bins, where the bin boundaries $\theta = \{\theta_1, \ldots, \theta_Q\}$ are optimized during training via backpropagation: $\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \, \frac{\partial \mathcal{L}}{\partial \theta_i}$.
  • Transition Probabilities: For each quantized time series, the first-order transition probabilities $P(q_i \mid q_j)$ are estimated as

$$P(q_i \mid q_j) = \frac{\operatorname{count}\left(x_k \in q_j \,\wedge\, x_{k+1} \in q_i\right)}{\operatorname{count}\left(x_k \in q_j\right)},$$

    i.e., the fraction of transitions out of bin $q_j$ that land in bin $q_i$.
  • MTF Matrix Construction: The MTF entry at index $(i, j)$ encodes $P(x_{t+1} = q_j \mid x_t = q_i)$, capturing the transition likelihood between each pair of binned values. For multivariate flows, $\mathrm{MTF}(X_t) = \{M_{t,1}, \ldots, M_{t,L}\}$.
  • Dimensionality Reduction: A Gaussian blur $G(x_1, x_2) = \frac{1}{2\pi\sigma^2} \exp\!\left[-\frac{x_1^2 + x_2^2}{2\sigma^2}\right]$ is applied to each MTF to smooth and reduce its spatial size without sacrificing transition structure.

The resulting blurred, quantized MTF matrices serve as the principal feature set for downstream classification. A minimal sketch of the encoding is given below.
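This sketch implements the encoding with fixed quantile bins standing in for the learned (backprop-optimized) boundaries, and a `scipy` Gaussian blur; the bin count `Q` and `sigma` are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mtf_encode(x, Q=8, sigma=1.0):
    """Markov Transition Field encoding of a 1D series followed by a
    Gaussian blur. Quantile bins are a simplification of the paper's
    learned bin boundaries."""
    # 1) Quantize into Q bins via Q - 1 interior quantile edges.
    edges = np.quantile(x, np.linspace(0, 1, Q + 1)[1:-1])
    q = np.digitize(x, edges)                   # bin index per time step

    # 2) First-order transition probabilities P(q_i -> q_j).
    counts = np.zeros((Q, Q))
    for a, b in zip(q[:-1], q[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    P = np.divide(counts, row_sums, out=np.zeros_like(counts),
                  where=row_sums > 0)

    # 3) MTF: entry (k, m) is the transition probability between the
    #    bins occupied at time steps k and m.
    M = P[q[:, None], q[None, :]]

    # 4) Gaussian blur to smooth before spatial downsampling.
    return gaussian_filter(M, sigma=sigma)

# Encode a synthetic packet-count series.
rng = np.random.default_rng(0)
M = mtf_encode(rng.poisson(20.0, size=64).astype(float))
print(M.shape)   # (64, 64)
```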

4. Model Input Encoding and Transformer Architecture

For SDN intrusion classification, MTF representations are mapped through a two-stage embedding and Transformer pipeline:

  • Each Gaussian-blurred MTF $M_{t,l}$ (per link $l$) is flattened into a 1D sequence (length approximately $\lceil \tau / Q \rceil$) and projected into a $d_{\text{model}} = 512$-dimensional embedding space.
  • Sinusoidal positional encodings, specified as $PE(pos, 2i) = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$ and $PE(pos, 2i+1) = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$, are added for temporal position-awareness.
  • The model consists of two stacked Multi-Head Self-Attention Transformer modules:
    • The first (feature-wise Transformer) independently processes each embedded $M_{t,l}$ with 8 attention heads, producing context vectors $C_{t,l}$.
    • The feature vectors $\{C_{t,1}, \ldots, C_{t,L}\}$ are concatenated into $E_t$.
    • The second Transformer (combined stage) fuses the protocol/topology encoding $S_t$ with $E_t$ into $\hat{E}_t$, enabling the network to learn interactions in both temporal transitions and spatial/protocol structure.
  • Standard training hyperparameters include the AdamW optimizer (learning rate $5 \times 10^{-4}$), batch size 128, hidden size 512, 4 Transformer-MLP layers, dropout 0.2, and two fully connected output layers. A schematic implementation of the pipeline follows.
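The sketch below is one plausible PyTorch realization of this pipeline; the token construction, mean-pooling, additive fusion of $S_t$, and output-head layout are assumptions made for illustration rather than details confirmed by the source.

```python
import math
import torch
import torch.nn as nn

class TwoStageSDNClassifier(nn.Module):
    """Schematic two-stage Transformer over flattened MTF features."""
    def __init__(self, seq_len, n_classes,
                 d_model=512, n_heads=8, dropout=0.2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)     # project scalar tokens
        # Sinusoidal positional encoding, per the formulas above.
        pe = torch.zeros(seq_len, d_model)
        pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

        def encoder():
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dropout=dropout, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=1)

        self.feature_tf = encoder()    # per-link, feature-wise stage
        self.combined_tf = encoder()   # combined spatial/protocol stage
        self.head = nn.Sequential(     # two fully connected output layers
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_model, n_classes))

    def forward(self, mtf_seqs, s_t):
        # mtf_seqs: (batch, L, seq_len) flattened blurred MTFs per link
        # s_t:      (batch, L, d_model) topology/protocol encoding,
        #           assumed already projected to d_model
        B, L, T = mtf_seqs.shape
        x = self.embed(mtf_seqs.reshape(B * L, T, 1)) + self.pe
        c = self.feature_tf(x).mean(dim=1)     # context vector C_{t,l}
        e_t = c.reshape(B, L, -1)              # concatenated E_t
        e_hat = self.combined_tf(e_t + s_t)    # fuse S_t with E_t
        return self.head(e_hat.mean(dim=1))    # class logits
```

Training would then pair this model with `torch.optim.AdamW(model.parameters(), lr=5e-4)` and batch size 128, matching the hyperparameters reported above.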

5. Performance Benchmarks, Robustness, and Ablation

Experiments employing the InSDN dataset have thoroughly compared the proposed MTF-aided Transformer approach with baseline models (KNN, Random Forest, LSTM, Donut), using multi-class averaged Precision, Recall, and F1-score, as well as resource consumption.

| Data Accessibility | Model | Precision (%) | Recall (%) | F1-score (%) | Training Time (s) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| 100% | Proposed | 99.8 | 99.7 | 99.6 | 1200 | 8 |
| 100% | KNN | 92.1 | 91.5 | 90.8 | 900 | 15 |
| 100% | RF | 94.5 | 93.9 | 93.2 | 1100 | 12 |
| 100% | LSTM | 95.2 | 94.6 | 94.0 | 2500 | 20 |
| 100% | Donut | 96.3 | 95.7 | 95.0 | 2800 | 18 |
| 80% | Proposed | 98.5 | 98.3 | 98.2 | – | – |
| 60% | Proposed | 98.3 | 98.1 | 98.0 | – | – |

Ablation studies underscore the importance of the MTF transformation and the Transformer layers: omitting the MTF drops the F1-score to 94.9%, and removing the Transformer lowers it to 93.2%. The model's resilience to missing data is evidenced by a negligible F1-score reduction (from 99.6% at full accessibility to 98.0% at 60% accessibility), while the baseline models degrade substantially.
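For reference, the multi-class averaged metrics used above can be computed as in this sketch; macro averaging is assumed here, since the source does not specify the averaging scheme.

```python
from sklearn.metrics import precision_recall_fscore_support

# y_true, y_pred: integer labels over the 16 InSDN classes (toy values).
y_true = [0, 3, 3, 7, 15, 0]
y_pred = [0, 3, 1, 7, 15, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```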

6. Data Access, Structure, and Usage

The InSDN dataset is distributed with the following structure for each experiment:

  • Samples indexed by time slot $t$, each containing $L$ time series $x_{t,l}$ and an associated protocol/topology matrix $S_t$.
  • Labels for 16 classes at the slot level (benign, attack types).
  • Data accessibility variants (100%, 80%, 60% drop simulation).
  • Standard train/test splits.
  • Access is granted for research use with comprehensive metadata and class labels. This suggests that users can extend temporal slicing, integrate with external SDN controllers, or simulate different network sizes by varying $N$ and $\tau$. A hypothetical per-sample layout is sketched after this list.
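Purely as an illustration of this structure (the actual file format and field names of the release may differ), one sample might be held in memory as:

```python
import numpy as np

# Hypothetical layout for one InSDN time slot; field names and storage
# format are assumptions, not taken from the release.
N, TAU = 4, 64            # assumed network size and slot duration
L = N * N                 # one series per switch link

sample = {
    "t": 0,                                   # time-slot index
    "X": np.zeros((L, TAU)),                  # packet counts x_{t,l}
    "S": np.zeros((N, N, 2), dtype=np.int8),  # link/protocol one-hot S_t
    "label": "Normal",                        # one of the 16 slot labels
    "accessibility": 1.0,                     # 1.0, 0.8, or 0.6 variant
}
```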

The dataset has been used to develop and evaluate temporal models for SDN anomaly and attack detection, facilitating research on the intersection of time series analytics and SDN security.

7. Significance and Future Directions

InSDN’s principal contribution lies in its realistic modeling of SDN operations, coverage of diverse attack classes, and explicit design for evaluating temporal sequence models under sparse data regimes. The empirical results highlight not only the utility of time series representations (specifically, adaptive MTF encoding), but also the effectiveness of Transformer architectures for intrusion detection in SDNs—especially for low-data scenarios typical in operational settings (Joshi et al., 22 Aug 2025). A plausible implication is that future research using InSDN can further investigate unsupervised anomaly detection, continual learning under topology changes, and real-time attack mitigation strategies tailored to SDN architectures.
