InSDN Dataset for SDN Intrusion Detection
- InSDN is a benchmark dataset for SDN intrusion detection, offering flow-level time series data collected under normal and adversarial conditions.
- Associated research pairs it with adaptive Markov Transition Field (MTF) encoding and a Transformer-based model pipeline that captures temporal transitions and spatial protocol interactions for precise anomaly classification.
- The dataset facilitates robust evaluation under data-loss scenarios, providing a comprehensive testbed for machine learning models in identifying various network attacks.
The InSDN dataset is a benchmark dataset designed for research and evaluation in the field of Software-Defined Networking (SDN) intrusion detection. Developed by Elsayed et al. in 2020, InSDN provides comprehensive, realistic flow-level time series data collected from OpenFlow switches under both normal and various attack conditions, enabling advanced research in temporal pattern recognition, anomaly detection, and intrusion classification within SDN environments.
1. Dataset Origin and Purpose
The InSDN dataset was created to address the need for realistic, diverse, and temporally rich network flow data for SDN-specific intrusion detection research. Unlike traditional network intrusion datasets, InSDN emphasizes SDN architectures, with flow statistics gathered from OpenFlow switches operating under typical and adversarial workloads. The dataset encompasses normal (benign) network traffic, alongside a wide spectrum of attack patterns, including Denial of Service (DoS), Distributed DoS (DDoS), web exploits, password brute-force, probing activities, and protocol exploitation. Its primary use case is the supervised benchmarking of machine learning models—especially those capable of time series analysis—for SDN intrusion detection, as well as providing a testbed for research on resilience to data sparsity and incomplete observations (Joshi et al., 22 Aug 2025).
2. Traffic Classes, Attack Coverage, and Sampling Protocol
InSDN consists of 16 traffic classes: one normal class and 15 attack classes that span several common attack vectors and tools.
| Attack Family | Tools/Variants | # Classes |
|---|---|---|
| Denial-of-Service (DoS) | LOIC, Slowhttptest, HULK, Torshammer, Nping, Metasploit | 6 |
| Distributed DoS (DDoS) | Hping3 | 1 |
| Web Attacks | Metasploit, SQLmap | 2 |
| Brute-force | Burp Suite, Hydra, Metasploit | 3 |
| Probe | Nmap, Metasploit | 2 |
| Exploitation | Metasploit | 1 |
| Normal | - | 1 |
Each sample corresponds to a time slot of duration $\tau$, during which $N$ time series are recorded, one per switch link, each giving that link's packet-flow count over the slot. The total number of time slots and the exact feature cardinality are determined experimentally, but each slot forms an independent example for classification purposes. An 80/20 train/test split is standard for experiments; moreover, data-loss conditions are systematically simulated by randomly removing either 20% or 40% of the flow records per slot, representing typical SDN data-sparsity scenarios (resulting in the 80% and 60% data-accessibility benchmarks).
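The record-drop procedure above can be sketched in a few lines (a minimal illustration, assuming flows are held as a per-link list; the function name and the None-placeholder convention are ours, not from the dataset tooling):

```python
import random

def simulate_data_loss(flows, keep_fraction, seed=0):
    """Randomly drop flow records from one time slot, keeping `keep_fraction`.

    The paper's 80% and 60% accessibility settings correspond to
    keep_fraction = 0.8 and 0.6. Dropped entries are replaced by None
    so that link indices stay aligned across slots.
    """
    rng = random.Random(seed)
    n_keep = round(len(flows) * keep_fraction)
    kept = set(rng.sample(range(len(flows)), n_keep))
    return [f if i in kept else None for i, f in enumerate(flows)]
```

Keeping dropped positions as None (rather than deleting them) preserves the fixed per-link feature layout that downstream encoders expect.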
3. Feature Representation and Preprocessing
Flow-level features at each time slot $t$ are given by $\mathbf{x}_t = (x_{t,1}, \dots, x_{t,N})$, where $x_{t,i}$ is the packet count for link $i$. In addition, a spatial/protocol indicator matrix $\mathbf{M}$ is provided, statically annotating the existence of links between IP pairs and their protocol (TCP/UDP) via one-hot encoding. Critically, prior to encoding, the data are left unscaled, relying on raw counts and minimal preprocessing.
The principal transformation step is the Markov Transition Field (MTF) encoding:
- Adaptive Quantization: Data is quantized into $Q$ bins, where the bin boundaries are treated as learnable parameters and optimized during training via backpropagation.
- Transition Probabilities: For each quantized time series, the first-order transition probabilities $w_{ij} = P(q_{t+1} = j \mid q_t = i)$ are estimated from the normalized counts of bin-to-bin transitions across consecutive time steps.
- MTF Matrix Construction: The MTF entry at index $(k, l)$ encodes $M_{kl} = w_{q_k, q_l}$, the transition likelihood between the binned values at time steps $k$ and $l$. For multivariate flows, one MTF is constructed per link's series.
- Dimensionality Reduction: Gaussian blurring is applied to each MTF to smooth and reduce its spatial size without sacrificing transition structure.
The resulting blurred, quantized MTF matrices serve as the principal feature set for downstream classification.
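The quantization-and-transition steps above can be sketched in plain Python (a minimal, illustrative implementation: `mtf_encode` and its fixed quantile binning are our assumptions, standing in for the paper's learned, backprop-optimized bin boundaries, and the Gaussian-blur reduction step is omitted):

```python
def mtf_encode(series, n_bins=8):
    """Markov Transition Field for one univariate series (minimal sketch)."""
    ranked = sorted(series)
    n = len(series)

    def bin_of(v):
        # Quantile bin index of value v, clamped to [0, n_bins - 1].
        rank = sum(1 for r in ranked if r <= v) - 1
        return min(n_bins - 1, rank * n_bins // n)

    q = [bin_of(v) for v in series]

    # First-order transition counts between consecutive bins.
    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(q, q[1:]):
        counts[a][b] += 1.0

    # Row-normalize to transition probabilities w[i][j] = P(j | i).
    w = []
    for row in counts:
        s = sum(row)
        w.append([c / s if s else 0.0 for c in row])

    # MTF: M[k][l] = w[q_k][q_l], the transition likelihood between the
    # bins occupied at time steps k and l.
    return [[w[q[k]][q[l]] for l in range(n)] for k in range(n)]
```

In the paper's pipeline, each such $n \times n$ matrix would then be Gaussian-blurred and downsampled before being fed to the embedding stage.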
4. Model Input Encoding and Transformer Architecture
For SDN intrusion classification, MTF representations are mapped through a two-stage embedding and Transformer pipeline:
- Each Gaussian-blurred MTF (one per link $i$) is flattened into a 1D sequence whose length is the square of the blurred MTF's side length, then projected into a $d$-dimensional embedding space.
- Sinusoidal positional encodings, specified as $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$, are added for temporal position-awareness.
- The model consists of two stacked Multi-Head Self-Attention Transformer modules:
- The first (feature-wise Transformer) independently processes each embedded MTF sequence with 8 attention heads, producing per-link context vectors $h_i$.
- Feature vectors are concatenated as $H = [h_1; h_2; \dots; h_N]$.
- The second Transformer (combined stage) fuses the protocol/topology encoding $\mathbf{M}$ with $H$, enabling the network to learn interactions in both temporal transitions and spatial/protocol structure.
- Standard training hyperparameters include the AdamW optimizer, batch size 128, hidden size 512, 4 Transformer-MLP layers, dropout 0.2, and two fully connected output layers.
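The sinusoidal positional encodings used in the embedding stage follow the standard Transformer formulation and can be generated as below (a minimal sketch; `sinusoidal_pe` is an illustrative name, not from the paper's code):

```python
import math

def sinusoidal_pe(seq_len, d_model):
    """Standard sinusoidal positional encodings:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle).
    Returns a seq_len x d_model table to add element-wise to the embeddings.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Because the encodings are deterministic in the position index, the model remains position-aware without adding trainable parameters.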
5. Performance Benchmarks, Robustness, and Ablation
Experiments employing the InSDN dataset have thoroughly compared the proposed MTF-aided Transformer approach with baseline models (KNN, Random Forest, LSTM, Donut), using multi-class averaged Precision, Recall, and F1-score, as well as resource consumption.
| Data Accessibility | Model | Precision (%) | Recall (%) | F1-score (%) | Training Time (s) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| 100% | Proposed | 99.8 | 99.7 | 99.6 | 1200 | 8 |
| 100% | KNN | 92.1 | 91.5 | 90.8 | 900 | 15 |
| 100% | RF | 94.5 | 93.9 | 93.2 | 1100 | 12 |
| 100% | LSTM | 95.2 | 94.6 | 94.0 | 2500 | 20 |
| 100% | Donut | 96.3 | 95.7 | 95.0 | 2800 | 18 |
| 80% | Proposed | 98.5 | 98.3 | 98.2 | | |
| 60% | Proposed | 98.3 | 98.1 | 98.0 | | |
Ablation studies underscore the importance of the MTF transformation and the Transformer layers: omitting the MTF drops the F1-score to 94.9%, and removing the Transformer lowers it to 93.2%. The model's resilience to missing data is evidenced by a negligible F1-score reduction (from 99.6% at full data to 98.0% at 60% data accessibility), while the baseline models degrade substantially.
6. Data Access, Structure, and Usage
The InSDN dataset is distributed with the following structure for each experiment:
- Samples indexed by time slot $t$, each containing the $N$ per-link time series and the associated protocol/topology matrix $\mathbf{M}$.
- Labels for 16 classes at the slot level (benign, attack types).
- Data accessibility variants (100%, 80%, 60%), produced via random record-drop simulation.
- Standard train/test splits.
- Access is granted for research use with comprehensive metadata and class labels. This suggests that users can extend temporal slicing, integrate with external SDN controllers, or simulate different network sizes by varying the slot duration $\tau$ and the number of monitored links $N$.
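A per-slot sample as described above might be modeled like this (field names and types are illustrative assumptions, not the dataset's actual on-disk schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InSDNSlot:
    """One time-slot sample; None in `flows` marks a dropped flow record
    under the simulated data-loss conditions."""
    slot_id: int
    flows: List[Optional[List[int]]]  # N per-link packet-count series
    topology: List[List[int]]         # one-hot link/protocol indicator matrix
    label: int                        # 0 = normal, 1..15 = attack classes
```

Keeping the label at the slot level mirrors the slot-as-example convention used for the 80/20 train/test split.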
The dataset has been used to develop and evaluate temporal models for SDN anomaly and attack detection, facilitating research on the intersection of time series analytics and SDN security.
7. Significance and Future Directions
InSDN’s principal contribution lies in its realistic modeling of SDN operations, coverage of diverse attack classes, and explicit design for evaluating temporal sequence models under sparse data regimes. The empirical results highlight not only the utility of time series representations (specifically, adaptive MTF encoding), but also the effectiveness of Transformer architectures for intrusion detection in SDNs—especially for low-data scenarios typical in operational settings (Joshi et al., 22 Aug 2025). A plausible implication is that future research using InSDN can further investigate unsupervised anomaly detection, continual learning under topology changes, and real-time attack mitigation strategies tailored to SDN architectures.