GeNIS Dataset: Benchmark for NIDS
- GeNIS is a modular, flow-based benchmark that simulates realistic network traffic with both benign activity and multi-stage attack campaigns.
- The dataset applies rigorous preprocessing and feature selection, ensuring robust evaluation of machine learning and deep learning models for network intrusion detection.
- Baseline models, including Random Forest, XGBoost, and LightGBM, achieved near-perfect F1 scores, demonstrating the dataset’s practical impact in modern cybersecurity research.
The GeNIS dataset is a modular, flow-based benchmark designed to support artificial intelligence-driven network intrusion detection systems (NIDS) in small-to-medium enterprise (SME) environments. It was specifically created to address the lack of diverse, recent, and behaviorally rich network datasets necessary for reliable evaluation and training of ML and deep learning (DL) models in cybersecurity contexts (Silva et al., 11 Nov 2025). GeNIS simulates realistic user and administrator activity as well as eight multi-stage attack campaigns, providing comprehensive time-based and quantity-based behavioral features to enable robust cyberattack classification.
1. Collection Methodology and Dataset Composition
GeNIS was constructed by scripting “user” and “admin” behaviors over working weekdays, alongside generation of idle background traffic on weekends. Two Kali Linux attacker machines, placed in separate subnets, executed multi-stage attack scenarios including reconnaissance (e.g., port scanning, banner grabbing), bruteforce authentication attempts (SSH, FTP, Active Directory), and diverse denial-of-service (SYN floods, HTTP floods, DNS amplification). This design ensures that both legitimate and malicious activity are rooted in realistic operational context.
Raw packets were collected and aggregated into flows using the HERA flow exporter over 60-second intervals, yielding an initial set of 125 features per flow (3 ground-truth labels and 122 behavioral/contextual fields). Post-processing included removal of empty data, exclusion of topology-specific identifiers (such as IP addresses, MAC addresses, and VLAN tags), and one-hot encoding of the “State,” “Flags,” and “Protocol” fields for categorical coverage. The final feature set per flow comprises 87 numeric fields. The data splits are as follows:
| Split | Training Flows | Testing Flows |
|---|---|---|
| Binary | 294,844 | 73,712 |
| Multi | 294,844 | 73,712 |
Attack–class distributions for multiclass tasks are dominated by Denial-of-Service flows, with smaller but substantial representation for reconnaissance and bruteforce campaigns.
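HERA's internal flow keying is not detailed here; a minimal sketch of 60-second flow aggregation, assuming a conventional 5-tuple key and a hypothetical packet schema (`ts`, `src`, `dst`, `sport`, `dport`, `proto`), might look like:

```python
from collections import defaultdict

WINDOW = 60.0  # seconds, matching the 60-second export interval described above

def aggregate_flows(packets):
    """Group raw packets into flows keyed by the 5-tuple plus a
    60-second time window. The packet dict schema is a hypothetical
    stand-in for the exporter's internal representation."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"],
               pkt["proto"], int(pkt["ts"] // WINDOW))
        flows[key].append(pkt)
    return dict(flows)
```

Each resulting flow is then summarized into the behavioral feature vector described in the next section.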
2. Feature Taxonomy and Definitions
Behavioral features in GeNIS are organized into five groups, with the dataset emphasizing time-based and quantity-based metrics for flow characterization. Each flow is modeled as a sequence of packets $p_1, \dots, p_n$, where packet $p_i$ arrives at timestamp $t_i$, carries $s_i$ bytes, and has direction $d_i \in \{\text{src}\to\text{dst},\ \text{dst}\to\text{src}\}$.
Time-Based Features (29 extracted):
- Flow duration: $D = t_n - t_1$
- Packet inter-arrival statistics:
  - $\mathrm{IAT}_i = t_{i+1} - t_i$, with $\min_i \mathrm{IAT}_i$ and $\max_i \mathrm{IAT}_i$
  - Mean and sum of inter-packet intervals: $\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{IAT}_i$ and $\sum_{i=1}^{n-1} \mathrm{IAT}_i$
- Runtime vs idle time: $T_{\text{idle}} = \sum_{i:\, \mathrm{IAT}_i > \tau} \mathrm{IAT}_i$ (for inactivity threshold $\tau$), with runtime $D - T_{\text{idle}}$
- Start time (timestamp $t_1$ of $p_1$)
Quantity-Based Features (38 extracted):
- Packet and byte counts by direction:
  - $N = n$; $N_{\text{fwd}}$, $N_{\text{bwd}}$ (packets per direction)
  - $B = \sum_{i=1}^{n} s_i$; $B_{\text{fwd}}$, $B_{\text{bwd}}$ (bytes per direction)
- TCP-level and application-layer volumes:
  - Header bytes per direction: $H_{\text{fwd}}$, $H_{\text{bwd}}$
  - Payload bytes per direction: $P_{\text{fwd}}$, $P_{\text{bwd}}$
- Initial TCP window sizes: $W_{\text{fwd}}$, $W_{\text{bwd}}$
- Composite rates:
  - Byte rate: $B / D$
  - Packet rate: $N / D$
  - Directional rates: $B_{\text{fwd}}/D$, $B_{\text{bwd}}/D$, $N_{\text{fwd}}/D$, $N_{\text{bwd}}/D$
This structure facilitates detection of behavioral anomalies that may signal sophisticated attacks, and by removing topology-dependent features, supports dataset transferability and generalization.
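Under the packet model above, per-flow feature extraction can be sketched as follows (the dictionary keys and the default inactivity threshold are illustrative, not the exact GeNIS field names):

```python
def flow_features(timestamps, sizes, directions, idle_threshold=1.0):
    """Compute a subset of GeNIS-style time- and quantity-based
    features for one flow. `directions` holds 'fwd'/'bwd' per packet;
    names and the idle threshold are illustrative."""
    duration = timestamps[-1] - timestamps[0]
    iats = [b - a for a, b in zip(timestamps, timestamps[1:])]
    idle = sum(gap for gap in iats if gap > idle_threshold)
    n_fwd = sum(1 for d in directions if d == "fwd")
    b_fwd = sum(s for s, d in zip(sizes, directions) if d == "fwd")
    total_bytes = sum(sizes)
    return {
        "duration": duration,
        "iat_min": min(iats) if iats else 0.0,
        "iat_max": max(iats) if iats else 0.0,
        "iat_mean": sum(iats) / len(iats) if iats else 0.0,
        "idle_time": idle,
        "runtime": duration - idle,
        "pkts_fwd": n_fwd,
        "pkts_bwd": len(directions) - n_fwd,
        "bytes_fwd": b_fwd,
        "bytes_bwd": total_bytes - b_fwd,
        "byte_rate": total_bytes / duration if duration > 0 else 0.0,
        "pkt_rate": len(timestamps) / duration if duration > 0 else 0.0,
    }
```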
3. Attack Classes, Labeling, and Split Distributions
GeNIS supports both binary (Benign vs Malicious) and multiclass (Benign, DoS, Reconnaissance, Bruteforce) classification tasks. Labeling was performed at the flow level, using the operational context of each flow to determine the ground truth category.
| Class | Training Flows | Testing Flows | Distribution (%) (Train) |
|---|---|---|---|
| DoS | 236,512 | 59,128 | 80.22 |
| Recon | 22,186 | 5,547 | 7.52 |
| Bruteforce | 14,426 | 3,607 | 4.89 |
| Benign | 21,720 | 5,430 | 7.37 |
In binary mode, “Malicious” encompasses any of the three attack categories. This class balance reflects real-world attack patterns, notably the predominance of DoS-like activity, while maintaining sufficient representation in other categories for robust multiclass training and benchmarking.
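The binary labels and the table's percentages follow mechanically from the multiclass counts; a small sketch (class names as in the table above):

```python
ATTACK_CLASSES = {"DoS", "Recon", "Bruteforce"}

def to_binary(label):
    """Collapse a GeNIS multiclass label into the binary task:
    any of the three attack classes maps to 'Malicious'."""
    return "Malicious" if label in ATTACK_CLASSES else "Benign"

# Training-split counts from the table above; the distribution
# percentages follow directly from these totals.
TRAIN_COUNTS = {"DoS": 236_512, "Recon": 22_186,
                "Bruteforce": 14_426, "Benign": 21_720}
```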
4. Preprocessing and Feature Selection Protocols
The dataset’s authors implemented a rigorous preprocessing pipeline:
- Automated removal of empty rows/columns.
- Exclusion of topology/contextual fields to prevent overfitting and hardcoded IP-based shortcuts.
- One-hot encoding for categorical variables and feature standardization (zero mean, unit variance).
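The encoding and standardization steps can be sketched with dependency-free stand-ins for the usual library transformers (function names are illustrative):

```python
def standardize(column):
    """Zero-mean, unit-variance scaling for one numeric column,
    a from-scratch stand-in for a standard scaler."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    std = var ** 0.5 or 1.0  # guard against constant columns
    return [(x - mean) / std for x in column]

def one_hot(values, categories=None):
    """One-hot encode a categorical column such as 'Protocol',
    'State', or 'Flags'."""
    cats = sorted(categories or set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]
```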
For feature selection, five statistical techniques were independently computed and their normalized importance scores aggregated:
- Information Gain: $IG(Y; X) = H(Y) - H(Y \mid X)$, quantifying the entropy reduction from knowing feature $X$.
- Chi-Squared Test: $\chi^2 = \sum_{b}\sum_{c} \frac{(O_{bc} - E_{bc})^2}{E_{bc}}$ over bins $b$ and classes $c$.
- Recursive Feature Elimination (RFE): Iterative backward elimination using a tree-ensemble base estimator.
- Mean Absolute Deviation (MAD): $\mathrm{MAD}(X) = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$.
- Dispersion Ratio (DR): $DR(X) = \sigma_B^2 / \sigma_W^2$, where $\sigma_B^2$ quantifies between-class variance relative to within-class variance $\sigma_W^2$.
The union of the top 16 features by cumulative score (approximately 70% cumulative importance) formed the behavior-focused final subsets for both classification regimes.
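The aggregation step can be sketched as follows, assuming each selection method yields a per-feature score dict (the min-max normalization scheme and the `k` cutoff are illustrative):

```python
def aggregate_scores(score_tables, k=16):
    """Combine per-method importance scores: min-max normalize each
    method's scores to [0, 1], sum across methods, and keep the top-k
    features by cumulative score."""
    totals = {}
    for scores in score_tables:  # one dict per selection method
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against constant scores
        for feat, s in scores.items():
            totals[feat] = totals.get(feat, 0.0) + (s - lo) / span
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:k]
```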
5. Baseline Classification Results and Model Architectures
Five classifiers were validated: three tree ensemble models (Random Forest, XGBoost, LightGBM) and two DL architectures (LSTM, MLP). Representative configurations:
- Random Forest: 100 trees, max_depth = 16, min_samples_leaf = 1
- XGBoost: 100 estimators, max_depth ∈ [4, 16], colsample_bytree ∈ [0.8, 0.9]
- LightGBM: 100 trees, num_leaves = 15, feature_fraction = 0.8
- LSTM: 64/128 hidden units, dense output layer, dropout = 0.2, Adam optimizer, batch size 32, up to 30 epochs
- MLP: two hidden dense layers (64/128 and 32/64 units), trained identically to the LSTM
All inputs were standardized.
Classification Performance Metrics (best on all 55 features):
| Task | Model | F1 (%) | Accuracy (%) | FPR (%) |
|---|---|---|---|---|
| Binary | RF/XGB | 99.995 | 99.99 | 0.13 |
| Binary | LGBM | 99.99 | – | – |
| Binary | DL (LSTM/MLP) | ~99.99 | – | 0.13 |
| Multiclass | RF | 99.982 | 99.992 | 0.092 |
| Multiclass | XGB | 99.982 | – | – |
| Multiclass | LGBM | 99.976 | – | – |
| Multiclass | LSTM | 99.966 | – | – |
| Multiclass | MLP | 99.932 | – | – |
Dimensionality reduction to 51 features resulted in at most a 0.1% reduction in F1, with approximately 2x faster inference and halved tree training time; DL models benefited as well, though with less pronounced epoch-time savings.
6. Quality Assessment, Generalization, and Practical Best Practices
GeNIS’ use of HERA for flow extraction corrects deficiencies noted in older datasets built with CICFlowMeter. By eliminating identifiers specific to a single network context, the dataset ensures that ML/DL models learn behavioral characteristics rather than spurious correlations with fixed addresses, a property crucial for generalization to unseen environments. High F1 scores and low false positive rates (FPR) on both binary and multiclass splits indicate strong separation between benign and malicious activity representative of modern threat profiles.
Tree ensemble models, particularly LightGBM, optimize accuracy–latency trade-offs, while neural networks offer comparable accuracy at the cost of 3–10x slower training and inference. A 16–20% feature reduction (e.g., 55 to 51) enables roughly 50% reduction in tree ensemble training time and 30% in inference time, with negligible impact on generalization. This suggests that streamlined feature sets are practical for deployment scenarios requiring low-latency detection.
Best practices, as derived from the reported usage and ablation studies, include:
- Leveraging the modular structure of GeNIS to select subsets tailored to specific services or attack profiles.
- Rigorous preprocessing—removal of context-dependent fields, careful encoding, and input standardization—is essential for robust cross-network deployment.
- Utilizing multiple, complementary feature selection strategies to focus model attention on behavioral signatures.
- Employing ensemble models as computationally efficient baselines, reserving DL approaches for scenarios where sequence modeling or payload analysis is indispensable.
- Assessing transferability by augmenting GeNIS-derived models with additional modern datasets and challenging them with adversarial variations.
7. Application Scenarios and Research Impact
The GeNIS dataset is positioned as a high-fidelity, contemporary benchmark for research and real-world development of AI-based NIDS, particularly in SME contexts lacking access to proprietary traffic logs (Silva et al., 11 Nov 2025). Its mixture of realistic benign and attack behaviors, modular design, and exclusion of non-generalizable fields makes it well-suited for benchmarking, model selection, and algorithmic ablation studies. The explicit protocol for training/testing splits, feature selection pipeline, and public description of model baselines enable reproducible experimentation and meaningful comparison across studies.
A plausible implication is that GeNIS serves both as a direct benchmarking tool and as a template for the construction of future datasets prioritizing behavioral realism and cross-environment generalization. Researchers are advised to use the dataset not only to establish performance baselines but also to probe the limits of generalization, transfer learning, and adversarial robustness using standardized, reproducible protocols.