GeNIS Dataset: Benchmark for NIDS
- GeNIS is a modular, flow-based benchmark that simulates realistic network traffic with both benign activity and multi-stage attack campaigns.
- The dataset applies rigorous preprocessing and feature selection, ensuring robust evaluation of machine learning and deep learning models for network intrusion detection.
- Baseline models, including Random Forest, XGBoost, and LightGBM, achieved near-perfect F1 scores, demonstrating the dataset’s practical impact in modern cybersecurity research.
The GeNIS dataset is a modular, flow-based benchmark designed to support artificial intelligence-driven network intrusion detection systems (NIDS) in small-to-medium enterprise (SME) environments. It was specifically created to address the lack of diverse, recent, and behaviorally rich network datasets necessary for reliable evaluation and training of ML and deep learning (DL) models in cybersecurity contexts (Silva et al., 11 Nov 2025). GeNIS simulates realistic user and administrator activity as well as eight multi-stage attack campaigns, providing comprehensive time-based and quantity-based behavioral features to enable robust cyberattack classification.
1. Collection Methodology and Dataset Composition
GeNIS was constructed by scripting “user” and “admin” behaviors over working weekdays, alongside generation of idle background traffic on weekends. Two Kali Linux attacker machines, placed in separate subnets, executed multi-stage attack scenarios including reconnaissance (e.g., port scanning, banner grabbing), bruteforce authentication attempts (SSH, FTP, Active Directory), and diverse denial-of-service (SYN floods, HTTP floods, DNS amplification). This design ensures that both legitimate and malicious activity are rooted in realistic operational context.
Raw packets were collected and aggregated into flows using the HERA flow exporter over 60-second intervals, yielding an initial set of 125 features per flow (3 ground-truth labels and 122 behavioral/contextual fields). Post-processing included removal of empty data, exclusion of topology-specific identifiers (such as IP addresses, MAC addresses, and VLAN tags), and one-hot encoding of the “State,” “Flags,” and “Protocol” fields for categorical coverage. The final feature set per flow comprises 87 numeric fields. The data splits are as follows:
| Split | Training Flows | Testing Flows |
|---|---|---|
| Binary | 294,844 | 73,712 |
| Multi | 294,844 | 73,712 |
Attack–class distributions for multiclass tasks are dominated by Denial-of-Service flows, with smaller but substantial representation for reconnaissance and bruteforce campaigns.
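HERA's internal flow keying is not detailed here; a minimal sketch of 60-second flow aggregation, assuming a conventional 5-tuple key and a hypothetical packet schema (`ts`, `src`, `dst`, `sport`, `dport`, `proto`), might look like:

```python
from collections import defaultdict

WINDOW = 60.0  # seconds, matching the 60-second export interval described above

def aggregate_flows(packets):
    """Group raw packets into flows keyed by the 5-tuple plus a
    60-second time window. The packet dict schema is a hypothetical
    stand-in for the exporter's internal representation."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"],
               pkt["proto"], int(pkt["ts"] // WINDOW))
        flows[key].append(pkt)
    return dict(flows)
```

Each resulting flow is then summarized into the behavioral feature vector described in the next section.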
2. Feature Taxonomy and Definitions
Behavioral features in GeNIS are organized into five groups, with the dataset emphasizing time-based and quantity-based metrics for flow characterization. Each flow is modeled as a sequence of packets $p_1, \dots, p_n$, where packet $p_i$ arrives at timestamp $t_i$, carries $s_i$ bytes, and has direction $d_i \in \{\text{src}\to\text{dst},\ \text{dst}\to\text{src}\}$.
Time-Based Features (29 extracted):
- Flow duration: $D = t_n - t_1$
- Packet inter-arrival statistics:
  - $\mathrm{IAT}_i = t_{i+1} - t_i$, with $\min_i \mathrm{IAT}_i$ and $\max_i \mathrm{IAT}_i$
  - Mean and sum of inter-packet intervals: $\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{IAT}_i$ and $\sum_{i=1}^{n-1} \mathrm{IAT}_i$
- Runtime vs idle time: $T_{\text{idle}} = \sum_{i:\, \mathrm{IAT}_i > \tau} \mathrm{IAT}_i$ (for inactivity threshold $\tau$), with runtime $D - T_{\text{idle}}$
- Start time (timestamp $t_1$ of $p_1$)
Quantity-Based Features (38 extracted):
- Packet and byte counts by direction:
  - $N = n$; $N_{\text{fwd}}$, $N_{\text{bwd}}$ (packets per direction)
  - $B = \sum_{i=1}^{n} s_i$; $B_{\text{fwd}}$, $B_{\text{bwd}}$ (bytes per direction)
- TCP-level and application-layer volumes:
  - Header bytes per direction: $H_{\text{fwd}}$, $H_{\text{bwd}}$
  - Payload bytes per direction: $P_{\text{fwd}}$, $P_{\text{bwd}}$
- Initial TCP window sizes: $W_{\text{fwd}}$, $W_{\text{bwd}}$
- Composite rates:
  - Byte rate: $B / D$
  - Packet rate: $N / D$
  - Directional rates: $B_{\text{fwd}}/D$, $B_{\text{bwd}}/D$, $N_{\text{fwd}}/D$, $N_{\text{bwd}}/D$
This structure facilitates detection of behavioral anomalies that may signal sophisticated attacks, and by removing topology-dependent features, supports dataset transferability and generalization.
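Under the packet model above, per-flow feature extraction can be sketched as follows (the dictionary keys and the default inactivity threshold are illustrative, not the exact GeNIS field names):

```python
def flow_features(timestamps, sizes, directions, idle_threshold=1.0):
    """Compute a subset of GeNIS-style time- and quantity-based
    features for one flow. `directions` holds 'fwd'/'bwd' per packet;
    names and the idle threshold are illustrative."""
    duration = timestamps[-1] - timestamps[0]
    iats = [b - a for a, b in zip(timestamps, timestamps[1:])]
    idle = sum(gap for gap in iats if gap > idle_threshold)
    n_fwd = sum(1 for d in directions if d == "fwd")
    b_fwd = sum(s for s, d in zip(sizes, directions) if d == "fwd")
    total_bytes = sum(sizes)
    return {
        "duration": duration,
        "iat_min": min(iats) if iats else 0.0,
        "iat_max": max(iats) if iats else 0.0,
        "iat_mean": sum(iats) / len(iats) if iats else 0.0,
        "idle_time": idle,
        "runtime": duration - idle,
        "pkts_fwd": n_fwd,
        "pkts_bwd": len(directions) - n_fwd,
        "bytes_fwd": b_fwd,
        "bytes_bwd": total_bytes - b_fwd,
        "byte_rate": total_bytes / duration if duration > 0 else 0.0,
        "pkt_rate": len(timestamps) / duration if duration > 0 else 0.0,
    }
```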
3. Attack Classes, Labeling, and Split Distributions
GeNIS supports both binary (Benign vs Malicious) and multiclass (Benign, DoS, Reconnaissance, Bruteforce) classification tasks. Labeling was performed at the flow level, using the operational context of each flow to determine the ground truth category.
| Class | Training Flows | Testing Flows | Distribution (%) (Train) |
|---|---|---|---|
| DoS | 236,512 | 59,128 | 80.22 |
| Recon | 22,186 | 5,547 | 7.52 |
| Bruteforce | 14,426 | 3,607 | 4.89 |
| Benign | 21,720 | 5,430 | 7.37 |
In binary mode, “Malicious” encompasses any of the three attack categories. This class balance reflects real-world attack patterns, notably the predominance of DoS-like activity, while maintaining sufficient representation in other categories for robust multiclass training and benchmarking.
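The binary labels and the table's percentages follow mechanically from the multiclass counts; a small sketch (class names as in the table above):

```python
ATTACK_CLASSES = {"DoS", "Recon", "Bruteforce"}

def to_binary(label):
    """Collapse a GeNIS multiclass label into the binary task:
    any of the three attack classes maps to 'Malicious'."""
    return "Malicious" if label in ATTACK_CLASSES else "Benign"

# Training-split counts from the table above; the distribution
# percentages follow directly from these totals.
TRAIN_COUNTS = {"DoS": 236_512, "Recon": 22_186,
                "Bruteforce": 14_426, "Benign": 21_720}
```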
4. Preprocessing and Feature Selection Protocols
The dataset’s authors implemented a rigorous preprocessing pipeline:
- Automated removal of empty rows/columns.
- Exclusion of topology/contextual fields to prevent overfitting and hardcoded IP-based shortcuts.
- One-hot encoding for categorical variables and feature standardization (zero mean, unit variance).
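The encoding and standardization steps can be sketched with dependency-free stand-ins for the usual library transformers (function names are illustrative):

```python
def standardize(column):
    """Zero-mean, unit-variance scaling for one numeric column,
    a from-scratch stand-in for a standard scaler."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    std = var ** 0.5 or 1.0  # guard against constant columns
    return [(x - mean) / std for x in column]

def one_hot(values, categories=None):
    """One-hot encode a categorical column such as 'Protocol',
    'State', or 'Flags'."""
    cats = sorted(categories or set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]
```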
For feature selection, five statistical techniques were independently computed and their normalized importance scores aggregated:
- Information Gain: $IG(Y; X) = H(Y) - H(Y \mid X)$, quantifying the entropy reduction from knowing feature $X$.
- Chi-Squared Test: $\chi^2 = \sum_{b}\sum_{c} \frac{(O_{bc} - E_{bc})^2}{E_{bc}}$ over bins $b$ and classes $c$.
- Recursive Feature Elimination (RFE): Iterative backward elimination using a tree-ensemble base estimator.
- Mean Absolute Deviation (MAD): $\mathrm{MAD}(X) = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$.
- Dispersion Ratio (DR): $DR(X) = \sigma_B^2 / \sigma_W^2$, where $\sigma_B^2$ quantifies between-class variance relative to within-class variance $\sigma_W^2$.
The union of the top 16 features by cumulative score (approximately 70% cumulative importance) formed the behavior-focused final subsets for both classification regimes.
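The aggregation step can be sketched as follows, assuming each selection method yields a per-feature score dict (the min-max normalization scheme and the `k` cutoff are illustrative):

```python
def aggregate_scores(score_tables, k=16):
    """Combine per-method importance scores: min-max normalize each
    method's scores to [0, 1], sum across methods, and keep the top-k
    features by cumulative score."""
    totals = {}
    for scores in score_tables:  # one dict per selection method
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against constant scores
        for feat, s in scores.items():
            totals[feat] = totals.get(feat, 0.0) + (s - lo) / span
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:k]
```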
5. Baseline Classification Results and Model Architectures
Five classifiers were validated: three tree ensemble models (Random Forest, XGBoost, LightGBM) and two DL architectures (LSTM, MLP). Representative configurations:
- Random Forest: 100 trees, max_depth = 16, min_samples_leaf = 1
- XGBoost: 100 estimators, max_depth ∈ [4, 16], colsample_bytree ∈ [0.8, 0.9]
- LightGBM: 100 trees, num_leaves = 15, feature_fraction = 0.8
- LSTM: 64/128 hidden units, dense output layer, dropout = 0.2, Adam optimizer, batch size 32, up to 30 epochs
- MLP: two hidden dense layers (64/128 and 32/64 units), trained identically to the LSTM
All inputs were standardized.
Classification Performance Metrics (best on all 55 features):
| Task | Model | F1 (%) | Accuracy (%) | FPR (%) |
|---|---|---|---|---|
| Binary | RF/XGB | 99.995 | 99.99 | 0.13 |
| Binary | LGBM | 99.99 | – | – |
| Binary | DL (LSTM/MLP) | ~99.99 | – | 0.13 |
| Multiclass | RF | 99.982 | 99.992 | 0.092 |
| Multiclass | XGB | 99.982 | – | – |
| Multiclass | LGBM | 99.976 | – | – |
| Multiclass | LSTM | 99.966 | – | – |
| Multiclass | MLP | 99.932 | – | – |
Dimensionality reduction to 51 features resulted in at most a 0.1% reduction in F1, with approximately 2x faster inference and halved tree training time; DL models benefited as well, though with less pronounced epoch-time savings.
6. Quality Assessment, Generalization, and Practical Best Practices
GeNIS’ use of HERA for flow extraction corrects deficiencies noted in older datasets built with CICFlowMeter. By eliminating identifiers specific to a single network context, the dataset ensures that ML/DL models learn behavioral characteristics rather than spurious correlations with fixed addresses, a property crucial for generalization to unseen environments. High F1 scores and low false positive rates (FPR) on both binary and multiclass splits indicate strong separation between benign and malicious activity representative of modern threat profiles.
Tree ensemble models, particularly LightGBM, optimize accuracy–latency trade-offs, while neural networks offer comparable accuracy at the cost of 3–10x slower training and inference. A 16–20% feature reduction (e.g., 55 to 51) enables roughly 50% reduction in tree ensemble training time and 30% in inference time, with negligible impact on generalization. This suggests that streamlined feature sets are practical for deployment scenarios requiring low-latency detection.
Best practices, as derived from the reported usage and ablation studies, include:
- Leveraging the modular structure of GeNIS to select subsets tailored to specific services or attack profiles.
- Rigorous preprocessing—removal of context-dependent fields, careful encoding, and input standardization—is essential for robust cross-network deployment.
- Utilizing multiple, complementary feature selection strategies to focus model attention on behavioral signatures.
- Employing ensemble models as computationally efficient baselines, reserving DL approaches for scenarios where sequence modeling or payload analysis is indispensable.
- Assessing transferability by augmenting GeNIS-derived models with additional modern datasets and challenging them with adversarial variations.
7. Application Scenarios and Research Impact
The GeNIS dataset is positioned as a high-fidelity, contemporary benchmark for research and real-world development of AI-based NIDS, particularly in SME contexts lacking access to proprietary traffic logs (Silva et al., 11 Nov 2025). Its mixture of realistic benign and attack behaviors, modular design, and exclusion of non-generalizable fields makes it well-suited for benchmarking, model selection, and algorithmic ablation studies. The explicit protocol for training/testing splits, feature selection pipeline, and public description of model baselines enable reproducible experimentation and meaningful comparison across studies.
A plausible implication is that GeNIS serves both as a direct benchmarking tool and as a template for the construction of future datasets prioritizing behavioral realism and cross-environment generalization. Researchers are advised to use the dataset not only to establish performance baselines but also to probe the limits of generalization, transfer learning, and adversarial robustness using standardized, reproducible protocols.