
GeNIS Dataset: Benchmark for NIDS

Updated 13 November 2025
  • GeNIS Dataset is a modular, flow-based benchmark that simulates realistic network traffic with both benign and multi-stage attack campaigns.
  • The dataset applies rigorous preprocessing and feature selection, ensuring robust evaluation of machine learning and deep learning models for network intrusion detection.
  • Baseline models, including Random Forest, XGBoost, and LightGBM, achieved near-perfect F1 scores, demonstrating the dataset’s practical impact in modern cybersecurity research.

The GeNIS dataset is a modular, flow-based benchmark designed to support artificial intelligence-driven network intrusion detection systems (NIDS) in small-to-medium enterprise (SME) environments. It was specifically created to address the lack of diverse, recent, and behaviorally rich network datasets necessary for reliable evaluation and training of machine learning (ML) and deep learning (DL) models in cybersecurity contexts (Silva et al., 11 Nov 2025). GeNIS simulates realistic user and administrator activity as well as eight multi-stage attack campaigns, providing comprehensive time-based and quantity-based behavioral features to enable robust cyberattack classification.

1. Collection Methodology and Dataset Composition

GeNIS was constructed by scripting “user” and “admin” behaviors over working weekdays, alongside generation of idle background traffic on weekends. Two Kali Linux attacker machines, placed in separate subnets, executed multi-stage attack scenarios including reconnaissance (e.g., port scanning, banner grabbing), bruteforce authentication attempts (SSH, FTP, Active Directory), and diverse denial-of-service attacks (SYN floods, HTTP floods, DNS amplification). This design ensures that both legitimate and malicious activity are rooted in a realistic operational context.

Raw packets were collected and aggregated into flows using the HERA flow exporter over 60-second intervals, yielding an initial set of 125 features per flow (3 ground-truth labels and 122 behavioral/contextual fields). Post-processing included removal of empty data, exclusion of topology-specific identifiers (such as IP addresses, MAC addresses, and VLAN tags), and one-hot encoding of the “State,” “Flags,” and “Protocol” fields for categorical coverage. The final feature set per flow comprises 87 numeric fields. The data splits are as follows:

Split     Training Flows    Testing Flows
Binary    294,844           73,712
Multi     294,844           73,712

Attack–class distributions for multiclass tasks are dominated by Denial-of-Service flows, with smaller but substantial representation for reconnaissance and bruteforce campaigns.
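As an illustration of the flow-level representation described above, the sketch below groups raw packet records into 60-second flows keyed by their 5-tuple and derives a few aggregate fields. It is a simplified stand-in for the HERA exporter (whose actual implementation is not reproduced here); real exporters typically also merge both directions of a connection into a single flow.

```python
from collections import defaultdict

# Hypothetical packet records: (timestamp_s, src_ip, dst_ip, src_port, dst_port, proto, size_bytes)
packets = [
    (0.00, "10.0.0.5", "10.0.1.9", 51514, 80, "TCP", 74),
    (0.12, "10.0.0.5", "10.0.1.9", 51514, 80, "TCP", 1500),
    (61.30, "10.0.0.5", "10.0.1.9", 51514, 80, "TCP", 66),
]

WINDOW = 60.0  # seconds, matching the 60-second aggregation interval

def flow_key(pkt):
    """Key a packet by its 5-tuple and the 60-second window it falls into."""
    ts, src, dst, sport, dport, proto, _ = pkt
    return (src, dst, sport, dport, proto, int(ts // WINDOW))

flows = defaultdict(list)
for pkt in packets:
    flows[flow_key(pkt)].append(pkt)

for key, pkts in flows.items():
    times = [p[0] for p in pkts]
    sizes = [p[-1] for p in pkts]
    print(key, {"TotPkts": len(pkts), "Dur": max(times) - min(times), "TotBytes": sum(sizes)})
```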

2. Feature Taxonomy and Definitions

Behavioral features in GeNIS are organized into five groups, with the dataset emphasizing time-based and quantity-based metrics for flow characterization. Each flow is modeled as a sequence of $N$ packets $p_1, \ldots, p_N$, where packet $p_i$ arrives at timestamp $t_i$, has size $b_i$ bytes, and travels in direction $dir_i \in \{src \rightarrow dst,\ dst \rightarrow src\}$. A short computational sketch of representative features from both groups follows the two lists below.

Time-Based Features (29 extracted):

  • Flow duration: $Dur = t_N - t_1$
  • Packet inter-arrival statistics:
    • $\min_{2 \leq i \leq N}(t_i - t_{i-1})$, $\max_{2 \leq i \leq N}(t_i - t_{i-1})$
    • Mean and sum of inter-packet intervals
  • Runtime vs. idle time: $RunTime = \sum_{i:\,\Delta t_i \leq \tau} \Delta t_i$ (for inactivity threshold $\tau$)
  • Start time (timestamp of $p_1$)

Quantity-Based Features (38 extracted):

  • Packet and byte counts per direction:
    • $TotPkts = N$; $SrcPkts$, $DstPkts$
    • $TotBytes = \sum_i b_i$; $SrcBytes$, $DstBytes$
  • TCP-level and application-layer volumes:
    • $SrcTCPBase = \sum_i hdrTCP_i$ (source-direction packets), $DstTCPBase = \sum_i hdrTCP_i$ (destination-direction packets)
    • $SAppBytes$, $DAppBytes$
  • Initial TCP window sizes: $SrcWin$, $DstWin$
  • Composite rates:
    • Byte rate: $Rate = TotBytes / Dur$
    • Packet rate: $Load = TotPkts / Dur$
    • Directional rates: $SrcRate$, $DstRate$, $SrcLoad$, $DstLoad$
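The following sketch computes a handful of these features for a single flow, given Python lists of packet timestamps, sizes, and directions. Variable names mirror the feature names above; the code is illustrative rather than the dataset's actual extraction logic.

```python
# Illustrative per-flow feature computation for one flow (toy values).
t = [0.00, 0.05, 0.40, 0.41, 2.10]          # packet timestamps t_1..t_N (seconds)
b = [74, 1500, 1500, 66, 590]               # packet sizes b_1..b_N (bytes)
d = ["src", "dst", "dst", "src", "dst"]     # packet directions

N = len(t)
iat = [t[i] - t[i - 1] for i in range(1, N)]   # inter-arrival times

features = {
    "Dur": t[-1] - t[0],
    "MinIAT": min(iat),
    "MaxIAT": max(iat),
    "MeanIAT": sum(iat) / len(iat),
    "TotPkts": N,
    "SrcPkts": sum(1 for x in d if x == "src"),
    "DstPkts": sum(1 for x in d if x == "dst"),
    "TotBytes": sum(b),
    "SrcBytes": sum(bi for bi, di in zip(b, d) if di == "src"),
    "DstBytes": sum(bi for bi, di in zip(b, d) if di == "dst"),
}
features["Rate"] = features["TotBytes"] / features["Dur"]   # byte rate
features["Load"] = features["TotPkts"] / features["Dur"]    # packet rate
print(features)
```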

This structure facilitates detection of behavioral anomalies that may signal sophisticated attacks, and by removing topology-dependent features, supports dataset transferability and generalization.

3. Attack Classes, Labeling, and Split Distributions

GeNIS supports both binary (Benign vs Malicious) and multiclass (Benign, DoS, Reconnaissance, Bruteforce) classification tasks. Labeling was performed at the flow level, using the operational context of each flow to determine the ground truth category.

Class        Training Flows    Testing Flows    Train Distribution (%)
DoS          236,512           59,128           80.22
Recon        22,186            5,547            7.52
Bruteforce   14,426            3,607            4.89
Benign       21,720            5,430            7.37

In binary mode, “Malicious” encompasses any of the three attack categories. This class balance reflects real-world attack patterns, notably the predominance of DoS-like activity, while maintaining sufficient representation in other categories for robust multiclass training and benchmarking.
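A minimal sketch of how the multiclass ground truth collapses into the binary labels, assuming the flows are loaded into a pandas DataFrame with a `Label` column holding the class names above (the column name is an assumption, not necessarily the dataset's field name):

```python
import pandas as pd

# Hypothetical flow table with a "Label" column holding the multiclass ground truth.
df = pd.DataFrame({"Label": ["Benign", "DoS", "Recon", "Bruteforce", "DoS"]})

# Binary task: anything that is not Benign is Malicious.
df["BinaryLabel"] = df["Label"].apply(lambda c: "Benign" if c == "Benign" else "Malicious")
print(df["BinaryLabel"].value_counts())
```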

4. Preprocessing and Feature Selection Protocols

The dataset’s authors implemented a rigorous preprocessing pipeline (a minimal code sketch follows the list):

  • Automated removal of empty rows/columns.
  • Exclusion of topology/contextual fields to prevent overfitting and hardcoded IP-based shortcuts.
  • One-hot encoding for categorical variables and feature standardization (zero mean, unit variance).
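The sketch below applies these three steps with pandas and scikit-learn. Column names such as `Src IP` or `State` are assumptions about the raw export and may not match the dataset's exact field names; in practice, label columns are separated out first and the scaler is fit on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three preprocessing steps described above to a raw flow table."""
    # 1. Remove empty rows and columns.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

    # 2. Exclude topology-specific identifiers (assumed column names).
    topology_cols = ["Src IP", "Dst IP", "Src MAC", "Dst MAC", "VLAN"]
    df = df.drop(columns=[c for c in topology_cols if c in df.columns])

    # 3. One-hot encode the categorical fields, then standardize the numeric features
    #    (ground-truth label columns would be set aside before this step).
    categorical = [c for c in ("State", "Flags", "Protocol") if c in df.columns]
    df = pd.get_dummies(df, columns=categorical)
    numeric = df.select_dtypes("number").columns
    df[numeric] = StandardScaler().fit_transform(df[numeric])
    return df
```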

For feature selection, five statistical techniques were applied independently and their normalized importance scores aggregated:

  1. Information Gain: $IG(X_f) = H(Y) - H(Y \mid X_f)$, quantifying the entropy reduction from knowing feature $X_f$.
  2. Chi-Squared Test: $\chi^2_f = \sum_{i,c} (O_{i,c} - E_{i,c})^2 / E_{i,c}$ over bins $i$ and classes $c$.
  3. Recursive Feature Elimination (RFE): iterative backward elimination using a tree-ensemble base estimator.
  4. Mean Absolute Deviation (MAD): $MAD_f = \frac{1}{N}\sum_{i=1}^N |x_{i,f} - \mu_f|$.
  5. Dispersion Ratio (DR): $DR_f = \sqrt{\eta_f^2}$, where $\eta_f^2$ quantifies between-class variance.

The union of the top 16 features by cumulative score (approximately 70% cumulative importance) formed the behavior-focused final subsets for both classification regimes.
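A hedged sketch of the aggregation idea: each technique's importance scores are normalized, summed per feature, and the top-ranked features retained. The function and cutoff are illustrative; the authors' exact normalization and thresholding may differ.

```python
import numpy as np

def aggregate_rankings(score_table, top_k=16):
    """Normalize each technique's scores to sum to one, sum them per feature,
    and return the indices of the top_k features by cumulative importance."""
    normalized = np.array([s / s.sum() for s in score_table.values()])
    combined = normalized.sum(axis=0)          # cumulative normalized importance per feature
    ranked = np.argsort(combined)[::-1]        # most important first
    return ranked[:top_k], combined

# Toy example with 5 features and two techniques (GeNIS aggregates five techniques over 87 features).
toy_scores = {
    "info_gain": np.array([0.40, 0.10, 0.30, 0.05, 0.15]),
    "chi2":      np.array([10.0, 2.0, 8.0, 1.0, 4.0]),
}
top_features, combined = aggregate_rankings(toy_scores, top_k=3)
print(top_features, combined)
```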

5. Baseline Classification Results and Model Architectures

Five classifiers were validated: three tree ensemble models (Random Forest, XGBoost, LightGBM) and two DL architectures (LSTM, MLP). Representative configurations:

  • Random Forest: 100 trees, max depth 16, min_samples_leaf = 1
  • XGBoost: 100 estimators, learning rate $\eta = 0.2$, max_depth $\in [4, 16]$, colsample_bytree $\in [0.8, 0.9]$
  • LightGBM: 100 trees, $\eta = 0.05$, num_leaves = 15, feature_fraction = 0.8
  • LSTM: 64/128 hidden units, dense output layer, dropout = 0.2, Adam optimizer (lr = 0.001), batch size 32, up to 30 epochs
  • MLP: two hidden dense layers (64/128 and 32/64 units), trained identically to the LSTM

All inputs were standardized.
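The tree-ensemble configurations above map directly onto standard library hyperparameters; the sketch below approximates them with scikit-learn, XGBoost, and LightGBM. Single values are picked from the quoted ranges, so this may not match the authors' tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Configurations approximating the reported baselines (values from the list above).
rf = RandomForestClassifier(n_estimators=100, max_depth=16, min_samples_leaf=1)
xgb = XGBClassifier(n_estimators=100, learning_rate=0.2, max_depth=16, colsample_bytree=0.8)
lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.05, num_leaves=15,
                      colsample_bytree=0.8)  # colsample_bytree is LightGBM's alias for feature_fraction

# Training on the (already standardized) GeNIS feature matrix would look like:
# rf.fit(X_train, y_train); preds = rf.predict(X_test)
```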

Classification Performance Metrics (best on all 55 features):

Task         Model     F1 (%)    Accuracy (%)    FPR (%)
Binary       RF/XGB    99.995    99.99           0.13
             LGBM      99.99     –               –
             DL        ~99.99    –               0.13
Multiclass   RF        99.982    99.992          0.092
             XGB       99.982    –               –
             LGBM      99.976    –               –
             LSTM      99.966    –               –
             MLP       99.932    –               –

Dimensionality reduction to 51 features resulted in a <0.1% reduction in $F_1$, with approximately 2x faster inference and roughly halved tree training time; DL models benefited as well, though with less pronounced per-epoch time savings.
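For reference, the reported F1, accuracy, and false positive rate can be derived from a binary confusion matrix as sketched below with toy predictions; the multiclass results follow the analogous scikit-learn definitions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy binary predictions (1 = Malicious, 0 = Benign) to illustrate how the
# reported metrics are computed; real evaluation uses the GeNIS test split.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("F1 (%):", 100 * f1_score(y_true, y_pred))
print("Accuracy (%):", 100 * accuracy_score(y_true, y_pred))
print("FPR (%):", 100 * fp / (fp + tn))  # false positives among all benign flows
```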

6. Quality Assessment, Generalization, and Practical Best Practices

GeNIS’ use of HERA for flow extraction corrects deficiencies noted in older datasets built with CICFlowMeter. By eliminating identifiers specific to a single network context, the dataset ensures that ML/DL models learn behavioral characteristics rather than spurious correlations with fixed addresses, a crucial property for generalization to unseen environments. High $F_1$ scores and low false positive rates (FPR) on both the binary and multiclass splits reflect strong differentiation between benign activity and malicious activity representative of modern threat profiles.

Tree ensemble models, particularly LightGBM, optimize accuracy–latency trade-offs, while neural networks offer comparable accuracy at the cost of 3–10x slower training and inference. A 16–20% feature reduction (e.g., 55 to 51) enables roughly 50% reduction in tree ensemble training time and 30% in inference time, with negligible impact on generalization. This suggests that streamlined feature sets are practical for deployment scenarios requiring low-latency detection.

Best practices, as derived from the reported usage and ablation studies, include:

  1. Leveraging the modular structure of GeNIS to select subsets tailored to specific services or attack profiles.
  2. Applying rigorous preprocessing (removal of context-dependent fields, careful encoding, and input standardization) to enable robust cross-network deployment.
  3. Utilizing multiple, complementary feature selection strategies to focus model attention on behavioral signatures.
  4. Employing ensemble models as computationally efficient baselines, reserving DL approaches for scenarios where sequence modeling or payload analysis is indispensable.
  5. Assessing transferability by augmenting GeNIS-derived models with additional modern datasets and challenging them with adversarial variations.

7. Application Scenarios and Research Impact

The GeNIS dataset is positioned as a high-fidelity, contemporary benchmark for research and real-world development of AI-based NIDS, particularly in SME contexts lacking access to proprietary traffic logs (Silva et al., 11 Nov 2025). Its mixture of realistic benign and attack behaviors, modular design, and exclusion of non-generalizable fields makes it well-suited for benchmarking, model selection, and algorithmic ablation studies. The explicit protocol for training/testing splits, feature selection pipeline, and public description of model baselines enable reproducible experimentation and meaningful comparison across studies.

A plausible implication is that GeNIS serves both as a direct benchmarking tool and as a template for the construction of future datasets prioritizing behavioral realism and cross-environment generalization. Researchers are advised to use the dataset not only to establish performance baselines but also to probe the limits of generalization, transfer learning, and adversarial robustness using standardized, reproducible protocols.
