CICIDS2017: Network Intrusion Dataset
- CICIDS2017 is a comprehensive, flow-based intrusion detection benchmark with 2.8 million records and 79–80 features covering benign traffic and 14 attack classes.
- Robust preprocessing including cleaning, normalization, and dimensionality reduction is essential to manage its noise, redundancy, and class imbalance.
- The dataset supports varied experimental setups with traditional ML and deep learning models, emphasizing real-world validation and interpretability.
The CICIDS2017 dataset is a large-scale, flow-based intrusion detection benchmark developed to address key limitations of legacy datasets and to provide a realistic, well-annotated corpus for evaluating both classical and contemporary network intrusion detection systems. It captures heterogeneous benign and attack traffic at the NetFlow level and supports both binary and fine-grained multi-class classification. Its widely documented characteristics include strong class imbalance, varied attack scenarios, and an extensive feature set derived from actual enterprise traffic.
1. Composition, Structure, and Label Taxonomy
CICIDS2017 comprises approximately 2.8 million labeled network flow records, each described by 79–80 features extracted from per-packet and per-flow metadata, including, but not limited to, duration, packet counts, byte totals, inter-arrival times, protocol flags, and statistical summaries. Each record is labeled as benign or assigned to one of up to 14 attack types (depending on preprocessing and aggregation). Canonical attack classes encompass:
- Brute-force FTP/SSH
- Denial-of-Service (DoS) and Distributed DoS (DDoS)
- Heartbleed
- Web-based exploits (e.g., SQL injection, XSS)
- Infiltration
- Botnet activity
- Port scans
The dataset is organized as multiple CSV files corresponding to separate capture days, each encoding specific scenarios with precisely timed attack intervals. Flows cover both internal and external endpoints, with bidirectional aggregations and metadata supporting both direction-agnostic and directional analyses (Mutalib et al., 12 Nov 2025, Rababah et al., 2020, Ahmim et al., 2018, Corsini et al., 2021, Jalalvand et al., 23 Jun 2025).
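Because the dataset ships as one CSV per capture day, a first step in most pipelines is loading and concatenating those files. The sketch below uses tiny in-memory stand-ins for two day files (the column names are illustrative, not the full 79–80-feature header); it also strips the leading whitespace that the published CSV headers are known to carry (e.g. " Label"):

```python
import io
import pandas as pd

# Hypothetical excerpts from two per-day capture CSVs; the real files are
# much wider, and their headers often carry leading spaces (e.g. " Label").
monday = io.StringIO(" Flow Duration, Total Fwd Packets, Label\n100,3,BENIGN\n250,7,BENIGN\n")
friday = io.StringIO(" Flow Duration, Total Fwd Packets, Label\n90,2,DDoS\n400,12,BENIGN\n")

frames = []
for day in (monday, friday):
    df = pd.read_csv(day)
    df.columns = df.columns.str.strip()  # normalize header whitespace
    frames.append(df)

# Concatenate all capture days into one flow table for downstream work.
flows = pd.concat(frames, ignore_index=True)
print(flows["Label"].value_counts().to_dict())
```

With the real files, the `io.StringIO` handles would simply be replaced by per-day file paths.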
2. Preprocessing and Feature Engineering Practices
Robust preprocessing is essential due to the dataset's scale, heterogeneity, and potential for noise or redundancy. Standard modern pipelines implement the following:
- Cleaning: All flows with missing, ‘Infinity’, or ‘NaN’ values are dropped to prevent training instabilities. Features constant across all records are removed to minimize redundant dimensions (Ahmim et al., 2018).
- Imputation: Where applied, missing numeric features are mean-imputed; categorical features are label-encoded or one-hot encoded according to downstream model requirements (Mutalib et al., 12 Nov 2025, Corsini et al., 2021, Jalalvand et al., 23 Jun 2025).
- Normalization: Continuous features are min-max normalized per feature,
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
an operation universally advocated to ensure stable model convergence and comparability of feature scales (Mutalib et al., 12 Nov 2025, Ahmim et al., 2018).
- Dimensionality Reduction/Feature Selection: Approaches include recursive feature elimination with Random Forest importance (yielding top-20 subsets), principal component analysis (retaining ≥95% variance via 6–12 principal components), and information gain filtering (thresholds as low as IG>0.4 resulting in 10–20 attributes) (Mutalib et al., 12 Nov 2025, Rababah et al., 2020, Jalalvand et al., 23 Jun 2025).
- Label Handling: Both binary (benign vs. attack/malicious) and multi-class labelings are supported, with some pipelines remapping attack flows to severity categories using external sources (e.g., CVSS/NIST) (Jalalvand et al., 23 Jun 2025).
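The cleaning, constant-feature removal, and normalization steps above can be sketched as follows on a toy flow table (the column names are assumptions, not the real CICIDS2017 header):

```python
import numpy as np
import pandas as pd

# Toy flow table standing in for CICIDS2017 features (assumed names).
df = pd.DataFrame({
    "Flow_Duration": [100.0, 250.0, np.inf, 400.0],
    "Total_Fwd_Packets": [3.0, 7.0, 2.0, np.nan],
    "Constant_Col": [1.0, 1.0, 1.0, 1.0],
    "Label": ["BENIGN", "BENIGN", "DDoS", "BENIGN"],
})

# 1) Cleaning: drop flows containing Infinity or NaN values.
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# 2) Remove features that are constant across all remaining records.
features = df.drop(columns="Label")
features = features.loc[:, features.nunique() > 1]

# 3) Min-max normalize each continuous feature to [0, 1].
normalized = (features - features.min()) / (features.max() - features.min())
print(normalized.round(2))
```

In practice the min/max statistics should be computed on the training split only and reused on the test split, to avoid leakage.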
3. Experimental Paradigms and Downstream Modeling
The dataset enables a diversity of experimental setups spanning both binary and multi-class network intrusion detection as well as alert prioritization.
- Train-Test Splits: Researchers commonly employ stratified 80/20 or random 57/43 splits, or scenario-based 70/30 splits per attack window, ensuring proportional representation of normal and attack flows across train/test folds (Mutalib et al., 12 Nov 2025, Rababah et al., 2020, Corsini et al., 2021, Ahmim et al., 2018).
- Class Imbalance Approaches: While oversampling methods such as SMOTE are rarely applied (Mutalib et al., 12 Nov 2025), the class imbalance is addressed via stratified sampling, label weighting in loss functions, and careful scenario curation.
- Model Classes: Architectures range from traditional ML (Random Forest, Decision Tree, k-NN, Naive Bayes, FURIA, JRip, Forest PA) to deep learning (feedforward neural networks, CNNs, LSTMs, stacking and ensemble paradigms), as well as reinforcement learning agents for alert prioritization (Ahmim et al., 2018, Rababah et al., 2020, Gueriani et al., 2024, Corsini et al., 2021, Mutalib et al., 12 Nov 2025, Jalalvand et al., 23 Jun 2025).
- Input Preparation: Pipelines uniformly reshape features to match model-specific interface requirements (e.g., (n, 45, 1) tensors for CNN-LSTM (Gueriani et al., 2024)), process categorical attributes accordingly, and pass principal components or reduced feature sets where applicable (Mutalib et al., 12 Nov 2025, Jalalvand et al., 23 Jun 2025).
- Scenario-Based Sequencing: For temporal models, flows are grouped by attack window, processed as sequence data suitable for LSTM or hybrid models (Corsini et al., 2021).
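A stratified 80/20 split with class weighting, as described above, can be sketched with scikit-learn on synthetic imbalanced data (a stand-in for the real flows, not any cited pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for CICIDS2017 flows (~10% attack class).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Stratified 80/20 split preserves the benign/attack ratio in both folds.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Label weighting in the loss counters the class imbalance.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
print(f"attack share train={y_tr.mean():.2f} test={y_te.mean():.2f}")
```

The same `stratify=y` argument applies unchanged to multi-class labels.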
4. Performance Metrics and Quantitative Results
Consistent evaluation is realized using standard metrics derived from confusion matrices:
- Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision: $P = \frac{TP}{TP + FP}$
- Recall: $R = \frac{TP}{TP + FN}$
- F1-score: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$
- False Positive Rate: $\text{FPR} = \frac{FP}{FP + TN}$
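These five metrics follow directly from the confusion-matrix counts; a minimal helper (the counts below are illustrative, not results from any cited paper):

```python
def ids_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics used in CICIDS2017 evaluations."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # detection rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)             # false alarms among benign flows
    return accuracy, precision, recall, f1, fpr

# Illustrative counts only, not reported results.
acc, prec, rec, f1, fpr = ids_metrics(tp=90, fp=10, tn=880, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3), round(fpr, 3))
```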
Reported results across methodologies include:
| Model/Approach | Accuracy | Precision | Recall | F1-score | FPR |
|---|---|---|---|---|---|
| RF-RFE + SHAP (Mutalib et al., 12 Nov 2025) | 0.9997 | 0.9384 | 0.8716 | 0.9038 | — |
| Hybrid Stacking (DT+RF) (Rababah et al., 2020) | 0.980 | 0.980 | 0.980 | 0.980 | 0.069 |
| Hierarchical IDS (Ahmim et al., 2018) | 0.96665 | — | 0.9448 | — | 0.01145 |
| LSTM (Seq) (Corsini et al., 2021) | — | — | — | 0.99669 | — |
| CNN–LSTM (unseen test) (Gueriani et al., 2024) | 0.9746 | 0.9717 | 0.9715 | 0.9709 | 0.0208 |
| L2DHF (RL-AI assist) (Jalalvand et al., 23 Jun 2025) | 0.996 (AP) | — | — | — | [reduction: 52–100%] |
In multi-class and adversarial scenarios, performance varies by attack rarity, with PortScan, Heartbleed, and Infiltration often achieving >99% detection rates, while stealthy classes such as Botnet or Web-XSS yield lower detection due to overlap and class imbalance (Ahmim et al., 2018, Rababah et al., 2020, Corsini et al., 2021).
5. Interpretability, Explainability, and Feature Contributions
Combining predictive accuracy with explainability is increasingly prioritized. Methods such as recursive feature elimination and SHapley Additive exPlanations achieve both model parsimony and interpretability:
- Recursive Feature Elimination (RFE) with Random Forests selects 20 of the ~80 features with no F1 loss; leading features include Init_Win_bytes_backward/forward, Avg_Bwd_Segment_Size, Flow_Duration, Average_Packet_Size (Mutalib et al., 12 Nov 2025).
- SHAP analysis quantifies the contribution of individual features to model decisions, with high values of Init_Win_bytes_backward and Flow_Duration strongly predicting benign flows, while anomalous segments indicate attack likelihood.
- Local explanations demonstrate that several features collectively influence per-flow verdicts, supporting transparent operator review (Mutalib et al., 12 Nov 2025).
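An RFE pass with Random Forest importances can be sketched as follows; the data is synthetic and the retained count (10 of 30) is arbitrary, standing in for the 20-of-80 selection reported above (the SHAP step is omitted here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the ~80 CICIDS2017 flow features.
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=8, random_state=0)

# Recursively eliminate the least important feature (by RF importance)
# until only the requested subset remains.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10)
selector.fit(X, y)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print(f"retained {len(kept)} of {X.shape[1]} features")
```

`selector.ranking_` additionally records the elimination order, which is useful when plotting F1 against subset size.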
6. Use Cases and Practical Considerations
CICIDS2017's broad utility is evidenced by diverse usage profiles:
- Generalization Testing: Models trained on newer or domain-specific datasets (e.g., CICIoT2023) are commonly validated on CICIDS2017 as an unseen test set to benchmark cross-domain generalization (Gueriani et al., 2024).
- Alert Prioritization: CICIDS2017 alerts are mapped to CVSS-graded severity bands in SOC overlays, and pipelines such as Learning to Defer with Human Feedback (L2DHF) use reinforcement learning to maximize correct prioritization while minimizing analyst workload (Jalalvand et al., 23 Jun 2025).
- Scenario-Driven and Streaming Evaluation: The scenario-based organization allows for precise emulation of attack campaigns, and the volume of flows enables simulated streaming/high-volume inference (Corsini et al., 2021, Jalalvand et al., 23 Jun 2025).
Pipeline recommendations include always removing malformed flows (those containing NaN or Infinity values), normalizing features, employing stratified splits, and reporting both overall and per-class detection rates.
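The per-class reporting recommended above maps directly onto per-class recall (i.e., the detection rate for each attack type); a small sketch with hypothetical labels:

```python
from sklearn.metrics import classification_report, recall_score

# Hypothetical multi-class predictions (labels stand in for attack classes).
y_true = ["BENIGN"] * 6 + ["DDoS"] * 3 + ["PortScan"]
y_pred = ["BENIGN"] * 5 + ["DDoS"] + ["DDoS", "DDoS", "BENIGN"] + ["PortScan"]

# Per-class recall is the per-class detection rate.
per_class_recall = recall_score(y_true, y_pred,
                                labels=["BENIGN", "DDoS", "PortScan"],
                                average=None)
print(classification_report(y_true, y_pred, zero_division=0))
```

Reporting only overall accuracy would hide exactly the rare-class failures that matter most on CICIDS2017.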
7. Strengths, Limitations, and Future Directions
- Strengths: CICIDS2017's annotated heterogeneity, feature richness, and temporal ground-truth for attacks underpin its dominance as a benchmark. High detection accuracies (often >98–99%) are consistently reproducible with both machine learning and deep learning models (Mutalib et al., 12 Nov 2025, Rababah et al., 2020, Ahmim et al., 2018, Corsini et al., 2021).
- Limitations: Severe class imbalance, nontrivial redundancy among features, and the absence of payload data constrain extremely rare attack detection and limit some forms of adversarial evaluation (Ahmim et al., 2018, Mutalib et al., 12 Nov 2025).
- Open Directions: Future work includes tuning feature selection and normalization to CICIDS2017-specific patterns, extending binary detectors into multi-class frameworks, incorporating attention/transformer architectures to capture long-term dependencies, and performing hardware-based real-time streaming evaluations (Gueriani et al., 2024, Ahmim et al., 2018).
A plausible implication is that hybrid or ensemble architectures—integrating static and sequential deep learning, interpretable ML, and human-in-the-loop mechanisms—will further advance the practical and theoretical value derived from CICIDS2017 as a central benchmark in network intrusion detection research.