
CSE-CIC-IDS2018 Dataset

Updated 29 December 2025
  • The CSE-CIC-IDS2018 dataset is a comprehensive network intrusion detection benchmark offering diverse, labeled traffic for ML and DL evaluation.
  • It provides detailed flow records and annotated PCAPs captured from a synthetic enterprise network emulation on Amazon AWS.
  • The dataset supports rigorous model testing, addressing class imbalance and feature redundancy for robust anomaly and attack detection.

The CSE-CIC-IDS2018 dataset is a large-scale, high-fidelity network intrusion detection benchmark designed to support ML research into attack detection and classification. Jointly developed by the Canadian Institute for Cybersecurity (CIC) and Canada’s Communications Security Establishment (CSE), the dataset features extensive annotated PCAPs and extracted flow records from a cloud-based enterprise emulation. Its primary objective is to provide realistic, diverse, and labeled traffic encompassing a broad range of modern attacks, specifically targeting the rigorous evaluation of ML- and DL-based intrusion detection systems.

1. Architecture and Data Collection Paradigm

The CSE-CIC-IDS2018 dataset was constructed using a synthetic enterprise network topology deployed on Amazon AWS, comprising approximately 420 virtual clients, around 30 servers (hosting services such as SSH, FTP, HTTP/S, SMTP, DNS), and 50 IoT-style devices running background activity (Tripathy et al., 3 Jun 2025, Atefinia et al., 2022). The testbed was split into multiple subnets representing functional departments. Attack traffic was generated from 50 dedicated attacker VMs running Kali Linux distributions. Benign traffic was produced by replaying session traces and synthetic user activity, while attacks were orchestrated according to a pre-defined schedule, covering multiple days in February-March 2018 (Cantone et al., 2024, Soltani et al., 2020, Atefinia et al., 2022).

Traffic on the network was captured in raw PCAP format at core switches. Post-capture, network packet streams were processed by CICFlowMeter-v3, producing bidirectional flow records per TCP/UDP 5-tuple. Each attack scenario was captured in isolation alongside regular traffic, with some days dedicated to rare attacks (e.g., SQL Injection, Infiltration) (Atefinia et al., 2022, Menssouri et al., 9 Feb 2025).
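The aggregation of packets into bidirectional flow records keyed by a canonical 5-tuple can be illustrated with a minimal sketch. This is not CICFlowMeter's actual implementation; the packet tuple layout and the two counters are simplified assumptions.

```python
from collections import defaultdict

def canonical_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Order the two endpoints so that both directions of a
    TCP/UDP conversation map onto the same flow key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def aggregate_flows(packets):
    """packets: iterable of (src_ip, src_port, dst_ip, dst_port, proto, nbytes).
    Returns per-flow packet and byte counts (bidirectional)."""
    flows = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for src_ip, sp, dst_ip, dp, proto, nbytes in packets:
        f = flows[canonical_key(src_ip, sp, dst_ip, dp, proto)]
        f["pkts"] += 1
        f["bytes"] += nbytes
    return flows
```

A real extractor additionally tracks direction-specific statistics, inter-arrival times, and flow termination (FIN/RST or idle timeout), which this sketch omits.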

2. Attack Scenarios and Dataset Composition

The annotation schema encompasses up to 15 classes: 14 labeled attack types (e.g., FTP-BruteForce, DoS-Hulk, DDoS-LOIC-HTTP, Botnet, WebAttack-XSS, SQL-Injection) and benign traffic (Cantone et al., 2024, Chua et al., 2022). Per-class prevalence is highly imbalanced; benign flows account for 77–83% depending on the extraction procedure, with certain attacks (WebAttack-SQL-Injection, WebAttack-XSS) observed at <0.001% (Tripathy et al., 3 Jun 2025, Cantone et al., 2024, Chua et al., 2022).

Category             Example Count   Proportion (%)
Benign               13.39M          83.0
DoS-Hulk             0.46M           2.9
DDoS-HOIC            0.69M           4.3
Botnet               0.29M           1.8
FTP-BruteForce       0.19M           1.2
Infiltration         0.16M           1.0
WebAttack-XSS        0.0002M         0.001
WebAttack-SQL-Inj    0.00009M        0.0005
...                  ...             ...

Post-processed flow record totals vary by extraction method, ranging from ~1.25M (reviewed subset) (Tripathy et al., 3 Jun 2025) to ~16M (full CSV aggregation) (Cantone et al., 2024). The class labels are derived via attack script scheduling and IP/port matching, with additional manual verification for high-quality ground truth (Tripathy et al., 3 Jun 2025).
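The schedule-based labeling step can be sketched as follows. The `schedule` tuple layout and the flow field names are illustrative assumptions, not the dataset's actual tooling:

```python
def label_flow(flow, schedule, default="Benign"):
    """Assign a class label by matching the flow's start time against
    the attack schedule and checking that the attacker's IP is one of
    the flow's endpoints. `schedule` entries are
    (start_time, end_time, attacker_ip, label) tuples."""
    for start, end, attacker_ip, label in schedule:
        in_window = start <= flow["start"] <= end
        attacker_involved = attacker_ip in (flow["src_ip"], flow["dst_ip"])
        if in_window and attacker_involved:
            return label
    return default
```

The manual verification step described above would then spot-check a sample of these automatic assignments against the raw PCAPs.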

3. Feature Extraction and Representation

The canonical CSE-CIC-IDS2018 CSV files contain 78–84 per-flow features, extracted by CICFlowMeter-v3 (Atefinia et al., 2022, Cantone et al., 2024, Sarhan et al., 2021). Features span:

  • Flow statistics: Duration = t_end − t_start, total packet/byte counts in each direction, per-direction means, maxima, standard deviations.
  • Byte- and header-level aggregates: Fwd/Bwd header lengths, window sizes, segment and subflow sizing, flag counters (SYN, FIN, PSH, etc.).
  • Temporal measures: Inter-arrival times (mean, min, max, std), active/idle periods, bulk transfer metrics.
  • Derived ratios: e.g., Down/Up ratio = TotalBwdBytes / TotalFwdBytes.

Features are all numeric; identifiers (IP, port) are typically dropped during ML preprocessing to prevent overfitting (Cantone et al., 2024, Chua et al., 2022, Sarhan et al., 2020). Units are either integers (counts) or real-valued (bytes, milliseconds, seconds, rates). The feature set enables the construction of aggregated statistics meaningful for a wide variety of traffic patterns and attacks (Soltani et al., 2020, Atefinia et al., 2022, Sarhan et al., 2021, Göcs et al., 2023).
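A minimal preprocessing sketch along these lines drops identifier columns and derives the Down/Up ratio with a zero-division guard. The column names are assumptions for illustration; CICFlowMeter header names vary across releases.

```python
IDENTIFIER_COLS = ("Src IP", "Dst IP", "Src Port", "Dst Port", "Timestamp")

def preprocess(record, drop=IDENTIFIER_COLS):
    """Remove identifier fields (to avoid overfitting on endpoints)
    and add the Down/Up byte ratio, guarding against zero forward bytes."""
    out = {k: v for k, v in record.items() if k not in drop}
    fwd = out.get("TotLen Fwd Pkts", 0)
    bwd = out.get("TotLen Bwd Pkts", 0)
    out["Down/Up Ratio"] = bwd / fwd if fwd else 0.0
    return out
```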

Alternative lightweight versions such as NF-CSE-CIC-IDS2018 reduce the feature set to 8–12 NetFlow fields (e.g., in/out bytes/packets, protocols, TCP flags, duration) for computational efficiency (Sarhan et al., 2020, Sarhan et al., 2021).

4. Preprocessing, Labeling, and Partitioning Strategies

Data cleaning procedures consistently involve dropping flows with NaNs, inf, or negative values, removing columns of zero variance, and omitting timestamps and identifiers (Andrecut, 2022, Göcs et al., 2023, Chua et al., 2022). Features are normalized either via min–max or z-score scaling:

x' = \frac{x - x_\text{min}}{x_\text{max} - x_\text{min}} \quad \text{or} \quad x' = \frac{x - \mu}{\sigma}
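Both scalings can be sketched in plain Python. This is a minimal illustration; production pipelines typically use library scalers fit on the training split only, to avoid test-set leakage.

```python
def minmax_scale(col):
    """x' = (x - min) / (max - min); constant columns map to 0."""
    lo, hi = min(col), max(col)
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in col]

def zscore_scale(col):
    """x' = (x - mu) / sigma, using the population standard deviation."""
    n = len(col)
    mu = sum(col) / n
    sigma = (sum((x - mu) ** 2 for x in col) / n) ** 0.5
    return [(x - mu) / sigma if sigma else 0.0 for x in col]
```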

Ten- or five-fold stratified cross-validation is the prevailing partitioning scheme; some variants focus on per-class balancing through under- or oversampling (e.g., subsampling benign traffic to address class imbalance) (Tripathy et al., 3 Jun 2025, Chua et al., 2022, Andrecut, 2022). Several studies exclude rare classes (infiltration, SQL-injection, XSS) from binary classification tasks due to extremely low prevalence (Zhong et al., 2022, Chua et al., 2022).
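Stratified partitioning can be sketched as round-robin assignment within each class; a minimal illustration of the idea, not a replacement for library implementations:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Return k folds (lists of sample indices) with each class
    distributed as evenly as possible across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)                 # randomize within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)        # deal out round-robin
    return folds
```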

Labeling is performed by correlating flow timing/IP/port data with the attack scheduler, with spot audits achieving over 99% accuracy (Tripathy et al., 3 Jun 2025, Cantone et al., 2024). For anomaly detection, labels are collapsed into a binary attack/benign format; multi-class labels are retained for more granular detection (Menssouri et al., 9 Feb 2025, Zhong et al., 2022). Some works (e.g., (Soltani et al., 2020)) apply further aggregation into high-level attack families.
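Collapsing the multi-class labels into the binary attack/benign format is a one-line transformation (a minimal sketch; the label string "Benign" matches the dataset's convention, while any attack-family grouping beyond this is study-specific):

```python
def to_binary(labels):
    """Map the dataset's class labels to 0 (benign) / 1 (attack)."""
    return [0 if y == "Benign" else 1 for y in labels]
```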

5. Feature Selection and Reduction Approaches

A range of filter- and wrapper-based feature selection techniques have been evaluated. Key methods include:

  • Statistical filters: Chi-square, information gain, correlation coefficient, and ANOVA F-score; top features typically include ACK Flag Count, Fwd Seg Size Min, Init Fwd Win Byts, RST Flag Count, ECE Flag Count (Sarhan et al., 2021, Göcs et al., 2023).
  • Relief and gain ratio: Providing rankings robust to noisy labels and multiclass setups (Göcs et al., 2023).
  • Learner-based/embedded selection: Random forest permutation importance, “learner-based analyzers,” and end-to-end accuracy-driven selection (e.g., tree-based selection for minimal optimal feature sets) (Chua et al., 2022, Atefinia et al., 2022, Göcs et al., 2023).

Empirically, only 3–10 features are required to approach full-model accuracy (>97%) for most binary attack types (Göcs et al., 2023, Sarhan et al., 2021); e.g., for FTP-BruteForce, 8 features suffice for F₁ = 1.00 with a Random Forest classifier, and for SSH-BruteForce, 7 features yield F₁ = 0.99999 (Göcs et al., 2023).
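The ANOVA F-score filter mentioned above can be sketched for a single feature; a minimal one-way ANOVA implementation for illustration, whereas library versions vectorize this across all features at once:

```python
from collections import defaultdict

def anova_f(feature, labels):
    """One-way ANOVA F-score of a single numeric feature against
    class labels: between-class variance over within-class variance."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[y].append(x)
    n, k = len(feature), len(groups)
    grand = sum(feature) / n
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    ss_between = sum(len(g) * (means[c] - grand) ** 2
                     for c, g in groups.items())
    ss_within = sum(sum((x - means[c]) ** 2 for x in g)
                    for c, g in groups.items())
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Features are then ranked by F-score and the top few retained, which is consistent with the small feature counts reported above.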

6. Representative Machine Learning Workflows and Benchmarks

CSE-CIC-IDS2018 supports a spectrum of ML and DL approaches:

  • Supervised Learning: Logistic regression, SVM, decision trees (including gradient-boosted variants), Naive Bayes, Random Forests, artificial/deep neural nets, CNNs, LSTMs. Many studies demonstrate high training performance (accuracy >99%) but observe notable performance drops under cross-dataset settings or on rare attacks (Tripathy et al., 3 Jun 2025, Chua et al., 2022, Göcs et al., 2023).
  • Anomaly Detection: Outlier detectors (MND, COPOD, AutoEncoder, etc.) have been tested on process-mining-derived features (Zhong et al., 2022).
  • Online/Adaptive Analysis: Process mining approaches transform TCP flag transitions into time-windowed adjacency matrices, providing strong online F₁-scores (e.g. >0.94 for many attack types) and competitive or superior AUC relative to traditional flows (Zhong et al., 2022).
  • Augmentation and Class Imbalance: GAN-based solutions (CTGAN), SMOTEENN, and focal loss are deployed to synthetically rebalance classes and improve rare-event detection, notably increasing rare-class F₁ from <0.8 to >0.84 without loss of global accuracy (Menssouri et al., 9 Feb 2025).
  • Content-based Detection: DNN frameworks ingesting packet payload bytes demonstrate robust attack/benign discrimination (Precision/Recall/F₁ ≈ 0.93) (Soltani et al., 2020).
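As a simpler baseline than the CTGAN or SMOTEENN approaches cited above, plain random oversampling of minority classes illustrates the rebalancing idea (a sketch only: it duplicates existing samples rather than synthesizing new ones, so it risks overfitting on very rare classes):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the majority-class count. Returns new (X, y) lists."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    Xo, yo = list(X), list(y)
    for cls, c in counts.items():
        idxs = [i for i, lbl in enumerate(y) if lbl == cls]
        for _ in range(target - c):
            i = rng.choice(idxs)
            Xo.append(X[i])
            yo.append(cls)
    return Xo, yo
```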

NetFlow variants provide near-comparable binary detection rates but substantially degraded multi-class discrimination (e.g. drop from F₁=0.94 to F₁=0.80) (Sarhan et al., 2020, Sarhan et al., 2021).

7. Limitations, Known Quality Issues, and Best Practices

The CSE-CIC-IDS2018 dataset exhibits several documented limitations:

  • Labeling and Feature Calculation Errors: Automatic labeling (by time windows) can misassign classes, with audits revealing up to 7.5% mislabeled flows and systemic feature miscalculations inherited from earlier datasets (duplicated features, erroneous directionality, inconsistent flow terminations) (Cantone et al., 2024).
  • Class Imbalance and Rarity Effects: Severe under-representation of attacks like SQL-Injection, Infiltration, and WebAttack-XSS necessitates careful resampling or augmentation for robust ML (Menssouri et al., 9 Feb 2025).
  • Cross-dataset Generalizability: Out-of-the-box models trained and tested on CSE-CIC-IDS2018 yield "nearly perfect" accuracy, yet perform near random chance when validated against external datasets (e.g., LycoS-IDS-2017), indicating distribution shift and data leakage (Cantone et al., 2024). Correction workflows (LycoS-Unicas-IDS2018) and re-extracted features are recommended for genuine transferability.
  • Feature Redundancy: High collinearity and redundancy in the feature set mean most ML tasks succeed with fewer than 10 features, provided they are carefully selected (Göcs et al., 2023, Sarhan et al., 2021).
  • Potential Hidden Labels: Care must be taken to remove fields such as deterministic ports or network-layer identifiers that may unrealistically separate classes (Sarhan et al., 2021).

Best Practices: Use robust feature selection, address class imbalance via oversampling/augmentation, remove identifiers and timestamps, and validate with stratified and, when possible, cross-dataset splits. Always visually inspect per-class feature distributions for small or degenerate clusters indicative of leakage or mislabeling (Cantone et al., 2024, Sarhan et al., 2021, Göcs et al., 2023).
