
Recent IDS Datasets & Advances

Updated 7 January 2026
  • The survey reviews seven large-scale IDS datasets with diverse feature sets, supporting rigorous evaluation across real and synthetic network environments.
  • It describes how packet-header and flow data support anomaly-based, signature-based, and deep learning detection techniques.
  • It outlines three experimental pathways (cross-domain transfer, adversarial robustness, and IoT–smartphone convergence) for improving IDS performance.

Recent Datasets and Methodological Advances in Intrusion Detection Systems (IDS)

Intrusion Detection Systems (IDS) research has been transformed in recent years by the publication of several large-scale, contemporary datasets that reflect modern network topologies, threat vectors, and application environments. These datasets are instrumental for both the creation and the rigorous evaluation of novel IDS methodologies, with an increasing focus on IoT, mobile, and encrypted traffic scenarios (Jindal et al., 2021).

1. Overview of Contemporary IDS Datasets

Seven recently published datasets encapsulate diverse real-world and synthetic environments, providing rich contexts, high-volume traffic, and comprehensive feature sets. The table below summarizes core attributes:

| Dataset | Environment & Period | Key Features/Coverage |
|---|---|---|
| TOR-nonTOR | VirtualBox testbed, Tor gateway (2017) | Tor/non-Tor flows, eight Tor-app labels, packet-header + flow features |
| CICAndMal2017 | Real Android phones (2017) | Benign + 42 malware families, 80 flow features, Android TCP/UDP flows |
| Bot-IoT | Virtual IoT testbed, 5 device types (2018) | 72M NetFlow records, DoS/DDoS, scanning, keylogging, exfiltration |
| Network TON_IoT | 3-tier IoT (edge–fog–cloud) (2021) | Real/emulated devices, 70+ flow/header features, application-layer attacks |
| CIC DoS | Single Linux server (2017) | HTTP/HTTPS floods, low-rate DDoS, 75 L7/flow features |
| CICIDS2017 | Multi-OS lab, attacker/victim LANs (2017) | Seven modern attacks, simulated user profiles, 80 NetFlow features |
| UCSD Telescope | /8 darknet telescope (2008–present) | Massive scale, unsolicited traffic (scans, backscatter), packet headers |

Each dataset targets specific gaps: privacy-preserving analysis (TOR-nonTOR), smartphone malware (CICAndMal2017), IoT attacks (Bot-IoT, TON_IoT), and large-scale anomaly screening (UCSD Telescope). Volume ranges from tens of megabytes (TOR-nonTOR) to multiple terabytes per day (UCSD Telescope), with features derived almost exclusively from packet/network headers or NetFlow summary statistics. A salient trend is the underrepresentation of full-payload data due to privacy and compliance considerations (Jindal et al., 2021).
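To make the idea of header-derived flow features concrete, the following is a minimal sketch of how packets might be grouped into flows and summarized. The `Packet` fields, the `flow_features` helper, and the chosen statistics are illustrative assumptions; they do not reproduce the feature extractors used to build any of the datasets above, and real flow exporters often merge both directions of a connection, which this sketch does not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical header-only view of a packet (no payload is kept)."""
    ts: float      # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    length: int    # packet length in bytes, taken from header fields


def flow_features(packets):
    """Group packets by unidirectional 5-tuple and compute simple flow statistics."""
    flows = defaultdict(list)
    for p in packets:
        flows[(p.src, p.dst, p.sport, p.dport, p.proto)].append(p)

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p.ts)
        total_bytes = sum(p.length for p in pkts)
        # Inter-arrival times between consecutive packets of the same flow.
        gaps = [b.ts - a.ts for a, b in zip(pkts, pkts[1:])]
        features[key] = {
            "packets": len(pkts),
            "bytes": total_bytes,
            "duration": pkts[-1].ts - pkts[0].ts,
            "mean_pkt_len": total_bytes / len(pkts),
            "mean_iat": sum(gaps) / len(gaps) if gaps else 0.0,
        }
    return features


example = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 5000, 443, "TCP", 100),
    Packet(0.5, "10.0.0.1", "10.0.0.2", 5000, 443, "TCP", 300),
    Packet(1.0, "10.0.0.1", "10.0.0.2", 5000, 443, "TCP", 200),
    Packet(0.2, "10.0.0.3", "10.0.0.2", 6000, 53, "UDP", 80),
]
features = flow_features(example)
```

Because only header fields are read, a representation like this is compatible with the payload suppression that most of the surveyed datasets apply.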

2. Taxonomy and Dataset Selection Guidance

To systematize dataset selection for new IDS research, the datasets are categorized as follows:

  • Tor Traffic Datasets: E.g., TOR-nonTOR, for privacy/anonymized traffic and Tor application fingerprinting.
  • Android Malware Datasets: E.g., CICAndMal2017, for smartphone malware and benign/hostile app discrimination.
  • IoT-Focused Datasets: Bot-IoT (virtual/synthetic), Network TON_IoT (real+emulated, edge–fog–cloud architectures).
  • Traditional Network Datasets: CIC DoS (application-layer DoS, low-rate DDoS), CICIDS2017 (multi-class attacks, realistic benign traffic), UCSD Telescope (unsolicited global traffic, backscatter).

The taxonomy is functionally aligned with key IDS research scenarios:

  • Real-time streaming detection: UCSD Telescope, CICIDS2017.
  • Anomaly-based algorithms: Bot-IoT, TON_IoT.
  • Signature- and classifier-driven techniques: CICAndMal2017, CIC DoS.
  • IoT/Smart Device Security: Bot-IoT, TON_IoT.
  • Privacy- and adversarial-traffic studies: TOR-nonTOR.

Selection thus depends on the desired experimental paradigm, the protocol heterogeneity required, and how fine-grained the attack/benign subcategories need to be.
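The scenario-to-dataset alignment above can be encoded as a small lookup table. The sketch below is one possible encoding; the dictionary name, the lowercase keys, and the `candidate_datasets` helper are assumptions introduced for illustration, while the mapping itself is taken from the list above.

```python
# Mapping from research scenario to the datasets listed for it above.
SCENARIO_TO_DATASETS = {
    "real-time streaming detection": ["UCSD Telescope", "CICIDS2017"],
    "anomaly-based algorithms": ["Bot-IoT", "TON_IoT"],
    "signature- and classifier-driven techniques": ["CICAndMal2017", "CIC DoS"],
    "iot/smart device security": ["Bot-IoT", "TON_IoT"],
    "privacy- and adversarial-traffic studies": ["TOR-nonTOR"],
}


def candidate_datasets(*scenarios):
    """Return the datasets that are listed for every requested scenario.

    Raises KeyError for a scenario that is not in the table.
    """
    if not scenarios:
        return []
    lists = [SCENARIO_TO_DATASETS[s.lower()] for s in scenarios]
    first, rest = lists[0], lists[1:]
    # Keep the order of the first scenario; require membership in all others.
    return [d for d in first if all(d in other for other in rest)]


iot_anomaly = candidate_datasets("Anomaly-based algorithms",
                                 "IoT/Smart Device Security")
```

Here `iot_anomaly` contains Bot-IoT and TON_IoT, since both appear under each of the two scenarios. Combining scenarios with no dataset in common returns an empty list, which is itself a useful signal that no single public dataset covers the intended study.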

Observed deficits in the current public corpus of IDS datasets include:

  • Absence of datasets reflecting realistic, benign smartphone usage patterns (beyond Android malware).
  • Inadequate representation of edge-network protocols (Bluetooth, ZigBee, LoRaWAN) and wearable IoT devices.
  • The near-total lack of payload-level capture, which limits the study of application-layer attacks and deep-packet context for ML-based detection.

Notably, a recurrent trade-off is identified between attack realism (retaining original IP addresses and full traffic) and effective anonymization (e.g., statistics-preserving prefix truncation, payload suppression). This trade-off may affect both privacy research and the risk of machine learning models overfitting (Jindal et al., 2021).
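As a rough illustration of the anonymization side of this trade-off, the sketch below truncates addresses to a network prefix and drops payload bytes from a flow record. The function names, the record keys (`src`, `dst`, `payload`), and the default prefix lengths are assumptions chosen for the example. This is plain truncation, not a cryptographic prefix-preserving scheme, and it is not the procedure used by any specific dataset.

```python
import ipaddress


def truncate_ip(addr: str, keep_bits_v4: int = 24, keep_bits_v6: int = 48) -> str:
    """Zero out the host bits of an address, keeping only the leading prefix."""
    ip = ipaddress.ip_address(addr)
    keep = keep_bits_v4 if ip.version == 4 else keep_bits_v6
    # strict=False lets ip_network accept an address with host bits set.
    network = ipaddress.ip_network(f"{addr}/{keep}", strict=False)
    return str(network.network_address)


def anonymize_record(record: dict) -> dict:
    """Return a copy with truncated endpoints and the payload field removed."""
    out = {k: v for k, v in record.items() if k != "payload"}
    out["src"] = truncate_ip(record["src"])
    out["dst"] = truncate_ip(record["dst"])
    return out


raw = {"src": "192.168.17.42", "dst": "2001:db8:abcd:12::1",
       "bytes": 512, "payload": b"GET / HTTP/1.1"}
anon = anonymize_record(raw)
```

Hosts in the same /24 (or /48) become indistinguishable after this step, which protects individuals but also removes per-host detail that a detector might otherwise rely on.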

Recommendations for future dataset publication include increasing feature diversity (headers, flows, selective payload, timing/side-channel), inclusion of contemporary encrypted traffic patterns (e.g., TLS 1.3, QUIC), and releasing both pcaps and interpreted flows (NetFlow/IPFIX). In addition, more multi-stage and blended (compound) attack scenarios are called for.

4. Quantitative Metrics and Standard Evaluation

No novel quantitative evaluation metrics are introduced in the survey, but standard IDS diagnostic measures are universally adopted:

  • Precision: $TP / (TP + FP)$
  • Recall: $TP / (TP + FN)$
  • F₁ Score: $2 \cdot (\text{precision} \cdot \text{recall}) / (\text{precision} + \text{recall})$
  • Accuracy: $(TP + TN) / (TP + TN + FP + FN)$

where $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. These metrics are central to fair comparison across architectures, datasets, and study designs (Jindal et al., 2021).
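The four measures above can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function name `ids_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example, not something the survey specifies.

```python
def ids_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 1000 flows, of which 90 are attacks.
scores = ids_metrics(tp=80, fp=20, tn=890, fn=10)
```

With these hypothetical counts, precision is 0.80, recall is about 0.889, F₁ is about 0.842, and accuracy is 0.97. The gap between accuracy and F₁ shows why accuracy alone can look flattering when benign traffic dominates the test set.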

5. Conclusions and Research Directions

Recent datasets exhibit substantially increased scale, heterogeneity, and real-world fidelity, particularly enabling deep learning architectures (e.g., CNN-LSTM on CICAndMal2017). IoT datasets (Bot-IoT, TON_IoT) display progress but still lack comprehensive protocol/device classes. Privacy-focused datasets (TOR-nonTOR) uniquely facilitate both defensive and side-channel analyses (e.g., Tor application fingerprinting). Darknet traffic resources (UCSD Telescope) remain central to anomaly screening and global threat intelligence.

Three experimental pathways are delineated for further research:

  • Cross-domain transfer learning: Pretrain on synthetic IoT flows (Bot-IoT) and adapt to real heterogeneous environments (TON_IoT), especially for anomaly detection under novel attack/device scenarios.
  • Adversarial robustness evaluation: Use TOR-nonTOR to train flow-based Tor application classifiers; generate and inject adversarial flows (e.g., timing or payload padding) to assess degradation, and explore defense via ensemble or randomized detection schemes.
  • Smartphone–IoT convergence studies: Merge benign flows from CICAndMal2017 with IoT traffic (Bot-IoT) to form cross-device datasets, permitting the construction and evaluation of multi-stage, multi-device IDS workflows simulating smart-home threats.
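The third pathway, merging benign smartphone flows with IoT traffic, can be sketched as a schema-alignment step. Everything below is hypothetical: the column names (`duration`, `bytes`, `label`), the `device_domain` tag, and the `merge_cross_device` helper are invented for illustration. The actual CICAndMal2017 and Bot-IoT files use their own column names, which would have to be mapped to a common schema first.

```python
def merge_cross_device(phone_flows, iot_flows,
                       label_key="label", benign_value="benign"):
    """Combine benign smartphone flows with IoT flows on their shared columns.

    Assumptions: each flow is a dict, and smartphone rows mark benign
    traffic with `benign_value` (compared case-insensitively).
    """
    benign_phone = [r for r in phone_flows
                    if str(r[label_key]).lower() == benign_value]
    tagged = ([("smartphone", r) for r in benign_phone]
              + [("iot", r) for r in iot_flows])
    if not tagged:
        return []

    # Keep only the columns present in every record from both sources.
    shared = set(tagged[0][1])
    for _, record in tagged[1:]:
        shared &= set(record)

    merged = []
    for domain, record in tagged:
        row = {col: record[col] for col in sorted(shared)}
        row["device_domain"] = domain  # remember which source the row came from
        merged.append(row)
    return merged


phone = [
    {"duration": 1.2, "bytes": 900, "label": "Benign", "app": "browser"},
    {"duration": 0.4, "bytes": 120, "label": "Adware", "app": "unknown"},
]
iot = [
    {"duration": 0.1, "bytes": 60, "label": "DDoS", "device": "thermostat"},
    {"duration": 2.0, "bytes": 300, "label": "Normal", "device": "fridge"},
]
merged = merge_cross_device(phone, iot)
```

The sketch deliberately leaves label vocabularies untouched ("Benign" versus "Normal" in this toy data); harmonizing them, and checking that the shared features were computed the same way in both sources, would be necessary before training on the merged set.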

This survey’s synthesis of recent datasets—by feature support, traffic diversity, and ground-truth granularity—provides a rigorous foundation for ongoing advances in IDS methodology, with a strong emphasis on cross-domain modeling, adversarial evaluation, and the evolution of privacy-robust, real-world-representative benchmarks (Jindal et al., 2021).
