- The paper systematically categorizes 34 NIDS data sets based on 15 properties including data volume, recording environment, and evaluation criteria.
- The paper highlights the gap in fully labeled, publicly available data sets and stresses using multiple sets to prevent overfitting.
- The paper recommends predefined training/test splits and collaborative data sharing to improve the robustness and comparability of NIDS evaluations.
Survey of Network-based Intrusion Detection Data Sets
This paper presents a comprehensive survey of data sets used for network-based intrusion detection systems (NIDS). It focuses on the necessity of labeled data sets for training and evaluating anomaly-based NIDS and distinguishes the two dominant data formats, packet-based and flow-based. The authors identify 15 properties for assessing a data set's suitability for specific evaluation scenarios, grouped into categories such as data volume and recording environment, and use them to provide a structured, detailed overview of existing data sets.
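The packet-/flow-based distinction can be made concrete with a small sketch: a packet-based record keeps per-packet headers and often payload, while a flow-based record aggregates all packets sharing the same connection 5-tuple. The field names below are illustrative, loosely modeled on typical NetFlow-style attributes; they are not a schema defined by the surveyed paper.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    # One record summarizes an entire connection (all packets sharing
    # the src/dst IP, src/dst port, and protocol 5-tuple).
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str      # e.g. "TCP" or "UDP"
    duration_s: float  # flow duration in seconds
    packets: int       # number of aggregated packets
    total_bytes: int   # bytes transferred, no payload content retained
    label: str         # e.g. "normal" or an attack class, if labeled

flow = FlowRecord("192.0.2.1", "198.51.100.7", 49152, 443,
                  "TCP", 1.25, 12, 8400, "normal")
print(flow.protocol, flow.packets)
```

Because a flow record discards payloads, flow-based data sets are far smaller and easier to anonymize than packet captures, at the cost of per-packet detail.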
Key Contributions
- Detailed Categorization: The paper identifies and evaluates data sets across 15 properties, grouped into categories like General Information, Nature of the Data, Data Volume, Recording Environment, and Evaluation. This systematic categorization aids in assessing data set suitability for various intrusion detection scenarios.
- Comprehensive Overview: The authors review 34 network-based intrusion detection data sets, highlighting each set's peculiarities and attack types and offering valuable guidance for researchers seeking appropriate data sets for their projects.
- Observations and Recommendations: Several key observations are made regarding the creation and use of data sets:
- Absence of a Perfect Data Set: Realistically, no data set can embody all desired attributes due to the evolving nature of attack scenarios and data privacy concerns.
- Use of Multiple Data Sets: To avoid overfitting and ensure broader applicability, the authors recommend evaluating intrusion detection methods on multiple data sets.
- Predefined Subsets: Providing specific training and test splits makes evaluation results directly comparable across different approaches.
- Data Set Publication and Anonymization: Emphasis is placed on public availability and careful anonymization to maintain utility while preserving privacy.
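The recommendations on multiple data sets and predefined splits can be sketched as a small evaluation loop: one detection method is trained and scored on each data set's fixed train/test split, and per-set results are reported instead of a single number. The tiny threshold "detector" and the inline toy data are hypothetical, chosen only to keep the example self-contained.

```python
def fit_threshold(train):
    # train: list of (feature_value, is_attack) pairs.
    # Flag anything above the largest benign feature value.
    normal = [x for x, is_attack in train if not is_attack]
    return max(normal)

def accuracy(threshold, test):
    # Fraction of test records whose predicted label matches the truth.
    correct = sum((x > threshold) == is_attack for x, is_attack in test)
    return correct / len(test)

# Predefined train/test splits per data set (toy values, hypothetical).
datasets = {
    "set_A": {"train": [(1, False), (2, False), (9, True)],
              "test":  [(1, False), (8, True)]},
    "set_B": {"train": [(3, False), (4, False), (7, True)],
              "test":  [(2, False), (9, True)]},
}

for name, split in datasets.items():
    thr = fit_threshold(split["train"])
    print(name, accuracy(thr, split["test"]))
```

Reporting one score per data set, as above, exposes methods that only work on a single benchmark, which is exactly the overfitting risk the authors warn about.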
Implications and Future Directions
The survey offers critical insights into the current state of network-based intrusion detection data sets, emphasizing the scarcity of comprehensive, labeled, publicly available data. This scarcity presents a significant obstacle in the advancement of NIDS research. The paper calls for increased collaboration within the research community to develop and share standardized data sets that can be updated and reused for various intrusion detection tasks.
Additionally, the paper's discussion of traffic generators and data repositories as alternative data sources opens new avenues for research and experimentation. Integrating these sources with publicly available data sets could enhance the robustness of evaluation methodologies.
Conclusion
The paper fills a crucial gap by providing a structured framework for understanding and classifying network-based intrusion detection data sets. It urges the community to prioritize sharing data, resources, and methodologies to foster innovation in intrusion detection systems. While the "perfect" data set remains elusive, strategic use of existing and new data sources offers a path forward for developing more resilient NIDS solutions.