A Survey of Network-based Intrusion Detection Data Sets

Published 6 Mar 2019 in cs.CR | (1903.02460v2)

Abstract: Labeled data sets are necessary to train and evaluate anomaly-based network intrusion detection systems. This work provides a focused literature survey of data sets for network-based intrusion detection and describes the underlying packet- and flow-based network data in detail. The paper identifies 15 different properties to assess the suitability of individual data sets for specific evaluation scenarios. These properties cover a wide range of criteria and are grouped into five categories such as data volume or recording environment for offering a structured search. Based on these properties, a comprehensive overview of existing data sets is given. This overview also highlights the peculiarities of each data set. Furthermore, this work briefly touches upon other sources for network-based data such as traffic generators and traffic repositories. Finally, we discuss our observations and provide some recommendations for the use and creation of network-based data sets.

Summary

The paper systematically categorizes 34 NIDS data sets using 15 properties grouped into five categories: general information, nature of the data, data volume, recording environment, and evaluation.
The paper highlights the gap in fully labeled, publicly available data sets and stresses using multiple sets to prevent overfitting.
The paper recommends predefined training/test splits and collaborative data sharing to improve the robustness and comparability of NIDS evaluations.

Survey of Network-based Intrusion Detection Data Sets

This paper surveys data sets used to train and evaluate network-based intrusion detection systems (NIDS). It argues that labeled data sets are necessary for anomaly-based NIDS and describes the underlying packet-based and flow-based data formats. The authors identify 15 properties for assessing whether a data set suits a specific evaluation scenario. These properties are grouped into five categories (General Information, Nature of the Data, Data Volume, Recording Environment, and Evaluation) and form the basis for a structured overview of existing data sets.
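To make the packet-based versus flow-based distinction concrete, the following is a minimal sketch of how individual packets might be aggregated into unidirectional flow records. It assumes the common convention of keying a flow on the 5-tuple (source IP, source port, destination IP, destination port, protocol). The `Packet` type, its field names, and the `aggregate_flows` helper are hypothetical and are not taken from the paper. Real flow exporters also apply timeouts and other rules that are omitted here.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet header record (no payload)."""
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int  # bytes


FlowKey = Tuple[str, int, str, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, dict]:
    """Group packets that share a 5-tuple into one unidirectional flow record.

    Assumption: no idle or active timeouts, so all packets with the same
    5-tuple end up in the same flow regardless of the gap between them.
    """
    flows: Dict[FlowKey, dict] = {}
    for p in packets:
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.protocol)
        flow = flows.get(key)
        if flow is None:
            # First packet of this flow: start a new summary record.
            flows[key] = {
                "first_seen": p.timestamp,
                "last_seen": p.timestamp,
                "packets": 1,
                "bytes": p.size,
            }
        else:
            # Later packet: update the aggregate counters only.
            flow["first_seen"] = min(flow["first_seen"], p.timestamp)
            flow["last_seen"] = max(flow["last_seen"], p.timestamp)
            flow["packets"] += 1
            flow["bytes"] += p.size
    return flows
```

The sketch shows why flow-based data is much more compact than packet-based data: each flow keeps only summary counters and timestamps, while per-packet detail is discarded.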

Key Contributions

Detailed Categorization: The study identifies and evaluates data sets across 15 properties, grouped into categories like General Information, Nature of the Data, Data Volume, Recording Environment, and Evaluation. This systematic categorization aids in assessing data set suitability for various intrusion detection scenarios.
Comprehensive Overview: The authors review 34 network-based intrusion detection data sets, highlighting the peculiarities and attack types of each. This helps researchers select data sets appropriate for their projects.
Observations and Recommendations: Several key observations are made regarding the creation and use of data sets:
- Absence of a Perfect Data Set: Realistically, no data set can embody all desired attributes due to the evolving nature of attack scenarios and data privacy concerns.
- Use of Multiple Data Sets: To avoid overfitting and ensure broader applicability, the authors recommend evaluating intrusion detection methods on multiple data sets.
- Predefined Subsets: Providing specific training and test splits facilitates evaluation comparison across different approaches.
- Data Set Publication and Anonymization: Emphasis is placed on public availability and careful anonymization to maintain utility while preserving privacy.
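The recommendations on multiple data sets and predefined subsets can be illustrated with a small evaluation loop. This is only a sketch under stated assumptions: the two toy data sets are made-up numbers rather than samples from any real data set, the single numeric feature and the midpoint-threshold "detector" are placeholders for a real method, and all names (`fit_threshold`, `accuracy`, `evaluate_on_all`, `toy_data_sets`) are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One sample: (feature value, label), where label 0 = benign and 1 = attack.
Sample = Tuple[float, int]


def fit_threshold(train: List[Sample]) -> float:
    """Toy detector: midpoint between the mean benign and mean attack value.

    Assumption: attack samples tend to have larger feature values.
    """
    benign = [x for x, y in train if y == 0]
    attack = [x for x, y in train if y == 1]
    return (mean(benign) + mean(attack)) / 2


def accuracy(threshold: float, test: List[Sample]) -> float:
    """Fraction of test samples classified correctly by the threshold."""
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


def evaluate_on_all(data_sets: Dict[str, Dict[str, List[Sample]]]) -> Dict[str, float]:
    """Train and test on each data set using its fixed, predefined split."""
    results = {}
    for name, split in data_sets.items():
        threshold = fit_threshold(split["train"])
        results[name] = accuracy(threshold, split["test"])
    return results


# Made-up toy data with predefined train/test subsets (illustrative only).
toy_data_sets = {
    "toy_set_a": {
        "train": [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)],
        "test": [(1.5, 0), (8.5, 1)],
    },
    "toy_set_b": {
        "train": [(10.0, 0), (12.0, 0), (30.0, 1), (34.0, 1)],
        "test": [(11.0, 0), (20.0, 1), (33.0, 1)],
    },
}

results = evaluate_on_all(toy_data_sets)
```

Because the splits are fixed in advance, two different methods evaluated this way see exactly the same training and test samples, which makes their scores directly comparable. Reporting one score per data set, rather than a single number, also shows when a method performs well on one data set but poorly on another.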

Implications and Future Directions

The survey offers critical insights into the current state of network-based intrusion detection data sets, emphasizing the scarcity of comprehensive, labeled, publicly available data. This scarcity presents a significant obstacle in the advancement of NIDS research. The paper calls for increased collaboration within the research community to develop and share standardized data sets that can be updated and reused for various intrusion detection tasks.

Additionally, the paper briefly covers traffic generators and traffic repositories as alternative sources of network-based data. Combining these sources with publicly available data sets could make evaluation methodologies more robust.

Conclusion

The paper fills a crucial gap by providing a structured framework for understanding and classifying network-based intrusion detection data sets. It urges the community to prioritize sharing data, resources, and methodologies to foster innovation in intrusion detection systems. While the "perfect" data set remains elusive, strategic use of existing and new data sources offers a path forward for developing more resilient NIDS solutions.
