MLRan: A Behavioural Dataset for Ransomware Analysis and Detection

Published 24 May 2025 in cs.CR and cs.LG | (2505.18613v1)

Abstract: Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several ML models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at https://github.com/faithfulco/mlran.


Authors (4)

Summary

An Academic Exploration of 'MLRan: A Behavioural Dataset for Ransomware Analysis and Detection'

The paper "MLRan: A Behavioural Dataset for Ransomware Analysis and Detection" addresses a persistent obstacle to building effective machine learning models for ransomware detection: publicly available datasets are scarce, and those that exist are often limited in sample size, diversity, and reproducibility. The authors introduce MLRan, a dataset that captures dynamic ransomware behavior across a wide range of families and types, and pair it with concrete guidelines for constructing high-quality behavioral datasets. Together, the dataset and guidelines give researchers a shared, reproducible basis for training and comparing ransomware detection models.

Dataset Composition and Methodological Insights

The MLRan dataset compiles over 4,800 samples across 64 ransomware families, together with a balanced set of goodware samples. The samples span 2006 to 2024 and cover four major ransomware types: locker, crypto, ransomware-as-a-service (RaaS), and modern variants. The accompanying guidelines, termed GUIDE-MLRan, informed the curation of the dataset and set out criteria such as sample diversity, representativeness, feature extraction, and reproducibility. These guidelines are grounded in existing literature and are intended to serve as a methodological reference for future dataset construction.

Technical Approach and Evaluation

To make the dataset usable for machine learning tasks, dynamic behavioral features are extracted with an automated sandbox pipeline, which initially yields about 6.4 million features. Feature selection proceeds in two stages: mutual information filtering reduces the feature set to 24,162, and recursive feature elimination then narrows it to 483 highly informative features. Models evaluated on MLRan achieved accuracy of up to 98.7%, precision of up to 98.9%, and recall of up to 98.5%. Explainability tools such as SHAP and LIME were used to identify the features that most strongly distinguish ransomware from goodware, including registry tampering, strings, and API misuse.
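The mutual information filtering stage can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the feature names, the toy data, and the helper functions `mutual_information` and `select_top_k` are hypothetical, and the features are assumed to be binary indicators (behavior observed or not). Recursive feature elimination, the second stage, is not shown.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # counts of (x, y) pairs
    fx = Counter(feature)                   # marginal counts of x
    fy = Counter(labels)                    # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (fx[x] * fy[y]))
    return mi


def select_top_k(rows, labels, k):
    """Score every column by MI with the labels; return the k best indices."""
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical binary behavioral indicators (illustrative names only).
FEATURE_NAMES = ["reg_key_modified", "opens_readme", "bulk_file_rename"]

# Toy samples: one row per sample, one column per feature.
rows = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = ransomware, 0 = goodware

selected, scores = select_top_k(rows, labels, k=2)
for j in selected:
    print(f"{FEATURE_NAMES[j]}: {scores[j]:.3f} bits")
```

In this toy data the first feature matches the labels exactly and the second is statistically independent of them, so the filter keeps the first and third columns. On a real feature matrix with millions of columns, a vectorized estimator from an established library would be the more practical choice.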

Implications and Future Directions

The study has both practical and theoretical implications. Practically, MLRan provides a reproducible resource for developing and benchmarking ransomware detection models. Theoretically, it clarifies which behavioral features are indicative of ransomware activity. Future work could combine the dataset with other approaches, such as unsupervised learning or anomaly detection frameworks, to improve detection of adaptive ransomware threats. Because the dataset and source code are publicly available, other researchers can replicate the results and build on them when developing more advanced detection models.

In conclusion, this paper significantly enhances ransomware detection research by providing a substantive dataset alongside robust methodological guidelines. This foundational work paves the way for improved machine learning models and methodologies, fostering advanced ransomware detection systems capable of addressing evolving cyber threats in an increasingly digital world.
