Smart OMVI: Obfuscated Malware Variant Identification using a novel dataset (2310.10670v1)
Abstract: Cybersecurity has become a significant concern in the digital era as everyday computer use has grown. Cybercriminals now engage in more than virus distribution and computer hacking, and cyberwarfare has emerged as a threat to a nation's survival. Malware analysis serves as the first line of defence against an attack and is a significant component of the response to cybercrime. Every day, malware attacks target a large number of computer users, businesses, and governmental agencies, causing billions of dollars in losses. Although security experts have a variety of tools at their disposal to identify malware, its designers can evade multiple antivirus (AV) products with a very minor, cunning tweak. To address this challenge, a new dataset called the Obfuscated Malware Dataset (OMD) has been developed. This dataset comprises 40 distinct malware families with 21,924 samples, and it incorporates obfuscation techniques that mimic the strategies malware creators use to make their variants differ from the original samples. The purpose of this dataset is to provide a more realistic and representative environment for evaluating the effectiveness of malware analysis techniques. Several conventional machine learning algorithms, including but not limited to Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), are applied and compared. The results show that XGBoost outperformed the other algorithms, achieving an accuracy of 82%, precision of 88%, recall of 80%, and an F1-score of 83%.
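To make the four reported metrics concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed for a multi-class family classifier. The abstract does not state how the per-family scores are averaged, so macro averaging is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical illustrations rather than anything taken from the OMD experiments.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Assumption: macro averaging (unweighted mean over classes); the
    abstract does not say which averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three hypothetical malware families.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
scores = macro_metrics(y_true, y_pred)
```

Under macro averaging every family counts equally regardless of its sample count, which matters when families are imbalanced; a weighted average would instead scale each family's score by its number of samples.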