Catch'em all: Classification of Rare, Prominent, and Novel Malware Families (2403.02546v1)

Published 4 Mar 2024 in cs.CR

Abstract: National security is threatened by malware, which remains one of the most dangerous and costly cyber threats. As of last year, researchers reported 1.3 billion known malware specimens, motivating the use of data-driven ML methods for analysis. However, shortcomings in existing ML approaches hinder their mass adoption. These challenges include detecting novel malware and classifying malware under class imbalance, a situation where malware families are not equally represented in the data. Our work addresses these shortcomings with MalwareDNA: an advanced dimensionality reduction and feature extraction framework. We demonstrate stable performance under class imbalance on two tasks, malware family classification and novel malware detection, at the cost of an increased abstention (reject-option) rate.
