
EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis (2310.01835v1)

Published 3 Oct 2023 in cs.LG

Abstract: In recent years, there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly large volumes of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the few related works focus on quantifying similarity in malware, often overlooking clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER, one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags using the open-source tool AVClass on VirusTotal data; and (3) we describe and share the implementation for our class scoring technique and leaf similarity method.
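The abstract names a leaf similarity method but does not define it. The sketch below illustrates one common notion of leaf-based similarity in tree ensembles: the fraction of trees in which two samples fall into the same leaf. This definition is an assumption made for illustration and may differ from the authors' formulation; the function name `leaf_similarity` and the leaf indices are hypothetical.

```python
from typing import Sequence


def leaf_similarity(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Fraction of trees in which two samples land in the same leaf.

    Each argument holds one leaf index per tree of an already trained
    ensemble. This is an assumed, commonly used definition, not
    necessarily the one implemented for EMBERSim.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both samples need one leaf index per tree")
    if not leaves_a:
        raise ValueError("at least one tree is required")
    # Count the trees whose leaf index agrees for the two samples.
    matches = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return matches / len(leaves_a)


# Hypothetical leaf indices for three samples in a 4-tree ensemble.
sample_x = [3, 7, 1, 4]
sample_y = [3, 7, 2, 4]
sample_z = [0, 5, 2, 6]

sim_xy = leaf_similarity(sample_x, sample_y)  # samples agree in 3 of 4 trees
sim_xz = leaf_similarity(sample_x, sample_z)  # samples agree in no tree
```

In practice the per-tree leaf indices would come from a trained model; for example, XGBoost can return them when `pred_leaf=True` is passed to `Booster.predict`. Whether EMBERSim obtains them this way is not stated in the abstract.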

