Assemblage: Automatic Binary Dataset Construction for Machine Learning (2405.03991v2)

Published 7 May 2024 in cs.CR and cs.LG

Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage code is open sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net


Summary

  • The paper presents Assemblage, a tool that automates binary dataset construction using a cloud-based distributed system for enhanced scalability and reproducibility.
  • It details how tracking build configurations and licensing on 890k Windows and 428k Linux binaries ensures dataset diversity and legal compliance.
  • The paper evaluates ML models on tasks like compiler provenance and function similarity, highlighting the benefits of realistic, extensive datasets.

A Closer Look at Assemblage: Enhancing Binary Dataset Construction for Machine Learning

Introduction to Assemblage

Binary analysis underpins tasks such as reverse engineering and malware detection, but high-quality datasets of binaries, particularly benign ones for Windows, have long been hard to obtain. Assemblage addresses this problem by automating the construction of large, diverse binary corpora. Running as a cloud-based distributed system, it crawls code hosting platforms such as GitHub, then configures and builds the projects it finds, with an emphasis on reproducibility and extensibility.

Key Features of Assemblage

  • Scalability and Reproducibility: Assemblage runs on cloud infrastructure, with a coordinator node that manages tasks and a pool of worker nodes that carry them out. This design provides high throughput and tolerates failures of individual components. Because dataset builds can be reproduced reliably, Assemblage is useful in both academic and industrial research settings.
  • Extensive Data Collection: Over a year of running on AWS, Assemblage produced 890k Windows PE and 428k Linux ELF binaries across 29 configurations. The system also records the detailed build configuration of each binary, which makes it possible to reconstruct the build environment and to analyze the binary's provenance.
  • Licensing and Compliance: Assemblage tracks the licenses under which the source code is published. This allows the datasets to be distributed and used without infringing software licenses, which has been a notable barrier to dataset creation for binary analysis.
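
The coordinator and worker arrangement described above can be illustrated with a small, single-process sketch. It is not Assemblage's actual code: the names `BuildTask`, `Coordinator`, and `make_flaky_worker`, the retry limit, and the record fields are all assumptions made for illustration. The sketch shows a coordinator that hands tasks to a worker, re-queues a task when the worker fails, and stores the build configuration and license alongside each successful build.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BuildTask:
    """Hypothetical description of one build: source, license, and configuration."""
    repo_url: str
    license: str
    compiler: str
    optimization: str


@dataclass
class Coordinator:
    """Hypothetical coordinator: queues tasks, retries failures, keeps records."""
    max_attempts: int = 3
    pending: deque = field(default_factory=deque)
    attempts: dict = field(default_factory=dict)
    records: list = field(default_factory=list)
    failed: list = field(default_factory=list)

    def submit(self, task):
        self.pending.append(task)
        self.attempts[task] = 0

    def run(self, worker):
        """Drain the queue, calling worker(task) for each pending task."""
        while self.pending:
            task = self.pending.popleft()
            self.attempts[task] += 1
            try:
                binary_name = worker(task)
            except RuntimeError:
                # A failed worker does not lose the task: re-queue it until
                # the attempt limit is reached, then mark it as failed.
                if self.attempts[task] < self.max_attempts:
                    self.pending.append(task)
                else:
                    self.failed.append(task)
                continue
            # Store the build configuration and license with the output so
            # the build can be reproduced and its provenance inspected later.
            self.records.append({
                "repo_url": task.repo_url,
                "license": task.license,
                "compiler": task.compiler,
                "optimization": task.optimization,
                "binary": binary_name,
                "attempts": self.attempts[task],
            })


def make_flaky_worker():
    """Return a simulated worker that fails the first time it sees each task."""
    seen = set()

    def worker(task):
        if task not in seen:
            seen.add(task)
            raise RuntimeError("simulated worker failure")
        return f"{task.compiler}-{task.optimization}.bin"

    return worker
```

In a real deployment the queue would live in a message broker or database and the workers would be separate machines. The sketch keeps everything in one process so that only the retry and record-keeping logic is visible.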

Practical Applications and Evaluations

The datasets generated by Assemblage were used to evaluate machine learning models for binary analysis on tasks including compiler provenance and binary function similarity. The results were mixed, showing both the capabilities and the current limits of existing models.

  1. Compiler Provenance: The detailed build configuration data allowed for testing models that predict compiler settings from binary files. Results indicated a clear need for models that handle Windows binaries as accurately as they handle Linux binaries.
  2. Binary Function Similarity: Evaluating models on function similarity tasks with diverse, realistic data such as that provided by Assemblage revealed significant generalizability issues in models trained on smaller, less varied datasets.
  3. Transformer-based Learning: Recent advances in applying transformer models to binary analysis were tested. The findings suggested that while these models perform well on the specific datasets they were trained on, their performance on an extended, diverse dataset from Assemblage was not as robust, highlighting the importance of diverse training datasets.
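
One way to surface the kind of platform gap mentioned for compiler provenance is to report accuracy separately for each platform instead of as a single number. The sketch below assumes each sample is a dictionary with hypothetical `platform` and `compiler` keys taken from recorded build configurations, and that `predict` is any function that guesses the compiler label. The samples and the predictor are toy placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_platform(samples, predict):
    """Return {platform: fraction of samples whose compiler was predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        platform = sample["platform"]
        total[platform] += 1
        if predict(sample) == sample["compiler"]:
            correct[platform] += 1
    return {platform: correct[platform] / total[platform] for platform in total}


# Toy data for illustration only; the labels would come from build records.
toy_samples = [
    {"platform": "linux", "compiler": "gcc"},
    {"platform": "linux", "compiler": "gcc"},
    {"platform": "windows", "compiler": "msvc"},
    {"platform": "windows", "compiler": "clang"},
]


def always_gcc(sample):
    """Placeholder predictor that ignores its input."""
    return "gcc"


per_platform = accuracy_by_platform(toy_samples, always_gcc)
```

A single aggregate accuracy over these toy samples would be 0.5, which hides that the placeholder predictor is right on every Linux sample and wrong on every Windows sample. Splitting the metric by platform makes that difference explicit.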

Implications and Future Directions

By addressing the urgent need for comprehensive and compliant datasets, Assemblage not only supports current research but also sets the stage for future advancements in binary analysis. The ability to train models on realistically varied data is likely to lead to more robust, generalizable tools for cybersecurity and malware detection.

Further development of Assemblage could include enhancements in malware detection capabilities within the dataset generation process, broader support for additional binary formats, and even more extensive datasets covering varied source platforms.

Conclusion

Assemblage contributes to machine learning for binary analysis by producing reproducible, large-scale datasets that respect licensing requirements. Its cloud-based, distributed architecture makes it practical to build such corpora at scale. The evaluation against current machine learning models shows both the utility of Assemblage and the challenges that remain, pointing to directions for future research on binary analysis tools.
