Assemblage: Automatic Binary Dataset Construction for Machine Learning (2405.03991v2)
Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine-learning-based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpora suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage code is open-sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net