Assemblage: Automatic Binary Dataset Construction for Machine Learning (2405.03991v2)

Published 7 May 2024 in cs.CR and cs.LG

Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage code is open sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net

Summary

The paper presents Assemblage, a tool that automates binary dataset construction using a cloud-based distributed system for enhanced scalability and reproducibility.
It details how tracking build configurations and licensing across 890k Windows PE and 428k Linux ELF binaries supports dataset diversity and legal compliance.
The paper evaluates ML models on tasks like compiler provenance and function similarity, highlighting the benefits of realistic, extensive datasets.

A Closer Look at Assemblage: Enhancing Binary Dataset Construction for Machine Learning

Introduction to Assemblage

Binary analysis underpins tasks such as reverse engineering and malware detection, but obtaining high-quality datasets of binaries, particularly benign Windows binaries, has long been a stumbling block. Assemblage is a tool designed to address this by automating the construction of large, diverse binary corpora. Built as a cloud-based distributed system, it crawls code hosting platforms such as GitHub, then configures and builds the projects it finds into binaries, with an emphasis on reproducibility and extensibility.

Key Features of Assemblage

Scalability and Reproducibility: Assemblage runs on cloud infrastructure, with a coordinator node that manages tasks and a pool of worker nodes. This design provides high throughput and tolerates failures of individual components. Because dataset builds can be reproduced reliably, Assemblage is particularly valuable in academic and industrial research settings.
Extensive Data Collection: Running on AWS over the span of a year, Assemblage has produced 890k Windows PE and 428k Linux ELF binaries across 29 configurations. The system also records detailed build configurations, which makes it possible to reconstruct the build environment and analyze the provenance of each binary.
Licensing and Compliance: Assemblage meticulously tracks the licenses under which the source code is published. This attention to legal details paves the way for distributing and using the datasets without infringing on software licenses, which has been a notable barrier in dataset creation for binary analysis.
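The coordinator and worker arrangement described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not Assemblage's actual code: the names `BuildConfig`, `BuildRecord`, `fake_build`, `coordinate`, and `redistributable`, the toolchain labels, and the license allow-list are all hypothetical, and a thread pool stands in for a pool of cloud worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical build configuration; field names are illustrative."""
    toolchain: str
    optimization: str
    arch: str


@dataclass
class BuildRecord:
    """What gets stored per binary: provenance, license, and outcome."""
    repo: str
    license: str
    config: BuildConfig
    status: str
    attempts: int


def fake_build(repo, config, attempt):
    """Stand-in for a real build. One repo fails on its first attempt
    so that the retry path is exercised."""
    return not (repo == "flaky/repo" and attempt == 1)


def run_task(task, max_attempts=2):
    """Worker side: build one (repo, config) pair, retrying on failure."""
    repo, license_id, config = task
    for attempt in range(1, max_attempts + 1):
        if fake_build(repo, config, attempt):
            return BuildRecord(repo, license_id, config, "success", attempt)
    return BuildRecord(repo, license_id, config, "failed", max_attempts)


def coordinate(repos, configs, workers=4):
    """Coordinator side: pair every repo with every configuration and
    hand the tasks to the worker pool. Results keep task order."""
    tasks = [(name, lic, cfg) for (name, lic) in repos for cfg in configs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, tasks))


def redistributable(records, allowed=("MIT", "Apache-2.0", "BSD-3-Clause")):
    """Downstream filter using the recorded license (assumed allow-list)."""
    return [r for r in records if r.status == "success" and r.license in allowed]


if __name__ == "__main__":
    repos = [("good/repo", "MIT"), ("flaky/repo", "Apache-2.0"),
             ("other/repo", "Unknown")]
    configs = [BuildConfig(t, o, "x64")
               for t, o in product(["toolchain-a", "toolchain-b"], ["Od", "O2"])]
    records = coordinate(repos, configs)
    print(len(records), "builds;", len(redistributable(records)), "redistributable")
```

Keeping the license and the full configuration on every record is what lets a downstream user filter the corpus by license or regroup binaries by build settings without rebuilding anything.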

Practical Applications and Evaluations

Using the rich datasets generated by Assemblage, various machine learning models for binary analysis were evaluated. Tasks like compiler provenance, binary function similarity, and more were explored with mixed results, illuminating both the capabilities and the current limits of existing models.

Compiler Provenance: The detailed build configuration data made it possible to test models that predict compiler settings from binary files. Results indicated a clear need for models that handle Windows binaries as accurately as they handle Linux binaries.
Binary Function Similarity: Evaluating models on function similarity tasks with diverse, realistic data such as that provided by Assemblage revealed significant generalizability issues in models trained on smaller, less varied datasets.
Transformer-based Learning: Recent advances in applying transformer models to binary analysis were tested. The findings suggested that while these models perform well on the specific datasets they were trained on, their performance on an extended, diverse dataset from Assemblage was not as robust, highlighting the importance of diverse training datasets.
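One common way to quantify the kind of generalization gap described above is to compare a retrieval metric on in-distribution queries with the same metric on queries from a more varied corpus. The sketch below computes recall@1 for function similarity using cosine similarity over embeddings. It is a minimal illustration, not the paper's evaluation protocol: the function names, the two-dimensional embeddings, and the toy data are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_1(queries, pool):
    """queries and pool are lists of (function_id, embedding).
    A query counts as a hit when its nearest pool entry by cosine
    similarity carries the same function_id."""
    hits = 0
    for query_id, query_vec in queries:
        best_id, _ = max(pool, key=lambda item: cosine(query_vec, item[1]))
        hits += best_id == query_id
    return hits / len(queries)


# Toy reference pool: one embedding per known function (made-up values).
pool = [("memcpy", [1.0, 0.0]), ("strlen", [0.0, 1.0])]

# Queries resembling the training distribution.
in_distribution = [("memcpy", [0.9, 0.1]), ("strlen", [0.1, 0.9])]

# Queries from a different build setting, where embeddings have drifted.
shifted = [("memcpy", [0.4, 0.6]), ("strlen", [0.2, 0.8])]

in_dist_score = recall_at_1(in_distribution, pool)
shifted_score = recall_at_1(shifted, pool)
gap = in_dist_score - shifted_score

if __name__ == "__main__":
    print(f"in-distribution recall@1: {in_dist_score:.2f}")
    print(f"shifted recall@1:         {shifted_score:.2f}")
    print(f"generalization gap:       {gap:.2f}")
```

In this toy setup the drifted `memcpy` embedding lands closer to `strlen`, so recall@1 drops from 1.0 to 0.5. A real evaluation would use model-produced embeddings and a much larger pool.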

Implications and Future Directions

By addressing the urgent need for comprehensive and compliant datasets, Assemblage not only supports current research but also sets the stage for future advancements in binary analysis. The ability to train models on realistically varied data is likely to lead to more robust, generalizable tools for cybersecurity and malware detection.

Further development of Assemblage could include enhancements in malware detection capabilities within the dataset generation process, broader support for additional binary formats, and even more extensive datasets covering varied source platforms.

Conclusion

Assemblage is a significant development for machine learning applied to binary analysis, particularly for its focus on reproducible, large-scale datasets that respect licensing requirements. Its cloud-based, distributed architecture addresses a complex data-collection problem at scale. The evaluation against current machine learning models shows both the utility of Assemblage’s data and the challenges that remain, pointing to directions for future research and development in binary analysis tools.
