- The paper introduces BinPool, a novel dataset that provides vulnerable and patched binaries compiled at four optimization levels to enable precise vulnerability detection and evaluation.
- It details an automated curation process using Debian Security Tracker, snapshots, and standard package tools to extract metadata mapping source code changes to binary offsets.
- The dataset comprises 603 CVEs from 162 Debian packages, offering a robust benchmark for evaluating both machine learning and traditional binary security analysis tools.
The paper "BinPool: A Dataset of Vulnerabilities for Binary Security Analysis" (2504.19055) introduces a publicly available dataset designed to support the development and evaluation of vulnerability detection techniques, particularly those operating on binary code. The authors stress that such techniques, especially machine learning-based ones, depend on robust and correctly labeled datasets. They argue that existing datasets for binary security analysis suffer from significant limitations: lack of public availability, insufficient semantic diversity, reliance on artificially introduced vulnerabilities, and potentially incorrect labels derived from static analysis tools.
To address these issues, BinPool was created. It is a collection of binaries exhibiting historical, real-world vulnerabilities found in Debian packages. A key feature of the dataset is the inclusion of both the vulnerable and the corresponding patched versions of each program, compiled at four different optimization levels, including versions with debug symbols. This structure is valuable for tasks requiring comparisons between vulnerable and fixed code or analysis across different compilation settings.
The dataset's curation process is largely automated and leverages resources from the Debian project: the Debian Security Tracker, the archive of Debian snapshots, and the standard Debian package building system. The process involves three main phases:
- Vulnerability Data Collection: CVE-IDs, associated CWEs, and version information about affected and fixed packages are gathered from the Debian Security Tracker and NVD. Links to corresponding source code archives are found using Debian Snapshots.
- Package Build Process: The automated system uses standard Debian tools (`build-dep` for dependencies, `dpkg-buildpackage` for compilation) and the `quilt` tool to apply or remove specific patches related to the vulnerability. This allows for building both vulnerable and patched versions. Each variant is built at four optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) and includes debug symbols (`-g`).
- Metadata Extraction: After building, the system extracts `.deb` files and locates the relevant binaries (ELF files) modified by the patch. Debug information (DWARF) embedded in the binaries is used to map source code locations (files, functions, lines affected by the patch, identified by parsing the patch file and using the clang frontend) to precise memory offsets in the binary.
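The first step of the metadata extraction phase, finding which source lines a patch touches, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `changed_line_ranges`, the sample patch, and the choice to report pre-patch line ranges are all assumptions, and the `a/` and `b/` prefix stripping is a simplification of how Debian patches name files.

```python
import re
from collections import defaultdict

# Matches a unified-diff hunk header such as "@@ -120,7 +120,9 @@ int parse(...)".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_line_ranges(patch_text):
    """Return {path: [(start, end), ...]} with the pre-patch line range of each hunk.

    Ranges cover the whole hunk, including its context lines.
    """
    ranges = defaultdict(list)
    current = None
    old_left = new_left = 0  # lines still expected in the current hunk body
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: consume lines according to their prefix.
            if line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("--- "):
            # "--- a/src/foo.c<TAB>timestamp" -> "src/foo.c"
            path = line[4:].split("\t")[0]
            current = path[2:] if path.startswith(("a/", "b/")) else path
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            start = int(match.group(1))
            old_len = int(match.group(2)) if match.group(2) is not None else 1
            new_len = int(match.group(4)) if match.group(4) is not None else 1
            old_left, new_left = old_len, new_len
            ranges[current].append((start, start + max(old_len, 1) - 1))
    return dict(ranges)


# Hypothetical patch used only to exercise the parser.
SAMPLE_PATCH = "\n".join([
    "--- a/src/parser.c",
    "+++ b/src/parser.c",
    "@@ -10,3 +10,4 @@ int parse(char *buf)",
    " int n = read_len(buf);",
    "+if (n > MAX) return -1;",
    " copy(buf, n);",
    " return 0;",
    "@@ -40 +41 @@",
    "-old();",
    "+new();",
])

print(changed_line_ranges(SAMPLE_PATCH))
```

A DWARF reader would then look up these file and line pairs in the binary's line table to obtain the corresponding offsets; that step is omitted here because it depends on a third-party ELF/DWARF library.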
The resulting BinPool dataset includes data for 603 distinct CVEs across 89 CWE classes, derived from 162 Debian packages. It contains a total of 6144 binaries. The dataset's structure includes metadata files (pkl/JSON), the vulnerable and patched binaries themselves, and a central CSV file detailing CVEs, CWEs, versions, and links to source code. The detailed metadata provides information about the specific source functions (910 unique) and binary functions (7280 unique) involved in the vulnerabilities, including their exact locations.
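As a rough illustration of how the central CSV file could be consumed, the sketch below counts distinct CVEs per CWE class. The column names (`cve`, `cwe`, `package`, and the version fields) and the rows are synthetic placeholders, not the dataset's actual schema or contents; the real header should be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Synthetic stand-in for the central CSV; column names and rows are placeholders.
SAMPLE_CSV = """cve,cwe,package,vulnerable_version,fixed_version
CVE-0000-0001,CWE-787,pkg-a,1.0-1,1.0-2
CVE-0000-0002,CWE-787,pkg-b,2.3-1,2.3-2
CVE-0000-0003,CWE-125,pkg-a,1.1-1,1.1-2
"""


def cves_per_cwe(csv_file):
    """Count distinct CVE identifiers for each CWE label in a CSV stream."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (row["cve"], row["cwe"])
        if key not in seen:  # ignore repeated (CVE, CWE) rows
            seen.add(key)
            counts[row["cwe"]] += 1
    return counts


counts = cves_per_cwe(io.StringIO(SAMPLE_CSV))
for cwe, n in counts.most_common():
    print(cwe, n)
```

Passing an open file handle instead of the `io.StringIO` wrapper would apply the same count to a CSV on disk.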
The paper suggests several potential applications for BinPool:
- Vulnerability Discovery: It serves as a benchmark for evaluating both machine learning-based tools (like the authors' previous work, BinHunter [2]) and traditional program analysis systems (like angr [shoshitaishvili2016sok]). The dataset's diversity in real-world vulnerabilities and program semantics makes it a challenging testbed.
- Benchmarking Intermediate Analyses: The detailed metadata can be used to evaluate components of larger analysis pipelines, such as precise data flow analysis in binaries.
- Binary Function Similarity and Code Search: The presence of matched vulnerable and patched binaries compiled at different optimization levels makes the dataset suitable for benchmarking algorithms designed to detect similarity or search for specific code across variations.
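To make the function similarity use case concrete, the sketch below ranks candidate functions against a query by Jaccard similarity over instruction mnemonic bigrams. This is a deliberately simple baseline chosen for illustration, not a method from the paper; the mnemonic sequences and function labels are hypothetical, and in practice they would come from disassembling the functions that the metadata points to.

```python
def ngrams(mnemonics, n=2):
    """Return the set of n-grams over a sequence of instruction mnemonics."""
    return {tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_candidates(query, candidates, n=2):
    """Sort candidate functions by descending similarity to the query."""
    q = ngrams(query, n)
    scored = [(name, jaccard(q, ngrams(seq, n))) for name, seq in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical mnemonic sequences standing in for disassembled functions.
vulnerable = ["push", "mov", "cmp", "jle", "call", "ret"]
candidates = {
    "patched_O0": ["push", "mov", "cmp", "jle", "cmp", "ja", "call", "ret"],
    "unrelated": ["xor", "add", "shl", "ret"],
}

ranking = rank_candidates(vulnerable, candidates)
print(ranking)
```

A benchmark built on the dataset could run such a ranking across optimization levels and check whether the matching function appears near the top.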
While BinPool offers significant advantages in terms of real-world data and detailed metadata, the authors acknowledge limitations. The number of CVEs per CWE category is still small, so the dataset is better suited as an evaluation test set than as the sole training set for robust machine learning classifiers. The dataset also currently lacks information beyond code modifications, such as failing test cases, error traces from symbolic execution, or comprehensive inter-procedural data flow details, which could provide deeper insight into how vulnerabilities are triggered and propagate. Future work aims to address these limitations by continuously expanding the dataset and enriching the metadata.
Overall, BinPool is presented as a valuable resource for the binary security research community, providing a standardized, real-world dataset with detailed annotations to drive the development and evaluation of advanced vulnerability detection and binary analysis tools. The dataset and automation scripts are publicly available on GitHub.