
BinPool: A Dataset of Vulnerabilities for Binary Security Analysis (2504.19055v1)

Published 27 Apr 2025 in cs.CR

Abstract: The development of machine learning techniques for discovering software vulnerabilities relies fundamentally on the availability of appropriate datasets. The ideal dataset consists of a large and diverse collection of real-world vulnerabilities, paired so as to contain both vulnerable and patched versions of each program. Naturally, collecting such datasets is a laborious and time-consuming task. Within the specific domain of vulnerability discovery in binary code, previous datasets are either publicly unavailable, lack semantic diversity, involve artificially introduced vulnerabilities, or were collected using static analyzers, thereby themselves containing incorrectly labeled example programs. In this paper, we describe a new publicly available dataset which we dubbed Binpool, containing numerous samples of vulnerable versions of Debian packages across the years. The dataset was automatically curated, and contains both vulnerable and patched versions of each program, compiled at four different optimization levels. Overall, the dataset covers 603 distinct CVEs across 89 CWE classes, 162 Debian packages, and contains 6144 binaries. We argue that this dataset is suitable for evaluating a range of security analysis tools, including for vulnerability discovery, binary function similarity, and plagiarism detection.


Summary

  • The paper introduces BinPool, a novel dataset that provides vulnerable and patched binaries compiled at four optimization levels to enable precise vulnerability detection and evaluation.
  • It details an automated curation process using Debian Security Tracker, snapshots, and standard package tools to extract metadata mapping source code changes to binary offsets.
  • The dataset comprises 603 CVEs from 162 Debian packages, offering a robust benchmark for evaluating both machine learning and traditional binary security analysis tools.

The paper "BinPool: A Dataset of Vulnerabilities for Binary Security Analysis" (2504.19055) introduces a new publicly available dataset specifically designed to facilitate the development and evaluation of vulnerability detection techniques, particularly those operating at the binary code level. The authors highlight the fundamental reliance of such techniques, especially machine learning-based ones, on robust and appropriately labeled datasets. They argue that existing datasets for binary security analysis suffer from significant limitations, including lack of public availability, insufficient semantic diversity, reliance on artificially introduced vulnerabilities, or potentially incorrect labeling derived from static analysis tools.

To address these issues, the authors created BinPool, a collection of binaries exhibiting historical, real-world vulnerabilities found in Debian packages. A key feature of the dataset is that it includes both the vulnerable and the corresponding patched version of each program, each compiled at four different optimization levels with debug symbols. This structure is valuable for tasks that compare vulnerable code against fixed code or that analyze the same code across compilation settings.
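As a rough illustration of that structure, the sketch below enumerates the builds implied for a single vulnerable/patched package pair. The names `VARIANTS`, `OPT_LEVELS`, and `build_matrix` are hypothetical, and the GCC-style flags `-O0` to `-O3` with `-g` are an assumption about how the four levels and debug symbols would be requested; the authors' scripts may organize this differently.

```python
from itertools import product
from typing import Dict, List

# Hypothetical names; the flag spelling assumes a GCC-style compiler.
VARIANTS = ("vulnerable", "patched")
OPT_LEVELS = ("-O0", "-O1", "-O2", "-O3")


def build_matrix(package: str) -> List[Dict[str, str]]:
    """Enumerate every (variant, optimization level) build for one package."""
    return [
        {"package": package, "variant": variant, "cflags": f"{opt} -g"}
        for variant, opt in product(VARIANTS, OPT_LEVELS)
    ]


for build in build_matrix("example-pkg"):
    print(build)
```

Two variants times four optimization levels gives eight builds per package pair, which is why comparisons can be made both across the patch and across compiler settings.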

The dataset's curation process is largely automated and leverages resources from the Debian project: the Debian Security Tracker, the archive of Debian snapshots, and the standard Debian package building system. The process involves three main phases:

  1. Vulnerability Data Collection: CVE-IDs, associated CWEs, and version information about affected and fixed packages are gathered from the Debian Security Tracker and NVD. Links to corresponding source code archives are found using Debian Snapshots.
  2. Package Build Process: The automated system uses standard Debian tools (build-dep for dependencies, dpkg-buildpackage for compilation) and the quilt tool to apply or remove specific patches related to the vulnerability. This allows for building both vulnerable and patched versions. Each variant is built at four optimization levels (-O0, -O1, -O2, -O3) and includes debug symbols (-g).
  3. Metadata Extraction: After building, the system extracts .deb files and locates the relevant binaries (ELF files) modified by the patch. Debug information (DWARF) embedded in the binaries is used to map source code locations (files, functions, lines affected by the patch, identified by parsing the patch file and using the clang frontend) to precise memory offsets in the binary.
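The metadata extraction step can be sketched in simplified form. The paper relies on the clang frontend and DWARF debug information; the example below instead assumes a regex over unified-diff hunk headers and a hand-written line table of `(address, file, line)` rows standing in for what a DWARF reader would supply. All names (`changed_line_ranges`, `addresses_for`, `PATCH`, `LINE_TABLE`) and the sample values are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches unified-diff hunk headers such as "@@ -10,3 +10,4 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

LineRow = Tuple[int, str, int]  # (address, source file, line number)


def changed_line_ranges(patch_text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each patched file to the pre-patch (first, last) line range of every hunk.

    Simplification: hunk ranges include context lines, and a removed line that
    itself begins with "-- " would be mistaken for a file header.
    """
    ranges: Dict[str, List[Tuple[int, int]]] = {}
    current = None
    for line in patch_text.splitlines():
        if line.startswith("--- "):
            path = line[4:].split("\t")[0].strip()
            # Drop the conventional "a/" prefix if present.
            current = path[2:] if path.startswith("a/") else path
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            start = int(match.group(1))
            count = int(match.group(2)) if match.group(2) is not None else 1
            last = start + max(count, 1) - 1
            ranges.setdefault(current, []).append((start, last))
    return ranges


def addresses_for(
    ranges: Dict[str, List[Tuple[int, int]]], line_table: List[LineRow]
) -> Dict[str, List[int]]:
    """Collect the addresses whose source line falls inside a patched range."""
    hits: Dict[str, List[int]] = {}
    for address, path, line in line_table:
        if any(first <= line <= last for first, last in ranges.get(path, [])):
            hits.setdefault(path, []).append(address)
    return {path: sorted(addrs) for path, addrs in hits.items()}


# Made-up patch and line table, for illustration only.
PATCH = """\
--- a/src/parse.c
+++ b/src/parse.c
@@ -10,3 +10,4 @@ int parse(char *buf)
     int n = read_len(buf);
+    if (n > MAX) return -1;
     memcpy(dst, buf, n);
     return 0;
"""

LINE_TABLE: List[LineRow] = [
    (0x1130, "src/parse.c", 9),
    (0x1138, "src/parse.c", 10),
    (0x1144, "src/parse.c", 11),
    (0x1150, "src/parse.c", 12),
    (0x1160, "src/parse.c", 13),
]

print(addresses_for(changed_line_ranges(PATCH), LINE_TABLE))
```

Because the line numbers come from the `-` side of each hunk header, the resulting addresses refer to the vulnerable build; mapping into the patched build would use the `+` side and that build's own line table.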

The resulting BinPool dataset includes data for 603 distinct CVEs across 89 CWE classes, derived from 162 Debian packages. It contains a total of 6144 binaries. The dataset's structure includes metadata files (pkl/JSON), the vulnerable and patched binaries themselves, and a central CSV file detailing CVEs, CWEs, versions, and links to source code. The detailed metadata provides information about the specific source functions (910 unique) and binary functions (7280 unique) involved in the vulnerabilities, including their exact locations.
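To show how such a central CSV index might be queried, here is a minimal sketch that groups CVEs by CWE class. The column names (`cve`, `cwe`, `package`, `vulnerable_version`, `fixed_version`) and the placeholder rows are assumptions for illustration only; the real file's schema should be checked in the published dataset.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Placeholder rows with made-up identifiers; column names are hypothetical.
SAMPLE_INDEX = """\
cve,cwe,package,vulnerable_version,fixed_version
CVE-0000-0001,CWE-787,pkg-a,1.0-1,1.0-2
CVE-0000-0002,CWE-787,pkg-b,2.3-1,2.3-2
CVE-0000-0003,CWE-125,pkg-a,1.0-2,1.0-3
"""


def cves_by_cwe(csv_text: str) -> Dict[str, List[str]]:
    """Group the distinct CVE identifiers in the index by their CWE class."""
    groups = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["cwe"]].add(row["cve"])
    return {cwe: sorted(cves) for cwe, cves in groups.items()}


for cwe, cves in cves_by_cwe(SAMPLE_INDEX).items():
    print(cwe, len(cves), cves)
```

A per-CWE count of this kind is also a quick way to see how many examples a given vulnerability class contributes before using the dataset for evaluation.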

The paper suggests several potential applications for BinPool:

  • Vulnerability Discovery: It serves as a benchmark for evaluating both machine learning-based tools (like the authors' previous work, BinHunter [2]) and traditional program analysis systems (like angr; Shoshitaishvili et al., 2016). The dataset's diversity in real-world vulnerabilities and program semantics makes it a challenging testbed.
  • Benchmarking Intermediate Analyses: The detailed metadata can be used to evaluate components of larger analysis pipelines, such as precise data flow analysis in binaries.
  • Binary Function Similarity and Code Search: The presence of matched vulnerable and patched binaries compiled at different optimization levels makes the dataset suitable for benchmarking algorithms designed to detect similarity or search for specific code across variations.
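For the function similarity use case, one way to turn the matched binaries into an evaluation set is to pair up samples of the same function. The sketch below is an assumption about how such pairs could be formed, not the paper's protocol; `FuncSample`, `cross_opt_pairs`, `patch_pairs`, and the sample function name are hypothetical.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class FuncSample(NamedTuple):
    cve: str
    function: str
    variant: str  # "vulnerable" or "patched"
    opt: str      # "-O0" .. "-O3"


Pair = Tuple[FuncSample, FuncSample]


def cross_opt_pairs(samples: List[FuncSample]) -> List[Pair]:
    """Same function and variant, compiled at two different optimization levels."""
    return [
        (a, b)
        for a, b in combinations(samples, 2)
        if (a.cve, a.function, a.variant) == (b.cve, b.function, b.variant)
        and a.opt != b.opt
    ]


def patch_pairs(samples: List[FuncSample]) -> List[Pair]:
    """Same function and optimization level, vulnerable version vs patched version."""
    return [
        (a, b)
        for a in samples
        for b in samples
        if (a.cve, a.function, a.opt) == (b.cve, b.function, b.opt)
        and a.variant == "vulnerable"
        and b.variant == "patched"
    ]


# One made-up function present in all eight builds.
SAMPLES = [
    FuncSample("CVE-0000-0001", "parse_header", variant, opt)
    for variant in ("vulnerable", "patched")
    for opt in ("-O0", "-O1", "-O2", "-O3")
]

print(len(cross_opt_pairs(SAMPLES)), len(patch_pairs(SAMPLES)))
```

Cross-optimization pairs test robustness to compiler settings, while vulnerable/patched pairs test whether a similarity measure is sensitive enough to notice a small security fix.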

While BinPool offers significant advantages in terms of real-world data and detailed metadata, the authors acknowledge limitations. The number of CVEs per CWE category is still small, so the dataset is better suited to serve as an evaluation test set than to train robust machine learning classifiers on its own. The dataset also currently lacks information beyond code modifications, such as failing test cases, error traces from symbolic execution, or comprehensive inter-procedural data flow details, which could provide deeper insight into how vulnerabilities are triggered and propagate. Future work aims to address these limitations by continuously expanding the dataset and enriching the metadata.

Overall, BinPool is presented as a valuable resource for the binary security research community, providing a standardized, real-world dataset with detailed annotations to drive the development and evaluation of advanced vulnerability detection and binary analysis tools. The dataset and automation scripts are publicly available on GitHub.
