Binary Patch Dataset: Overview & Applications

Updated 14 September 2025

A binary patch dataset is a curated collection of paired binaries representing software before and after patching, enabling detailed vulnerability and security analysis.
These datasets are constructed using automated workflows that compile binaries under various configurations and encode rich metadata, ensuring real-world relevance in benchmark studies.
They underpin research in binary analysis and machine learning-based vulnerability detection by providing diverse, labeled samples from sources like Debian packages and Java libraries.

A binary patch dataset is a curated collection of data samples, each representing the state of binary code before and after the application of a patch. Such datasets are fundamental for research in binary program analysis, vulnerability detection, patch presence testing, and learning-based approaches to binary security. The integrity, diversity, and granularity of a binary patch dataset directly impact the reliability of machine learning models and binary analysis tools benchmarked on it.

1. Definition, Scope, and Dataset Construction

Binary patch datasets consist of binaries compiled both before and after automated or manual patching processes. Construction methods involve sourcing known security patches (e.g., from CVE entries or security trackers), fetching relevant source code snapshots, and compiling binaries under multiple configurations. Notable datasets—such as BinPool (Arasteh et al., 27 Apr 2025), BinGo (He et al., 2023), PS³ (Zhan et al., 2023), and PPT4J (Pan et al., 2023)—follow automated workflows to collect diverse vulnerable and patched binaries, often leveraging open-source repositories (Debian packages, Linux kernel, Java libraries), security advisories, and patch databases for their base data.

For instance, BinPool is curated from Debian packages by selectively applying patches, resulting in pairs of vulnerable and patched binaries compiled at four optimization levels, yielding 6,144 binaries across 603 distinct CVEs and 89 CWE classes (Arasteh et al., 27 Apr 2025). BinGo leverages PatchDB as its source, compiles binaries using both GCC and Clang across optimization levels, and maps source-level changes to binary code via DWARF debugging information (He et al., 2023). PPT4J builds its dataset from 110 vulnerabilities in Java libraries, compiling project snapshots to bytecode (Pan et al., 2023).
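The "multiple configurations" idea can be made concrete with a small sketch that enumerates a build matrix for one vulnerability. Everything here is an illustrative assumption rather than the workflow of any cited dataset: the `BuildJob` class, the output naming scheme, the choice of two compilers and four optimization levels, and the placeholder CVE identifier are all hypothetical. BinPool, for example, builds through Debian packaging tools rather than direct compiler invocations.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Assumed configuration axes; real datasets choose their own.
STATES = ("vulnerable", "patched")
COMPILERS = ("gcc", "clang")
OPT_LEVELS = ("-O0", "-O1", "-O2", "-O3")


@dataclass(frozen=True)
class BuildJob:
    """One cell of the build matrix (hypothetical schema)."""
    cve_id: str
    state: str
    compiler: str
    opt_level: str

    def command(self, source: str) -> List[str]:
        """Return the argv that would compile `source` for this job."""
        output = f"{self.cve_id}_{self.state}_{self.compiler}{self.opt_level}.o"
        # -g keeps debug information, useful for source-to-binary mapping.
        return [self.compiler, self.opt_level, "-g", "-c", source, "-o", output]


def build_matrix(cve_id: str) -> List[BuildJob]:
    """Enumerate every (state, compiler, optimization level) combination."""
    return [
        BuildJob(cve_id, state, compiler, opt)
        for state, compiler, opt in product(STATES, COMPILERS, OPT_LEVELS)
    ]


jobs = build_matrix("CVE-0000-0000")  # placeholder identifier
print(len(jobs))  # 2 states x 2 compilers x 4 levels
print(jobs[0].command("vuln.c"))
```

The sketch only constructs command lines and does not run a compiler, so it can be adapted to whatever build system a project actually uses.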

2. Data Structure, Metadata, and Labeling

A binary patch dataset's structure typically includes the following:

Component	Description	Example Source
Binary Pair	Pre-patch and post-patch binaries (ELF, PE, class/jar, etc.)	BinPool, PPT4J, BinGo
Metadata	CVE/CWE IDs, affected package/module, timestamp, optimization level	BinPool, BinGo
Ground Truth	Explicit mapping of vulnerable vs. patched instances	All referenced datasets
Compilation Info	Compiler, flags, dependencies, versioning	BinPool, BinGo, PPT4J
Change Mapping	Source line–to–binary unit mapping, function-level annotation	BinPool, BinGo

The labeling process is grounded in vulnerability metadata—for example, mapping each binary to a specific CVE and marking before/after patch states via version-controlled commits or patch application scripts. BinPool utilizes automated patch application via tools like quilt and build systems such as dpkg-buildpackage, with metadata stored in CSV, JSON, and pickle formats. BinGo uses diffing algorithms to extract patch-relevant basic blocks at the function and control-flow graph (CFG) levels.
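As a rough illustration of how the components above might be stored together, the following sketch defines one record per binary pair and round-trips it through JSON. The `PatchPairRecord` class and all of its field names are assumptions made for this example; they are not the schema of BinPool, BinGo, or any other cited dataset, and the CVE and CWE values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PatchPairRecord:
    """Hypothetical metadata record for one vulnerable/patched pair."""
    cve_id: str
    cwe_id: str
    package: str
    compiler: str
    opt_level: str
    vulnerable_binary: str   # path to the pre-patch binary
    patched_binary: str      # path to the post-patch binary
    changed_functions: List[str] = field(default_factory=list)


def to_json(record: PatchPairRecord) -> str:
    """Serialize a record with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> PatchPairRecord:
    """Rebuild a record from its JSON form."""
    return PatchPairRecord(**json.loads(text))


record = PatchPairRecord(
    cve_id="CVE-0000-0000",          # placeholder
    cwe_id="CWE-000",                # placeholder
    package="examplepkg",
    compiler="gcc",
    opt_level="O2",
    vulnerable_binary="examplepkg/vuln/O2/lib.so",
    patched_binary="examplepkg/patch/O2/lib.so",
    changed_functions=["parse_header"],
)
restored = from_json(to_json(record))
print(restored == record)
```

Keeping the ground-truth label implicit in the two path fields (rather than a separate flag) is one design choice among several; a per-binary layout with an explicit `state` column would work equally well.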

3. Patch Presence Testing and Semantic Detection

Binary patch datasets are indispensable for evaluating semantic patch presence testing frameworks. PS³ (Zhan et al., 2023) exemplifies modern approaches by leveraging semantic-level symbolic signatures extracted via symbolic emulation. Rather than relying solely on syntactic binary differences, PS³ collects semantic side effects—register writes, memory stores, conditions, and function calls—during symbolic function simulation. These semantic signatures are compared against references (patched/vulnerable) using well-defined grammars and theorem provers, yielding robust patch detection performance that remains stable across compiler configurations.
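The comparison against vulnerable and patched references can be pictured with a deliberately simplified sketch. It is not PS³'s implementation: it skips symbolic emulation and theorem-prover equivalence checks entirely and treats each signature as a plain set of already-normalized effect tuples. The `Effect` encoding and the `classify` function are hypothetical.

```python
from typing import FrozenSet, Tuple

# (kind, description), e.g. ("call", "memcpy"); an assumed encoding.
Effect = Tuple[str, str]
Signature = FrozenSet[Effect]


def classify(target: Signature, vulnerable_ref: Signature,
             patched_ref: Signature) -> str:
    """Decide which reference the target's side effects resemble."""
    patch_only = patched_ref - vulnerable_ref   # effects the patch introduces
    vuln_only = vulnerable_ref - patched_ref    # effects the patch removes
    if not patch_only and not vuln_only:
        return "undetermined"                   # references are indistinguishable
    has_patch = bool(patch_only & target)
    has_vuln = bool(vuln_only & target)
    if patch_only and patch_only <= target and not has_vuln:
        return "patched"
    if not has_patch and vuln_only <= target:
        return "vulnerable"
    return "undetermined"


vulnerable_ref = frozenset({("call", "memcpy"), ("store", "buf")})
patched_ref = vulnerable_ref | {("cond", "len <= size")}

target_a = patched_ref | {("reg", "rax")}       # extra noise is tolerated
target_b = vulnerable_ref

print(classify(target_a, vulnerable_ref, patched_ref))
print(classify(target_b, vulnerable_ref, patched_ref))
```

Working on the difference between the two references, rather than on whole signatures, is what lets unrelated effects in the target be ignored.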

PPT4J (Pan et al., 2023) focuses on semantic change identification for Java bytecode, employing feature extraction from both source and binary and using the Jaccard similarity coefficient to match semantic features between code versions. Fine-grained diff parsing (additions, deletions, modifications) and feature voting determine if the patch is present, enabling high-fidelity detection even amidst semantic redundancy.
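The Jaccard coefficient and a voting step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not PPT4J's actual pipeline: the string feature encoding, the `best_match` and `patch_present` helpers, and the 0.6 threshold are all hypothetical.

```python
from typing import Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(features: Set[str], candidates: Iterable[Set[str]]) -> float:
    """Highest similarity between one diff line and any binary unit."""
    return max((jaccard(features, c) for c in candidates), default=0.0)


def patch_present(added: List[Set[str]], deleted: List[Set[str]],
                  binary_units: List[Set[str]], threshold: float = 0.6) -> bool:
    """Added lines found in the binary vote 'present'; deleted lines vote 'absent'."""
    present_votes = sum(best_match(f, binary_units) >= threshold for f in added)
    absent_votes = sum(best_match(f, binary_units) >= threshold for f in deleted)
    return present_votes > absent_votes


added = [{"invoke:checkBounds", "load:len"}]
deleted = [{"invoke:unsafeCopy"}]
binary_units = [{"invoke:checkBounds", "load:len", "load:buf"}, {"return"}]

print(jaccard({"a", "b"}, {"b", "c"}))
print(patch_present(added, deleted, binary_units))
```

Here the added line matches the first binary unit with similarity 2/3, which clears the assumed threshold, while the deleted line matches nothing, so the vote favors "patch present".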

4. Learning-based Binary Patch Classification

Binary patch datasets are critical for machine learning models targeting patch and vulnerability detection in binaries. BinGo (He et al., 2023) represents binaries as code property graphs (CPGs), integrating control/data dependencies and employing BERT-based LLMs for node (basic block) embeddings. Siamese graph convolution networks compare pre- and post-patch binaries for patch classification, formalizing the identification function as:

$f_c(p_i) = \begin{cases} p_s & \text{security patch} \\ p_{ns} & \text{non-security patch} \end{cases}$

where $\{p_0, p_1, \ldots, p_n\} \in \text{diff}(bin_a, bin_b)$.

Recent work such as "Empirical Study of Code LLMs for Binary Security Patch Detection" (Li et al., 7 Sep 2025) introduces a large-scale dataset (19,448 samples in assembly and pseudo-code forms) for benchmarking LLMs on binary security patch detection (SPD). Fine-tuning (e.g., via LoRA) on pseudo-code representations yielded better model performance than fine-tuning on raw assembly code, indicating that semantic alignment with source code facilitates learned detection.

5. Diversity, Optimization, and Realism

High-quality binary patch datasets account for compiler diversity and optimization levels, capturing real-world compilation variability. BinPool (Arasteh et al., 27 Apr 2025) and BinGo (He et al., 2023) include binaries compiled at multiple optimization levels (O0–O3, Os) and with different compilers (gcc/clang), reflecting the substantial impact of compilation artifacts on binary representations. This diversity enables robust testing of patch detection and similarity algorithms under challenging real-world conditions.

PPT4J (Pan et al., 2023) reproduces Java library binaries via authentic build scripts, ensuring direct relevance to the formats and workflows of real deployment scenarios. Including multiple data modalities (pseudo-code, assembly, bytecode), as in Li et al. (7 Sep 2025), further increases dataset coverage.

6. Applications and Benchmarking

Binary patch datasets primarily serve as benchmarks for:

Binary vulnerability discovery
Binary function similarity detection
Plagiarism detection in binary code
Patch presence and security compliance testing
Program analysis via static and dynamic frameworks (e.g., angr)
Machine learning models for binary classification, clone detection, and patch detection

Researchers employ these datasets to evaluate and compare tools by accuracy, precision, recall, and F1-score. For example, PS³ achieved 0.82 precision, 0.97 recall, and 0.89 F1-score on its curated dataset (Zhan et al., 2023). PPT4J reported an F1-score of 98.5% and an in-the-wild accuracy of 89.7% with zero false positives (Pan et al., 2023). BinGo demonstrated overall accuracy of 80.77% and F1-score of 0.759, with detailed breakdowns across configurations (He et al., 2023). BinPool is positioned as a resource for future benchmark extensions (Arasteh et al., 27 Apr 2025).
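For readers who want to relate these metrics to one another, the sketch below computes precision, recall, and F1 from confusion counts and checks that the PS³ figures quoted above are mutually consistent. The function names are illustrative, and the confusion counts in the second call are made-up inputs, not results from any cited paper.

```python
from typing import Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Derive all three metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Consistency check on the reported PS³ precision and recall.
print(round(f1_from(0.82, 0.97), 2))  # 0.89

# Made-up counts, for illustration only.
print(precision_recall_f1(tp=8, fp=2, fn=0))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why 0.82 precision and 0.97 recall give 0.89 rather than their arithmetic mean.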

7. Limitations and Future Directions

While current datasets cover substantial CVE/CWE diversity, several limitations remain. BinPool (Arasteh et al., 27 Apr 2025) notes that some CWE classes have too few training samples, and suggests ongoing expansion and the addition of richer metadata (inter-procedural flows, symbolic error traces). BinGo (He et al., 2023) observes compiler- and optimization-induced discrepancies in detection accuracy, advising further study of normalization and cross-compilation invariance. The empirical study (Li et al., 7 Sep 2025) highlights the difficulty vanilla LLMs have with binary patch detection and indicates that continued research into fine-tuned representations and hybrid reasoning approaches is warranted.

Further directions include integration of more challenging artifacts (closed-source binaries, advanced compilation strategies), expansion of modality coverage (adding symbolic traces, dynamic execution), and enhancement of ground-truth mapping granularity. Continuous updates and richer annotations will strengthen these datasets' utility across binary security research.

A binary patch dataset is vital infrastructure for the development, evaluation, and benchmarking of binary analysis and security tools. By providing well-labeled, diverse, and realistic samples across multiple compilation settings, such datasets facilitate advances in vulnerability detection, semantic patch testing, binary similarity, and learning-based security research.