Binary SPD Benchmark for Patch Detection
- Binary SPD Benchmark is a suite of curated datasets and evaluation protocols for detecting security patches in binary code.
- It includes a large-scale open-source dataset and a cross-domain set simulating closed-source patching scenarios, enabling rigorous out-of-distribution evaluation.
- Empirical results highlight superior performance of LLM-based methods using pseudo-code, while revealing challenges in pointer and memory management vulnerability detection.
Binary SPD Benchmark refers to publicly available datasets and evaluation protocols designed for the task of Security Patch Detection (SPD) on binary code artifacts. This benchmark family underpins empirical study of algorithmic advances—most notably, representation learning approaches using LLMs—for distinguishing security-relevant binary patches from non-security changes across real-world, closed-source–style software distributions.
1. Definition and Motivation
Binary Security Patch Detection (SPD) is the problem of determining, given a pair of binary code artifacts (typically pre- and post-patch versions at function granularity), whether a candidate patch corrects a security vulnerability. Letting $B^{pre}$ and $B^{post}$ denote the pre- and post-patch binaries and $\langle f_i^{pre}, f_i^{post}\rangle$ the $i$-th matched, changed function pair, the classifier operates as: $\mathrm{Classifier}\bigl(\langle f_i^{pre},f_i^{post}\rangle\bigr) \in \{\text{security},\text{non-security}\}$, where candidate pairs are extracted by binary diffing tools such as BinDiff (Li et al., 9 Jan 2026). SPD on source code is well studied; binary SPD is necessitated by the prevalence of closed-source software, which typically releases patches only as binaries. Prior SPD methods focus on source-code diffing and are inapplicable to proprietary distributions lacking source availability (Li et al., 7 Sep 2025).
A structured binary SPD benchmark enables the quantitative evaluation and comparison of learning-based SPD approaches—especially LLM-driven and graph-structured model families—under realistic, cross-project, and cross-domain settings.
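The pairwise formulation above can be sketched as a minimal interface. This is an illustrative toy, not the benchmark's actual code: the `FunctionPair` type and the keyword heuristic inside `classify` are assumptions standing in for a learned model.

```python
# Hypothetical sketch of the binary SPD task interface; names and the
# keyword heuristic are illustrative stand-ins for a learned classifier.
from dataclasses import dataclass

@dataclass
class FunctionPair:
    name: str
    pre: str   # disassembly or pseudo-code before the patch
    post: str  # disassembly or pseudo-code after the patch

def classify(pair: FunctionPair) -> str:
    """Toy stand-in: flag pairs whose added tokens mention bounds-check
    vocabulary, returning 'security' or 'non-security'."""
    pre_tokens = set(pair.pre.split())
    added = " ".join(t for t in pair.post.split() if t not in pre_tokens)
    suspicious = ("len", "size", "bounds", "overflow", "check")
    return "security" if any(k in added for k in suspicious) else "non-security"

pair = FunctionPair("foo",
                    "void foo(int n) { buf[n] = 0; }",
                    "void foo(int n) { if (n < size) buf[n] = 0; }")
print(classify(pair))  # → security
```

A real system replaces the heuristic with a model over the pair's representations (assembly, pseudo-code, or CFGs), but the input/output contract is the same.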
2. Dataset Construction and Composition
The primary binary SPD benchmarks currently available consist of two complementary datasets:
a) Large-Scale Benchmark (19,448 Samples) (Li et al., 7 Sep 2025)
- Drawn from five open-source C/C++ projects: Linux kernel, FFmpeg, Git, PHP, Libav.
- Each function-level sample comprises disassembled assembly and decompiled pseudo-code versions, automatically produced via IDA Pro.
- Security patch labels (positive: 8,311, negative: 11,137) transferred from source patches.
- Data is compiled at five optimization levels (O0, O1, O2, O3, Os).
- Example representations:
- Assembly: "mov eax, [ebp+0Ch]"; "add eax, 8h"
- Pseudo-code: "void foo(int size) {...}" → "void foo(int size+8) {...}"
- Train/val/test split: 80:10:10; deduplicated at the commit level.
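An 80:10:10 split with commit-level deduplication can be sketched as follows. This is an assumed implementation: the `"commit"` field name and seed handling are illustrative, not taken from the released benchmark code.

```python
# Illustrative 80:10:10 split that deduplicates at the commit level, so no
# commit contributes samples to more than one split (field names assumed).
import random

def split_by_commit(samples, seed=0):
    commits = sorted({s["commit"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(commits)
    n = len(commits)
    train_c = set(commits[: int(0.8 * n)])
    val_c = set(commits[int(0.8 * n): int(0.9 * n)])
    buckets = {"train": [], "val": [], "test": []}
    for s in samples:
        key = ("train" if s["commit"] in train_c
               else "val" if s["commit"] in val_c
               else "test")
        buckets[key].append(s)
    return buckets

# 100 samples drawn from 10 distinct commits.
samples = [{"commit": f"c{i % 10}", "fn": f"f{i}"} for i in range(100)]
parts = split_by_commit(samples)
# No commit leaks between train and test.
assert not ({s["commit"] for s in parts["train"]}
            & {s["commit"] for s in parts["test"]})
```

Splitting by commit rather than by sample prevents near-duplicate function pairs from the same patch appearing in both train and test.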
b) Cross-Domain, Cross-Project Evaluation Benchmark (1,720 Samples) (Li et al., 9 Jan 2026)
- Function-level diffs extracted from binary patching of closed-source–style projects: ImageMagick (Image), TcpDump (Network), Qemu (Virtualization), Radare2 (Security), Slurm (Cluster).
- 1,010 security patch samples; 710 non-security.
- Compilation across five GCC optimization levels; IDA Pro yields both assembly and pseudo-C (overall decompilation success 74%).
- Test set is project- and domain-disjoint from prior SPD corpora (Linux, FFmpeg, PHP, Git, Libav), enforcing out-of-distribution conditions.
The construction pipeline standardizes binary lifting, diff alignment, and split integrity. All benchmarks are publicly released alongside experimental code (Li et al., 7 Sep 2025).
3. Task Formulation and Evaluation Protocols
Binary SPD, as formally instantiated in these benchmarks, is a function-level, pairwise classification problem. For each function pair detected by binary diffing, models must predict whether the underlying transformation repairs a security vulnerability.
Evaluation metrics adopted across works (Li et al., 7 Sep 2025, Li et al., 9 Jan 2026) include:
- Accuracy: $\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision: $\mathrm{Prec} = \frac{TP}{TP + FP}$
- Recall: $\mathrm{Rec} = \frac{TP}{TP + FN}$
- F1-Score: $F_1 = \frac{2 \times \mathrm{Prec} \times \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$
- False-Positive Rate: $\mathrm{FPR} = \frac{FP}{FP + TN}$
Instruction-following failure rates (fraction of model outputs not conforming to the required prediction format) are also reported for LLMs (Li et al., 7 Sep 2025).
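The metrics above follow directly from confusion-matrix counts; a minimal sketch (function name and return format are illustrative):

```python
# Standard SPD evaluation metrics computed from confusion-matrix counts.
def spd_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    fpr = fp / (fp + tn)  # fraction of non-security patches flagged as security
    return {"acc": acc, "precision": prec, "recall": rec, "f1": f1, "fpr": fpr}

m = spd_metrics(tp=80, tn=70, fp=30, fn=20)
print(round(m["f1"], 3))  # → 0.762
```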
4. Baseline Approaches and Empirical Results
Extensive baselines have been benchmarked on both datasets, covering:
- LLM-based methods (e.g., Qwen2.5-Coder, LLM-Compiler, Yi-Coder, GPT-4o)
- Graph-based models (e.g., BinGo combining assembly-level CFGs and GNNs)
- Source code–adapted SPD models (e.g., PatchRNN, LLMDA)
Finetuned LLMs (via Low-Rank Adaptation) consistently outperform vanilla prompting, especially on pseudo-code representations (Li et al., 7 Sep 2025).
Representative results on the 1,720-sample test benchmark (Li et al., 9 Jan 2026):
| Method | Accuracy | F1 | FPR |
|---|---|---|---|
| StriderSPD (joint-repr. LLM) | 0.854 | 0.885 | 0.293 |
| Yi-Coder-9B-Chat (LLM) | 0.758 | 0.818 | 0.477 |
| LLM4Decompile-9B-v2 (LLM) | 0.542 | 0.557 | 0.348 |
| BinGo (graph CFG+GNN) | 0.610 | 0.751 | 0.944 |
| PatchRNN (source code) | 0.553 | 0.607 | 0.537 |
On the large-scale dataset (Li et al., 7 Sep 2025), pseudo-code–finetuned LLMs achieve the strongest accuracy and F1 among all evaluated baselines.
Pseudo-code consistently outperforms assembly as an input representation, yielding higher average accuracy and F1 and lower FPR across model families (Li et al., 7 Sep 2025). Augmenting training with source-code patch data yields further gains, especially for smaller models.
5. Domain and Complexity Coverage
The benchmarks are characterized by intentional coverage of diverse application domains:
- (Li et al., 7 Sep 2025) covers Operating Systems (Linux), Multimedia (FFmpeg, Libav), Version Control (Git), and Web (PHP).
- (Li et al., 9 Jan 2026) targets Image processing, Networking, Virtualization, Security tooling, and Cluster management.
No domain is shared between the new benchmark test set and the training set, ensuring genuine cross-domain generalization measurement. Patch types span 11 classes and 26 CWEs; buffer overflows account for 67.5% of security patches in (Li et al., 9 Jan 2026), with resource management, control-flow, input validation, and memory management also represented.
Optimization-level statistics reveal function complexity variations (e.g., O0: avg CFG nodes = 95.7, edges = 147.6, pseudo-C tokens = 4,049; O3: nodes = 101.7, edges = 158.3, tokens = 4,628.2).
6. Key Findings and Challenges
The released benchmarks empirically support several domain-specific conclusions:
- Fine-tuned LLMs achieve high binary SPD accuracy only when trained on pseudo-code, due to its semantic proximity to source code (embedding similarity: Dist(src, pseudo) = 0.03 vs. Dist(src, assembly) = 0.37; code naturalness CE: pseudo $0.882$, assembly $0.960$) (Li et al., 7 Sep 2025).
- Prompted, unadapted code LLMs (even at 7B–9B scale) perform poorly (F₁ below 0.60; non-compliance with output formats up to 37% on pseudo-code).
- Graph-structured methods (e.g., BinGo) can achieve high recall (1.00) but at the cost of substantial FPR (0.944).
- Despite strong overall performance, memory-management and pointer-related vulnerabilities remain frequent failure cases for all LLM variants.
- Cross-project and cross-domain generalization is non-trivial: benchmarks enshrine OOD test splits to prevent project or domain leakage.
A plausible implication is that further improvement in binary SPD requires either finer-grained domain adaptation or hybrid models incorporating graph-structural and LLM-style representations.
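The recall/FPR trade-off noted for graph-based methods has a simple degenerate illustration: a classifier that labels every pair "security" attains recall 1.0 at the cost of FPR 1.0. The class counts below are the 1,720-sample benchmark's (1,010 security, 710 non-security); the function itself is illustrative.

```python
# A degenerate always-positive classifier: perfect recall, worst-case FPR.
def always_positive_metrics(n_pos, n_neg):
    tp, fn = n_pos, 0   # every security patch is flagged
    fp, tn = n_neg, 0   # ...and so is every non-security change
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return recall, fpr

print(always_positive_metrics(1010, 710))  # → (1.0, 1.0)
```

This is why the benchmarks report FPR alongside recall: BinGo's recall of 1.00 at FPR 0.944 sits close to this degenerate regime.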
7. Research Opportunities and Benchmark Extensions
Open directions outlined by these benchmarks include:
- Extending datasets to cover other architectures (ARM, MIPS) and programming languages (e.g., Go, Rust) beyond C/C++ (Li et al., 7 Sep 2025).
- Scaling closed-source–style datasets to additional domains and larger codebases.
- Adopting retrieval-augmented generation or domain-aware curriculum learning to improve detection rates for subtle vulnerabilities (memory safety, pointer misuse).
- Investigating specialized architectural adapters to align low-level semantics (assembly, CFG) with LLM token space more effectively (as in StriderSPD) (Li et al., 9 Jan 2026).
All datasets, challenge splits, and relevant experimental code are available for reproducibility and future benchmarking at the respective public repositories (Li et al., 7 Sep 2025, Li et al., 9 Jan 2026).
References
- "Empirical Study of Code LLMs for Binary Security Patch Detection" (Li et al., 7 Sep 2025)
- "StriderSPD: Structure-Guided Joint Representation Learning for Binary Security Patch Detection" (Li et al., 9 Jan 2026)