Binary Patch-Level Classifier

Updated 15 September 2025
  • The paper demonstrates that fine-tuning on pseudo-code significantly improves binary security patch detection, achieving over 91% accuracy with low false positive rates.
  • Binary patch-level classification is a process that analyzes compiled patch pairs using disassembly and decompilation, enabling detection of security-relevant changes.
  • The study applies novel prompting strategies and low-rank adaptation (LoRA) fine-tuning to enhance LLM performance, addressing challenges in handling compiler optimizations.

A binary patch-level classifier is a specialized machine learning model designed to distinguish between two classes—such as security-relevant versus non-security patches—in code or data represented at the patch level. This formulation is crucial in domains where only binary artifacts are available, source code is inaccessible, or changes need to be detected in compiled code rather than directly in textual source. The recent growth in the use of code LLMs has opened new avenues for binary patch-level classification, especially in software security workflows where rapid and reliable characterization of binary patches is critical. Below, the empirical landscape of binary patch-level classification, its data requirements, methods, challenges, and state-of-the-art outcomes are detailed, as established in "Empirical Study of Code LLMs for Binary Security Patch Detection" (Li et al., 7 Sep 2025).

1. Dataset Construction and Representation Levels

To rigorously evaluate binary patch-level classifiers, large-scale patch datasets are required that provide paired pre-patch and post-patch binaries. The cited work constructs such a dataset by compiling binaries from open-source repositories such as ReposVul and PatchDB. Each binary is processed at multiple optimization levels, following an automated multi-optimization strategy. Central to high-accuracy binary patch detection is the choice of representation:

  • Assembly code: Obtained by disassembling binaries (e.g., via IDA Pro). This representation captures low-level instructions and control flow but lacks higher-order semantic structure.
  • Pseudo-code: Generated through decompilation, capturing not only structural but also semantic attributes closer to human-readable source code.

The dataset includes 19,448 labeled patch pairs (8,311 security patches, 11,137 non-security patches), supporting robust multi-project and multi-level evaluation. This large sample size ensures diversity across projects and compilation settings and supports generalization to real-world software artifacts.
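
To make the pipeline concrete, the sketch below shows one way the multi-optimization build and disassembly steps described above could be scripted. The compiler, optimization flags, and objdump disassembly are illustrative stand-ins; the paper's pipeline relies on IDA Pro for disassembly and decompilation, and the pseudo-code extraction step is therefore left as a placeholder.

```python
# Minimal sketch: compile one source file at several optimization levels,
# then extract an assembly representation for each resulting binary.
import subprocess
from pathlib import Path

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def build_and_disassemble(src: Path, out_dir: Path) -> dict[str, str]:
    """Return {optimization level: assembly text} for the compiled binaries."""
    out_dir.mkdir(parents=True, exist_ok=True)
    assembly = {}
    for opt in OPT_LEVELS:
        binary = out_dir / f"{src.stem}{opt}"
        subprocess.run(["gcc", opt, "-o", str(binary), str(src)], check=True)
        dump = subprocess.run(
            ["objdump", "-d", str(binary)],
            capture_output=True, text=True, check=True,
        )
        assembly[opt] = dump.stdout
    return assembly

def decompile(binary_path: Path) -> str:
    """Placeholder for pseudo-code extraction (e.g., a decompiler batch script)."""
    raise NotImplementedError("plug in a decompiler of choice here")
```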

2. Code LLM Evaluation and Prompting Strategies

The empirical paper evaluates 19 open-source code LLMs covering a spectrum from 0.5B to 9B parameters, plus two foundation models, on binary security patch detection (SPD). The task is to classify a given patch pair as security-relevant or not. Three prompting strategies are employed:

  • Zero-shot: The vanilla model is prompted directly to classify the patch.
  • Chain-of-Thought (CoT): The model is encouraged to articulate reasoning before classification.
  • Self-correction: Iterative prompts correct the model’s output.
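
As an illustration only, the snippet below sketches how prompts for the three strategies might be assembled from a pre-/post-patch pair; the wording and the strict answer format are hypothetical and not the paper's exact templates.

```python
# Hypothetical prompt templates for the three strategies evaluated above.
def build_prompt(pre: str, post: str, strategy: str = "zero_shot") -> str:
    diff = f"PRE-PATCH:\n{pre}\n\nPOST-PATCH:\n{post}\n"
    task = "Answer strictly with 'security' or 'non-security'."
    if strategy == "zero_shot":
        return f"{diff}\nIs this patch security-relevant? {task}"
    if strategy == "cot":
        return (f"{diff}\nThink step by step about what the change does, "
                f"then classify it. {task}")
    if strategy == "self_correction":
        return (f"{diff}\nClassify the patch, then review your answer and "
                f"correct it if the evidence does not support it. {task}")
    raise ValueError(f"unknown strategy: {strategy}")
```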

Performance is quantified by accuracy, F1 Score, and False Positive Rate:

| Representation | Accuracy (best model) | F1 Score | False Positive Rate |
|----------------|-----------------------|----------|---------------------|
| Assembly code  | variable across models | lower than pseudo-code | higher than pseudo-code |
| Pseudo-code    | ~0.915 | ~0.897 | ~0.058 |

Vanilla LLMs, even when prompted with sophisticated strategies, frequently fail to adhere to the binary classification task, outputting incomplete or non-binary responses. This limitation underscores the necessity for direct injection of domain knowledge beyond prompting.
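
A minimal, hypothetical post-processing step such as the one below is one way to surface this failure mode: it maps free-form model output to a strict label, or flags the response as invalid when it does not conform to the binary task. The function name and matching rules are illustrative, not taken from the paper.

```python
# Hypothetical normalization of LLM output into a strict binary label,
# with a third bucket for the non-conforming responses discussed above.
def normalize_response(text: str) -> str:
    answer = text.strip().lower()
    # Check the longer label first so "non-security" is not matched as "security".
    if "non-security" in answer or "not security" in answer:
        return "non-security"
    if "security" in answer:
        return "security"
    return "invalid"  # incomplete or non-binary output
```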

3. Fine-Tuning for Binary SPD Domain Knowledge

Given the inadequacy of prompting alone, fine-tuning emerges as a primary strategy for equipping LLMs for binary patch-level classification. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is used to adapt models to the binary SPD task at both assembly and pseudo-code representation levels.

  • Fine-tuning on pseudo-code consistently outperforms fine-tuning on assembly code, achieving higher accuracy, F1 Scores, and lower false positive rates, regardless of compiler optimization level.
  • Augmentation of training data by combining pseudo-code with source code data benefits smaller-scale models, suggesting a practical approach for resource-constrained deployments.
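
The snippet below is a minimal sketch of LoRA adaptation using the Hugging Face peft library; the base model, target modules, and hyperparameters are placeholders rather than the paper's exact configuration, which covers 19 open-source code LLMs.

```python
# Minimal LoRA setup sketch for adapting a code LLM to binary SPD.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # placeholder; any code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```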

4. Algorithmic and Evaluation Foundations

The paper formalizes both the evaluation workflow and key metrics:

  • Automated Compilation (Algorithm 1): Procedures to compile pre- and post-patch binaries at multiple optimization levels and extract representations for LLM input.
  • Metrics (computed as in the helper sketched after this list):
    • Accuracy: (TP + TN) / (TP + TN + FP + FN)
    • F1 Score: 2 · Precision · Recall / (Precision + Recall)
    • False Positive Rate: FP / (FP + TN)
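
A minimal helper computing the three metrics from confusion-matrix counts:

```python
# Compute accuracy, F1, and false positive rate from TP/TN/FP/FN counts.
def spd_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "f1": f1, "false_positive_rate": fpr}
```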

Quantitative comparison using these metrics demonstrates the superiority of fine-tuned LLMs—particularly on pseudo-code—over vanilla models in binary patch detection.

5. Challenges and Practical Implications

The binary patch-level classifier task faces multiple technical challenges:

  • Representation gap: Assembly code is far removed from the source code on which code LLMs are pre-trained, which reduces performance; pseudo-code narrows this gap and lets models exploit their pretraining.
  • Instruction-following: Most vanilla code LLMs lack inherent understanding of the patch-level binary SPD domain and ignore instructions to output strict binary classifications.
  • Compiler and optimization-induced variability: Models fine-tuned on assembly code display variance across optimization levels, while those trained on pseudo-code are more robust.

This suggests that task-specific fine-tuning, especially on semantically richer representations such as pseudo-code, is essential for reliable binary patch classification in settings where source code is unavailable.

6. State-of-the-Art Outcomes and Future Directions

Fine-tuned code LLMs—particularly those adapted via LoRA—achieve state-of-the-art results on the binary SPD task when applied to pseudo-code representations. For example, the LLM4Decompile-9B-v2 model attains accuracy up to 91.5%, F1 scores near 89.7%, and a false positive rate of 5.8% on this difficult classification problem.

A plausible implication is that future work will focus on further improving semantic representation of binary patches, leveraging data augmentation from source-level corpora, and advancing adaptation methods for ever-larger code LLMs. Combining assembly-level inputs with pseudo-code could help models generalize over diverse binaries while maintaining prediction fidelity. Robust benchmarking and dataset expansion—at scale and diversity—will continue to be foundational for high-performance binary patch-level classification.

References (1)

  1. Li et al., "Empirical Study of Code LLMs for Binary Security Patch Detection," 7 Sep 2025.