LLMs for Binary Code Understanding
- Large Language Models for Binary Code Understanding are deep neural networks that semantically interpret compiled binaries to enable reverse engineering, vulnerability detection, and decompilation.
- Benchmark-driven studies show that using decompiled pseudo-code markedly boosts performance metrics like F1 and BLEU scores in tasks such as function name recovery and code summarization.
- Integrating prompt engineering, domain-specific fine-tuning, and targeted data augmentation leads to significant improvements in vulnerability detection, reverse engineering efficiency, and patch classification.
LLMs have become central to the advancement of automated binary code analysis—a domain encompassing reverse engineering, vulnerability discovery, patch analysis, and supply chain security verification—by enabling semantic interpretation of compiled binaries where source code is unavailable or stripped of symbolic information. This research area investigates the fundamental capabilities, limitations, and implementation methodologies of LLMs in diverse binary code tasks, including summarization, decompilation, function signature recovery, malware detection, and vulnerability assessment.
1. Foundational Benchmarks and Evaluation Paradigms
Establishing standardized, realistic benchmarks is critical to evaluating LLM performance in binary code understanding. Several prominent benchmarks have emerged:
- BinSum (Jin et al., 2023): Comprises 557,000 binary functions from 44 projects, annotated with ground-truth summaries. Each function is represented in four modalities: raw bytes, assembly, intermediate representation (IR), and decompiled code from multiple decompilers. Enables large-scale evaluation of binary summarization with ground truth derived from developer comments.
- BinMetric (Shang et al., 12 May 2025): Offers 1,000 expert-crafted binary analysis scenarios spanning six canonical tasks—call-site reconstruction (CSR), decompilation (DEC), signature recovery (SR), summarization (BCS), algorithm classification (AC), and assembly synthesis (AIG)—from 20 open-source projects.
- Other specialized datasets: DeBinVul (Manuel et al., 7 Nov 2024) targets binary vulnerability analysis with 150,872 decompiled samples and annotated task variants; binary patch datasets for security patch detection (Li et al., 7 Sep 2025); supply chain diff summarization with malware injection (Udeshi et al., 28 Sep 2025); and open-source decompilation datasets (Tan et al., 8 Mar 2024).
These benchmarks emphasize multi-architecture and multi-optimization coverage (x86, x64, ARM, MIPS; O0–O3) and robust ground-truth alignment, and they tackle representation challenges for stripped, obfuscated, or symbol-poor binaries.
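As a concrete illustration of what such benchmarks typically contain, the sketch below defines a hypothetical per-function record covering the four modalities used by BinSum-style datasets; the field names and defaults are illustrative assumptions, not the actual dataset schema.

```python
from dataclasses import dataclass

@dataclass
class BinaryFunctionSample:
    """Hypothetical benchmark record (illustrative field names, not the real
    BinSum/BinMetric format)."""
    raw_bytes: bytes           # machine code of the function
    assembly: str              # disassembly listing
    ir: str                    # intermediate representation (e.g., LLVM IR)
    pseudo_code: str           # decompiler output (e.g., Ghidra or IDA pseudo-code)
    ground_truth_summary: str  # reference summary derived from developer comments
    architecture: str = "x86_64"   # x86, x64, ARM, MIPS, ...
    optimization: str = "O2"       # O0-O3

def select_input(sample: BinaryFunctionSample, modality: str = "pseudo_code") -> str:
    """Return the chosen modality as text, ready to be placed in an LLM prompt."""
    if modality == "raw_bytes":
        return sample.raw_bytes.hex()
    return getattr(sample, modality)
```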
2. Key Binary Code Understanding Tasks and Metrics
Binary code understanding via LLMs is operationalized through several core tasks, each evaluated with task-suited metrics:
- Function Name Recovery: Predicting source-level function names from stripped binaries using decompiled code as input. Evaluated via token-level Precision, Recall, and F1-score (standard definitions are given after this list).
- Binary Code Summarization: Generating concise natural language descriptions for binaries, evaluated via BLEU-4, METEOR, ROUGE-L, and, importantly, semantic embedding-based metrics. The semantic similarity metric average-pools token embeddings and then computes cosine similarity between the generated and reference summaries (see the definitions after this list).
- Binary Lifting & Decompilation: Mapping assembly back to high-level, C-like source code. Metrics such as CodeBLEU, which combines n-gram overlap with syntactic (AST) and semantic (data-flow) matching, are employed.
- Algorithm Classification & Assembly Synthesis: Labeling decompiled code with algorithmic categories (simple accuracy), or synthesizing correct assembly from specification (syntax/execution correctness).
- Vulnerability Analysis & Name Recovery: In addition to summary tasks, LLMs are benchmarked on vulnerability identification, classification (CWE assignment), and description—using F1-score and embedding-based similarity.
- Security Patch Detection: Binary-level security patch classification, using accuracy, F1, false positive rate (FPR), and explicit "failure rate" for prompt compliance.
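For reference, the name-recovery and semantic-similarity metrics above follow the standard definitions reconstructed below; exact tokenization schemes and embedding models vary across the cited papers, so these are the commonly assumed forms rather than paper-specific formulas.

```latex
% Token-level function name recovery, with predicted token set T_pred and reference token set T_true
\[
\mathrm{Precision} = \frac{|T_{\mathrm{pred}} \cap T_{\mathrm{true}}|}{|T_{\mathrm{pred}}|}, \qquad
\mathrm{Recall} = \frac{|T_{\mathrm{pred}} \cap T_{\mathrm{true}}|}{|T_{\mathrm{true}}|}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]

% Embedding-based semantic similarity: average-pool the token embeddings h_i of a summary s,
% then compare generated and reference summaries by cosine similarity
\[
\mathbf{e}(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i, \qquad
\mathrm{sim}(s_{\mathrm{gen}}, s_{\mathrm{ref}}) =
\frac{\mathbf{e}(s_{\mathrm{gen}}) \cdot \mathbf{e}(s_{\mathrm{ref}})}
     {\lVert \mathbf{e}(s_{\mathrm{gen}}) \rVert \, \lVert \mathbf{e}(s_{\mathrm{ref}}) \rVert}
\]
```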
3. Input Representations and Their Impact
LLM performance is strongly affected by the representation of binary code provided as input.
- Decompiled Pseudo-code consistently yields superior results over raw bytes, assembly, or intermediate representations (Jin et al., 2023, Li et al., 7 Sep 2025, Shang et al., 30 Apr 2025). The fidelity of pseudo-code to the original source—measured via embedding distances and code naturalness metrics—facilitates better transfer of pretraining knowledge.
- Symbolic Information is critical: The presence of function names and other identifiers dramatically boosts semantic similarity in summary tasks; stripping them can decrease performance by up to 55% (Jin et al., 2023). Embedding analysis reveals pseudo-code and source code cluster closely, but assembly is distant (Li et al., 7 Sep 2025).
- Function Length and Context: Moderate function lengths (roughly 400–2,000 tokens) yield the best LLM outputs; very long functions degrade F1 and BLEU scores (Shang et al., 15 Apr 2024). Supplying context (callers/callees, additional symbols) further improves semantic inference; a prompt-construction sketch follows this list.
- Compiler Optimizations: LLMs are robust to changes in optimization level (O0 through O3), with only 1–1.5% variation in metrics across architectures (Shang et al., 30 Apr 2025).
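A minimal sketch of how these representation and context findings surface in a prompting pipeline: the function below builds a summarization prompt from decompiled pseudo-code, optionally appends caller/callee context, and truncates to a rough token budget. The prompt wording, whitespace-based token counting, and the 2,000-token budget are assumptions chosen to mirror the length findings above, not values prescribed by the cited papers.

```python
def build_summarization_prompt(pseudo_code: str,
                               callers: list[str] | None = None,
                               callees: list[str] | None = None,
                               max_tokens: int = 2000) -> str:
    """Assemble an LLM prompt from decompiled pseudo-code plus optional call context.

    Token counting is approximated by whitespace splitting; a real pipeline
    would use the target model's tokenizer.
    """
    parts = ["Summarize the purpose of the following decompiled function.",
             "Function:", pseudo_code]
    if callers:
        parts += ["Callers (for context):"] + callers
    if callees:
        parts += ["Callees (for context):"] + callees
    prompt = "\n".join(parts)
    tokens = prompt.split()
    if len(tokens) > max_tokens:
        # Crude truncation to stay within the length range that works well.
        prompt = " ".join(tokens[:max_tokens])
    return prompt
```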
4. Model Architectures, Prompt Engineering, and Fine-Tuning
- Instruction/Code-tuned LLMs: Models such as CodeLlama, WizardCoder, DeepSeek-Coder, and the GPT family substantially outperform DL-based expert models (BinT5, NER, HexT5, SymLM) in name recovery and summarization, often by 2–80× in F1/BLEU (Shang et al., 30 Apr 2025, Shang et al., 15 Apr 2024).
- Prompt Engineering: Automated four-step prompt synthesis (meta-prompts, variant generation, LLM-based rewriting, task-specific selection) can optimize LLM outputs in binary summarization (Jin et al., 2023), with zero-shot prompting generally offering better computational efficiency than few-shot or chain-of-thought.
- Fine-Tuning: Binary-domain fine-tuning yields substantial gains:
- Vulnerability identification/classification: Up to +30 percentage points in F1 (Manuel et al., 7 Nov 2024).
- Security patch detection: Fine-tuned pseudo-code models show up to +0.239 F1 over assembly-based counterparts (Li et al., 7 Sep 2025).
- Generalization: Fine-tuned LLMs generalize across architectures and optimizations (performance drops <4%).
- Data Augmentation: Mixing pseudo-code with source code during fine-tuning amplifies accuracy, particularly for small models (Li et al., 7 Sep 2025); a minimal data-mixing sketch follows this list.
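A minimal sketch of the data-mixing idea, assuming (code, label) pairs have already been extracted; the instruction template, 50/50 default mix, and JSONL output format are illustrative choices rather than the setup of any cited paper.

```python
import json
import random

def build_mixed_finetuning_set(pseudo_pairs, source_pairs, source_ratio=0.5, seed=0):
    """Interleave (code, label) pairs from decompiled pseudo-code and original
    source code into a single instruction-tuning set."""
    rng = random.Random(seed)
    n_source = min(int(len(pseudo_pairs) * source_ratio), len(source_pairs))
    mixed = list(pseudo_pairs) + rng.sample(source_pairs, n_source)
    rng.shuffle(mixed)
    return [
        {
            "instruction": "Classify whether this function contains a security patch.",
            "input": code,
            "output": label,
        }
        for code, label in mixed
    ]

def write_jsonl(records, path):
    """Write the mixed set to a JSONL file for a standard SFT pipeline."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```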
5. Application Domains and Specialized Frameworks
LLMs for binary code understanding fuel practical frameworks in:
- Reverse Engineering: Function summarization and signature recovery significantly improve analyst productivity for stripped binaries (Shang et al., 15 Apr 2024).
- Decompilation: Purpose-built models such as LLM4Decompile (Tan et al., 8 Mar 2024) surpass both conventional decompilers (Ghidra) and general-purpose LLMs (GPT-4) in re-compilability and re-executability on code benchmarks; a simplified evaluation sketch follows this list.
- Vulnerability Detection: Fine-tuned LLMs assign CWE categories and pinpoint vulnerabilities within decompiled binaries, achieving 80–90% average F1 on major categories (Manuel et al., 7 Nov 2024).
- Security Patch Detection: Automated detection via LLM requires fine-tuning on pseudo-code; vanilla prompting across 19 code LLMs does not suffice (Li et al., 7 Sep 2025).
- Binary Diff Summarization & Supply Chain Security: Multi-component frameworks leverage LLMs for natural language summarization of binary diffs, introducing the Functional Sensitivity Score (FSS) for function-level triage. This enables automated, high-precision detection of malware injections (e.g., the XZ utils backdoor) (Udeshi et al., 28 Sep 2025).
- Benchmark-Driven Evaluation: BinMetric provides comprehensive coverage of the binary analysis pipeline for systematic evaluation, highlighting strengths and severe limitations in binary lifting and synthesis (Shang et al., 12 May 2025).
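To make the re-compilability and re-executability criteria used by decompilation benchmarks concrete, the sketch below compiles a decompiled function with gcc and, optionally, links it against a unit-test harness and runs it. The compiler invocation and pass/fail logic are simplified assumptions, not the exact harness of LLM4Decompile.

```python
import os
import subprocess
import tempfile

def check_recompilability(decompiled_c: str, compiler: str = "gcc") -> bool:
    """Return True if the decompiled C source compiles without errors."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "func.c")
        obj = os.path.join(tmp, "func.o")
        with open(src, "w") as f:
            f.write(decompiled_c)
        result = subprocess.run([compiler, "-c", src, "-o", obj], capture_output=True)
        return result.returncode == 0

def check_reexecutability(decompiled_c: str, test_harness_c: str) -> bool:
    """Compile the decompiled function together with a test harness and run it;
    a zero exit code is treated as passing (simplified sketch)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "combined.c")
        exe = os.path.join(tmp, "test_bin")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + test_harness_c)
        build = subprocess.run(["gcc", src, "-o", exe], capture_output=True)
        if build.returncode != 0:
            return False
        try:
            run = subprocess.run([exe], capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0
```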
6. Limitations, Trade-offs, and Semantic Challenges
Research identifies persistent technical obstacles:
- Representation Mismatch: LLMs trained exclusively on source code are ill-suited for direct binary analysis; pseudo-code bridges the gap (Li et al., 7 Sep 2025).
- Symbol Dependence: Heavy reliance on function names/identifiers renders summaries highly vulnerable to manipulation; stripping names cuts semantic similarity sharply (Jin et al., 2023).
- Low-level Input Weakness: When presented with assembly/raw bytes/IR, LLMs default to describing low-level operations, largely missing function-level semantics (Jin et al., 2023).
- Efficiency vs. Performance: DL-based models infer much faster than LLMs (0.03s vs. 1–10s per sample) but at the cost of generalization and semantic fidelity (Shang et al., 30 Apr 2025).
- Assembly Synthesis Gap: All evaluated models fail to consistently generate correct, executable assembly from a textual specification (execution correctness near zero) (Shang et al., 12 May 2025).
- Obfuscation and Memory Management: LLMs degrade under code obfuscation or in patch detection targeting memory vulnerabilities (Li et al., 7 Sep 2025).
- Scale and Resource Consumption: Large benchmarks incur substantial computational cost (e.g., 4B inference tokens, >$10k, hundreds of GPU hours) (Jin et al., 2023).
7. Future Directions and Open Challenges
Papers collectively recommend avenues for progress:
- Domain-specific Pretraining: Training LLMs on binary code artifacts—decompiled code, stripped binaries, and assembly—will improve robustness and reduce representation mismatch.
- Improved Symbol Recovery: Progress in heuristic or ML-driven symbol recovery for stripped binaries is essential.
- Refinement & Ensemble Methods: Combining traditional tools (e.g., Ghidra) with LLM refinement models yields additional correctness improvements (Tan et al., 8 Mar 2024).
- Evaluation Metric Innovation: Embedding-based semantic similarity and compilation/execution-based metrics should replace purely textual measures in future benchmarks (Jin et al., 2023).
- Long Context Architectures: Enabling reading and interpretation of whole-program binaries for scalable reverse engineering (Shang et al., 30 Apr 2025).
- Obfuscation and Robustness: Addressing performance loss on obfuscated or symbol-poor code and expanding to multi-modal approaches (dynamic analysis, control-flow information, human annotation) (Shang et al., 30 Apr 2025, Manuel et al., 7 Nov 2024).
- Defensive Tooling: Systems built atop LLMs must defend against manipulation vulnerabilities due to naïve identifier reliance (Jin et al., 2023).
This body of research demonstrates that LLMs, with appropriate input representation and targeted fine-tuning, deliver substantial advances in binary code understanding. Nonetheless, enduring semantic and technical limitations persist, especially in binary lifting, code synthesis, handling of obfuscated or stripped inputs, and overall interpretability, necessitating further development of both models and evaluation strategies for broader adoption in software security, reverse engineering, and automated systems.