LLMs for Binary Code Understanding
- Large language models for binary code understanding are transformer-based systems that interpret binaries via assembly, pseudocode, raw bytes, and control flow graphs.
- They use specialized benchmarks and fine-tuning, including LoRA and contrastive learning, to achieve robust performance in tasks like vulnerability detection and malware analysis.
- Advances in LLMs facilitate reverse engineering and security patch detection, yet challenges remain with complex obfuscation and semantic transformations.
LLMs for binary code understanding refer to transformer-based neural architectures trained or adapted to interpret, summarize, classify, and analyze executable code and its representations when source code is unavailable. The domain encompasses reverse engineering, security patch detection, vulnerability classification, malware analysis, decompilation, and software supply chain assurance. LLMs address the semantic gap inherent in binaries, which lack abstract syntax, comments, and symbolic metadata present in source code. Recent advances build upon empirical studies that rigorously benchmark code-focused and generalist LLMs, fine-tuning strategies, and multi-modal data representations, with quantitative performance analyses guiding methodological innovation.
1. Foundations and Representations for Binary Code Understanding
LLMs operate on binary code after transformation to varied intermediate representations, each with distinct semantic utility:
- Assembly Code: Obtained via disassembly; preserves low-level control and data flow detail but is semantically sparse. It is used predominantly for tasks requiring instruction-level fidelity, such as deobfuscation and similarity detection (Jiang et al., 2023).
- Pseudo-Code: Decompiler tools (e.g., IDA Pro) lift binaries to C-like textual forms, enabling LLMs to leverage pretrained source code knowledge. Pseudo-code input demonstrably improves semantic comprehension, function recovery, and patch detection, due to its alignment with typical LLM pretraining data (Li et al., 7 Sep 2025).
- Raw Bytes/IR: Byte and micro-code (IR) forms enable low-level pattern recognition, though effective semantic understanding requires normalization and structurally aware prompts (Jin et al., 2023).
- Control Flow Graphs (CFGs): Function-level CFGs encapsulate dynamic execution structure, aiding in malware analysis and diff summarization pipelines (Udeshi et al., 28 Sep 2025).
Representation selection crucially affects model performance; pseudo-code and rich call-graph context yield improved results in practical reverse engineering and patch detection.
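As a concrete illustration of these representations, the sketch below disassembles a raw byte sequence into assembly text and embeds it in a prompt. The byte string, addresses, and prompt wording are illustrative, and Capstone stands in for whichever disassembler a given pipeline uses; decompiler pseudo-code would be substituted or appended when source-style alignment is desired.

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

# Raw bytes of a trivial x86-64 function: mov eax, 1; ret  (illustrative input)
code = bytes.fromhex("b801000000c3")

md = Cs(CS_ARCH_X86, CS_MODE_64)
lines = [f"{insn.mnemonic} {insn.op_str}".strip() for insn in md.disasm(code, 0x1000)]
assembly = "\n".join(lines)

# The assembly text becomes one input modality; decompiler pseudo-code or a CFG
# summary could be appended as additional context for the model.
prompt = "Summarize the behavior of this x86-64 function:\n" + assembly
print(prompt)
```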
2. Benchmarks, Evaluation Metrics, and Experimental Paradigms
Recent work establishes rigorous, large-scale benchmarks and task suites:
- BinMetric (Shang et al., 12 May 2025): Six binary analysis tasks (call-site reconstruction, decompilation, code summarization, signature recovery, algorithm classification, and assembly synthesis). Evaluation spans 1,000 questions across 20 open-source C projects and multi-architecture binaries.
- BinSum (Jin et al., 2023): 557K binary functions from 44 GNU projects; four levels of representations per function, enabling semantic similarity analysis via embedding-based metrics.
- DeBinVul (Manuel et al., 7 Nov 2024): 150,872 multi-architecture, multi-optimization decompiled functions for vulnerability detection, classification, description, and function recovery.
Metrics are domain-specific, combining text similarity (ROUGE-L, METEOR, CodeBLEU), behavioral correctness (re-executability, recompilability rate), token-level F1 (name and signature recovery), and semantic separation scores (cosine similarity, FSS-driven triage):
| Task | Metric(s) |
|---|---|
| Patch/Vulnerability Detection | Accuracy, F1, FP Rate |
| Code Summarization | BLEU-4, METEOR, ROUGE-L, Semantic Cosine |
| Decompilation | Re-executability Rate, Syntax correctness |
| Algorithm Classification | Accuracy |
| Assembly Generation | Syntax/IO correctness, ROUGE-L |
Such benchmarks enable robust model comparison and establish leaderboards for future research.
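The sketch below shows, assuming the rouge-score and sentence-transformers packages, how a text-overlap metric (ROUGE-L) and an embedding-based semantic similarity can be computed for one reference/candidate summary pair; the embedding model named here is a common default rather than the one used in the cited benchmarks.

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Parses a length-prefixed record from the input buffer."
candidate = "Reads a size-prefixed entry out of the supplied buffer."

# Surface-overlap metric: ROUGE-L F-measure over the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# Embedding-based semantic similarity (cosine), less sensitive to exact wording.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([reference, candidate], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(f"ROUGE-L: {rouge_l:.3f}  semantic cosine: {cosine:.3f}")
```

Behavioral metrics such as re-executability and recompilability, by contrast, require compiling and running the generated code rather than comparing text.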
3. Model Architectures, Fine-Tuning Strategies, and Domain Adaptation
Generic source code LLMs (e.g., CodeLlama, DeepSeek-Coder, StarCoder2) are insufficient for accurate binary tasks when prompted directly, due to inherent domain gaps and poor instruction following (failure rates up to 0.37 for binary security patch detection, SPD) (Li et al., 7 Sep 2025). Crucial advances center on:
- LoRA Fine-Tuning: Efficient injection of binary-specific knowledge, yielding substantial gains for security patch detection and vulnerability analysis (Li et al., 7 Sep 2025, Manuel et al., 7 Nov 2024); a minimal adapter-configuration sketch follows this list.
- Contrastive Learning: Used to align binary and source code latent spaces, fostering improved similarity detection and cross-optimization invariance (Jiang et al., 2023, Zhang, 2022); a generic contrastive-loss sketch appears at the end of this section.
- Preprocessing/Normalization: Placeholder tokenization for labels and addresses, ensuring code is self-contained and representationally uniform (Jiang et al., 2023).
- Sequence-to-Sequence Objectives: In decompilation, loss is calculated exclusively over output (C code), maximizing behavioral correctness and facilitating re-executability (Tan et al., 8 Mar 2024).
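A minimal sketch of LoRA fine-tuning for binary security patch detection, assuming the Hugging Face transformers and peft libraries; the base model, adapter hyperparameters, and prompt format are illustrative assumptions rather than the configurations reported in the cited papers. The label masking at the end mirrors the sequence-to-sequence objective above, computing loss only over the target tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base code LLM; the cited studies evaluate several models of this class.
base = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections; rank/alpha values are illustrative.
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                 target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                 task_type="CAUSAL_LM")
model = get_peft_model(model, cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# One training example: decompiled pseudo-code of a changed function plus a label.
prompt = ("Decompiled pseudo-code of the changed function:\n"
          "<pseudo-code here>\n"
          "Does this change fix a security vulnerability? Answer yes or no.\n")
answer = "yes"
ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
labels = ids.clone()
prompt_len = len(tokenizer(prompt).input_ids)
labels[:, :prompt_len] = -100   # ignore prompt tokens; loss covers the answer only
loss = model(input_ids=ids, labels=labels).loss
loss.backward()                 # gradients flow only into the LoRA adapters
```

In a full pipeline such examples would be batched through a standard training loop; because only the adapter weights are updated, fine-tuning remains tractable on modest hardware.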
Domain adaptation, particularly via integration of source-code augmented data and representation bridging (pseudo-code), is essential for closing the knowledge gap and attaining robust performance across architectures and compiler optimizations.
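For the contrastive-alignment strategy described above, a symmetric InfoNCE-style objective over paired binary and source (or pseudo-code) embeddings is a standard formulation; the sketch below is a generic PyTorch version written under that assumption, not the exact loss of the cited works.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(binary_emb: torch.Tensor,
                               source_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of paired embeddings: row i of binary_emb and
    row i of source_emb encode the same function in two representations."""
    b = F.normalize(binary_emb, dim=-1)
    s = F.normalize(source_emb, dim=-1)
    logits = b @ s.t() / temperature                  # (N, N) cosine similarities
    targets = torch.arange(b.size(0), device=b.device)
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Minimizing such a loss encourages embeddings of the same function to coincide across compilers and optimization levels, which underlies the cross-optimization invariance noted above.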
4. Task-Specific Achievements and Quantitative Results
Fine-tuned LLMs achieve state-of-the-art results on binary understanding benchmarks, especially when leveraging pseudo-code and source-data augmentation:
Security Patch Detection
- LLM4Decompile-9B-v2 (Pseudo-code): Accuracy = 0.915, F1 = 0.897, FP Rate = 0.058
- Qwen2.5-Coder-7B-Instruct (Pseudo-code): Accuracy = 0.877, F1 = 0.851
- Source-augmented pseudo-code (small models): +0.147 Accuracy, +0.187 F1 improvement (Li et al., 7 Sep 2025)
Vulnerability Analysis
- CodeLLaMa, Llama3, CodeGen2 (DeBinVul, fine-tuned): Accuracy 0.85–0.91, F1 0.87–0.94 (Manuel et al., 7 Nov 2024)
- Generalization: Robust results (cosine similarity > 0.78) across architectures/optimizations.
Binary Diff Summarization
- Precision of 0.98 and recall of 0.64 for malware detection in a supply chain attack scenario.
- FSS separation: 3.0-point median difference between malicious and benign, enabling automated triage (Udeshi et al., 28 Sep 2025).
Summarization and Name Recovery
- CodeLlama-34B (Name recovery): F1 = 27.59%
- ChatGPT (Summarization): BLEU-4 = 7.37, METEOR = 28.13, ROUGE-L = 23.80 (Shang et al., 15 Apr 2024)
Decompilation
- LLM4Decompile-6B: Re-compilability 87%, re-executability 21% (vs GPT-4's 14%) (Tan et al., 8 Mar 2024).
Assembly Deobfuscation
- Bogus Control Flow: Low resistance; the strongest LLMs can often deobfuscate it.
- Combined Techniques: Resist all evaluated models; every model failed, marking an upper bound for current LLMs (Tkachenko et al., 26 May 2025).
5. Limitations, Remaining Challenges, and Critical Analyses
Despite notable advances, persistent challenges remain:
- Memory-related vulnerabilities: LLMs struggle to reliably detect security patches dealing with complex memory management (Li et al., 7 Sep 2025).
- Semantic equivalence under program transformations: LLMs fail in 29–41% of cases, even when the prompt names the transformation applied (e.g., copy propagation or constant folding), and these failures carry over to the binary domain (Laneve et al., 31 Mar 2025); a minimal illustration follows this list.
- Obfuscated binaries: Performance drops sharply with loss of symbolic information or sophisticated obfuscation (instruction substitution, control flow flattening); universal failure for combined obfuscation (Tkachenko et al., 26 May 2025).
- Evaluation metric suitability: Standard NLP metrics (BLEU, ROUGE) fail to capture utility for reverse engineers; semantic embedding similarity and task-specific behavioral criteria (e.g., test pass rates) are preferred (Jin et al., 2023, Tan et al., 8 Mar 2024).
- Resource efficiency: Large LLMs outperform small ones, but inference and fine-tuning costs, together with token/context-length constraints, remain practical bottlenecks.
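To make the semantic-equivalence limitation concrete, the two functions below are equivalent under constant folding and copy propagation; the example is illustrative rather than drawn from the cited study's dataset, which reports that models frequently fail to recognize such equivalences.

```python
def before(x: int) -> int:
    a = 2           # constant assignment
    b = a * 3       # copy propagation + constant folding reduce this to 6
    return x + b

def after(x: int) -> int:
    return x + 6    # semantically identical once the constants are folded

# Equivalence holds for all inputs, yet models asked to compare the two
# (or their compiled forms) frequently judge them to differ.
assert all(before(v) == after(v) for v in range(-5, 6))
```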
6. Implications, Applications, and Future Directions
LLMs are poised to become central tools in automated binary code analysis, supporting workflows in security patch vetting, supply chain malware review, reverse engineering, and vulnerability triage. Key implications include:
- Human-AI Collaboration: LLMs reduce expertise barriers for many tasks, but require human guidance for highly obfuscated or semantically complex binaries (Tkachenko et al., 26 May 2025).
- Benchmarking and Dataset Release: Open datasets such as BinMetric, BinSum, and DeBinVul are catalyzing reproducibility and competitive research in binary code understanding.
- Model Innovation: Future work will exploit longer context windows, multi-modal input fusion (combining CFGs and dynamic traces), robust architecture/optimization generalization, and integration with formal code optimization tools for improved semantic reasoning.
- Security Applications: Automated software update vetting (malware detection), patch triage, and vulnerability discovery in closed-source or proprietary systems are now practicable at scale.
LLMs bring tangible semantic advances to binary code comprehension, yet their limits are sharply defined by obfuscation, symbolic sparsity, and deep semantic invariants. Progress in architecture, training paradigms, representation engineering, and metric development will determine future impact across reverse engineering and software supply chain security.