
LLM4Decompile: Neural Binary Decompilation

Updated 21 January 2026
  • LLM4Decompile is a research initiative that uses transformer-based models to decompile binaries into readable, compilable source code.
  • It leverages expansive datasets and benchmarks to boost recompilability, re-executability, and semantic robustness of decompiled outputs.
  • The approach integrates alignment, in-context learning, and CFG augmentation to overcome the limitations of traditional and early neural decompilers.

LLM for Decompilation (LLM4Decompile) refers to a line of research and open-source toolkits centered on using transformer-based LLMs to decompile binary program representations—such as assembly code—back into readable, compilable, and semantically faithful high-level source code (typically in C or Solidity). LLM4Decompile advances decompilation from rule-based heuristics and neural machine translation to scalable, data-rich, and context-aware modeling, delivering improvements across re-executability, readability, and robustness to compiler optimizations. The approach has catalyzed development of benchmarks, datasets, and evaluation methodologies specifically for LLM-backed decompilers.

1. Foundations and Motivation

Traditional decompilers, such as Ghidra, Hex-Rays, or RetDec, operate with hand-engineered control-flow and variable recovery heuristics, producing pseudo-high-level code that is often unreadable or uncompilable due to missing types, naming loss, and compiler optimizations that obscure logic and data layout (Tan et al., 2024). Neural decompilation re-casts this challenge as a translation problem, mapping binary/assembly representations to high-level source, but early methods suffered from limited scalability and robustness, especially on optimized code (Cao et al., 2023, Katz et al., 2019).

LLM4Decompile systematically leverages transformer-based LLMs (e.g., DeepSeek-Coder, Llama3, CodeLlama, GPT-4) pre-trained and fine-tuned on million-scale, architecture-matched, binary-source code pairs (Tan et al., 2024, Tan et al., 19 May 2025). Unlike prior neural or rule-based methods, these models can synthesize semantically rich, readable C code and, when paired with modern datasets and prompt engineering, dramatically increase the rate of producing source code that can be recompiled and functionally matches the original binary.

2. Dataset Engineering and Benchmark Design

A central enabling factor for LLM4Decompile is the construction of robust, real-world aligned training and evaluation datasets. Decompile-Bench (Tan et al., 19 May 2025) and related corpora (Gao et al., 16 May 2025, Manuel et al., 2024) provide millions of function-level binary-source pairs, ensuring:

  • Multi-level compiler optimization coverage (O0–O3), to address code transformations such as inlining, register allocation, and control-flow flattening.
  • Accurate mappings between functions and their source, using DWARF debug traces, tree-sitter parsing, and matching over line coverage with complex deduplication (minhash/LSH).
  • Diversity in software domains, architectures, and code complexity (measured via cyclomatic complexity, Halstead metrics).
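The minhash/LSH deduplication step mentioned above can be sketched in a few lines. The shingle size and permutation count below are illustrative choices, not the values used by Decompile-Bench:

```python
import hashlib
from typing import List, Set

def shingles(tokens: List[str], k: int = 5) -> Set[str]:
    """k-token shingles of a tokenized function body."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(sh: Set[str], num_perm: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a full pipeline, LSH banding over these signatures avoids quadratic pairwise comparison; function pairs whose estimated similarity exceeds a threshold are collapsed to a single representative before training.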

Evaluation benchmarks (Decompile-Eval, MBPP, ExeBench, HumanEval-Decompile) emphasize executable correctness: a decompiler’s output is scored by recompilability, re-executability (test pass/fail), and structured readability (R2I, edit similarity, CodeBLEU, AST-similarity) (Tan et al., 19 May 2025, Tan et al., 2024).

| Dataset | #Pairs | Optimization Levels | Benchmarks |
| --- | --- | --- | --- |
| Decompile-Bench | 2,000,000 | O0–O3 | Decompile-Bench-Eval |
| DeBinVul | 150,872 | O0, O3 (multi-arch) | Binary vulnerability det. |
| ExeBench | >10,000 | O0–O3 | Patch-based/real code |

These resources address data leakage, function inlining, variable renaming, and coverage of complex idioms—the bottlenecks of earlier decompiler learning attempts.

3. LLM Architectures, Training, and Enhancement Strategies

LLM4Decompile models are initialized from open-source code LLMs (e.g., DeepSeek-Coder, Llama3, CodeLlama) and fine-tuned with sequence-to-sequence cross-entropy or translation-objective losses on (assembly, source code) pairs (Tan et al., 2024, Tan et al., 19 May 2025). Model sizes range from 1.3B to 33B parameters, with observed scaling benefits in capturing long-range dependencies and accurate reconstruction of complex control/data flow.

Training regimes include:

  • S2S-only training, where only the source code tokens receive loss, yielding improved recompilability over language modeling alone.
  • End-to-end fine-tuning (e.g., LLM4Decompile-End), directly mapping assembly to C.
  • Refinement strategies (e.g., LLM4Decompile-Ref), where models are further trained on outputs of classical decompilers (e.g., Ghidra) mapped to source code, amplifying post-processing capabilities.
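The S2S-only objective above can be sketched with a minimal label-masking helper, assuming the common convention of an ignore index (e.g. -100 in PyTorch-style cross-entropy) for prompt tokens; token IDs here are illustrative, and real pipelines use the model's tokenizer:

```python
# Sketch of source-only (S2S-style) label masking: the assembly prompt is
# excluded from the loss so gradients come only from source-code tokens.
IGNORE_INDEX = -100  # skipped by cross-entropy in common frameworks (e.g. PyTorch)

def build_labels(asm_ids, src_ids):
    """Concatenate (assembly, source) token IDs and mask the assembly prefix."""
    input_ids = list(asm_ids) + list(src_ids)
    labels = [IGNORE_INDEX] * len(asm_ids) + list(src_ids)
    return input_ids, labels
```

During training, the model still attends over the full assembly prefix, but only source positions contribute to the cross-entropy loss, which is what yields the reported recompilability gains over plain language modeling.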

Further algorithmic enhancement arises from:

  • Fine-grained alignment: statement-level pairing (FAE) using debug info, boosting stepwise alignment during supervised learning (Feng et al., 2024).
  • In-context learning (ICL) with retrieval of similar binary/source pairs as prompt context (ICL4Decomp), or inclusion of explicit optimization rule descriptors relevant to observed binary features (Wang et al., 3 Nov 2025).
  • Self-improving demonstration (sc²dec): via recompilation of LLM outputs and re-injection as demonstration pairs, improving sample efficiency and generalizability (Feng et al., 2024).
  • Explicit augmentation with CFGs or IRs to address control-flow complexity (Wang et al., 18 Sep 2025, Cao et al., 2023).
  • Tools for dataset construction, mapping, and export (e.g., CodableLLM), supporting automated alignment of decompiled and source functions (Manuel et al., 2 Jul 2025).
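As an illustration of the retrieval step behind ICL-style approaches, a toy prompt builder might rank corpus pairs by token overlap; the similarity measure and prompt format here are hypothetical simplifications (systems such as ICL4Decomp use stronger retrieval signals):

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude similarity over whitespace-split instruction tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_icl_prompt(query_asm: str, corpus, k: int = 2) -> str:
    """Retrieve the k most similar (assembly, source) pairs from the corpus
    and prepend them as demonstrations before the query assembly."""
    ranked = sorted(corpus, key=lambda p: token_jaccard(query_asm, p[0]), reverse=True)
    demos = "".join(f"# Assembly:\n{a}\n# Source:\n{s}\n\n" for a, s in ranked[:k])
    return demos + f"# Assembly:\n{query_asm}\n# Source:\n"
```

The self-constructed variant (sc²dec) fills the same demonstration slots with the model's own recompiled outputs instead of retrieved corpus pairs.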

4. Metrics and Experimental Performance

LLM4Decompile’s effectiveness is evaluated with metrics targeting practical usability:

  • Recompilability rate (R_comp): the proportion of decompiled outputs that compile cleanly.
  • Re-executability rate (R_exec): the proportion of outputs that compile and pass the original test suite(s).
  • Readability/idiomaticity: via R2I (Relative Readability Index), CodeBLEU, edit similarity, and human/LLM-as-Judge assessment.
  • Vulnerability identification/classification (DeBinVul): F1 for detection/class, description quality, and semantic gap analysis between source/decompiled binaries (Manuel et al., 2024).
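The two executable-correctness rates can be made concrete with a small aggregation sketch; the `Outcome` record and helper below are illustrative, not part of any released harness:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class Outcome:
    compiled: bool  # compiler (e.g. gcc) exited with status 0 on the decompiled output
    passed: bool    # the recompiled binary passed the original test suite

def rates(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Aggregate recompilability (R_comp) and re-executability (R_exec).
    An output counts toward R_exec only if it also compiled."""
    outcomes = list(outcomes)
    n = len(outcomes)
    r_comp = sum(o.compiled for o in outcomes) / n
    r_exec = sum(o.compiled and o.passed for o in outcomes) / n
    return r_comp, r_exec
```

By construction R_exec ≤ R_comp, which matches the large gaps between the two rates reported below.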

Reported results demonstrate:

  • LLM4Decompile-6B achieves R_comp = 87% and R_exec = 21.4%, outperforming GPT-4o (14% R_exec) by a 52–100% relative margin (Tan et al., 2024).
  • Fine-tuning on Decompile-Bench yields +21.4% to +28.8% improvements in re-executability and +15–30% in other metrics on HumanEval/MBPP (Tan et al., 19 May 2025).
  • Methods combining multi-level alignment (FAE) and self-constructed context (sc²dec) achieve a new state of the art (55.03% R_exec on Decompile-Eval) (Feng et al., 2024).
  • Retrieval-based ICL (ICL4Decomp) improves R_exec by ~40% relative (e.g., from 32.3% to 42.4% on HumanEval O1) over supervised models alone, with further gains from rule-based context (Wang et al., 3 Nov 2025).
  • ReF Decompile's relabeling and function-call augmentation reaches 61.4% R_exec, exceeding comparable end-to-end models (Feng et al., 17 Feb 2025).

| Model | HumanEval R_exec (%) | MBPP R_exec (%) |
| --- | --- | --- |
| GPT-4o | 13.4 | 19.9 |
| Ghidra | 13.6 | 16.0 |
| LLM4Decompile-End | 16.2 | 20.5 |
| LLM4Decompile-DCBench | 20.9 | 24.9 |
| ReF Decompile | 61.4 | – |
| llm4decompile-6.7b + FAE+sc²dec | 55.0 | – |
| ICL4Decomp | 54.3 | – |

Models that integrate context, alignment, or refinement systematically outperform not only base LLMs but also commercial decompilers in understandability, though the latter remain ahead on strict functional correctness in certain industrial benchmarks (Gao et al., 16 May 2025).

5. Specialized Applications and Robustness

LLM4Decompile methods generalize beyond generic C/x86 binaries to:

  • WebAssembly decompilation (e.g., WaDec), achieving over 50% recompilability and re-execution on real-world wat–C pairs while reducing code bloat by 97% relative to the prior state of the art (She et al., 2024).
  • Smart contract reverse engineering (EVM), reconstructing semantically meaningful and readable Solidity from bytecode via IR-guided, LoRA-adapted LLMs (David et al., 24 Jun 2025) and dependency graph–conditioned LLM prompting (SmartHalo) (Liao et al., 15 Jan 2025).
  • Binary vulnerability analysis, where specialized fine-tuning yields sizable improvements (e.g., F1 up to 0.94 for bug identification/classification) and narrows the source-vs-decompiled semantic gap (Manuel et al., 2024).

Systems are also evaluated—or explicitly designed—for robustness to obfuscated binaries (CFG flattening, bogus CF, instruction substitution, etc.) (Wang et al., 18 Sep 2025), with algorithmic abstractions (SALT, graph-IR) delivering measurable increases in test-case pass rates under all tested obfuscation regimes.

6. Open Problems and Ongoing Challenges

Despite the gains, core limitations remain:

  • Semantic fidelity still trails industrial, rule-based decompilers for the strictest functional equivalence—on DecompileBench, LLMs show up to 52.2% lower correctness on branch-coverage metrics, despite higher readability (Gao et al., 16 May 2025).
  • Model hallucination, context fatigue (especially on long or complex functions), and type/inference errors remain practical barriers; explicit symbolic or formal validation is rarely integrated into LLM approaches.
  • Fine-tuning and alignment performance degrade on architectures/languages or optimization passes absent from the training data, though diverse, multi-architecture datasets (DeBinVul) are mitigating this (Manuel et al., 2024, Tan et al., 19 May 2025).
  • Dependency on accurate assembly parsing and pre-processing, as errors introduced here propagate irrecoverably into LLM outputs (Cao et al., 2023).
  • Cross-function, multi-file, and project-scale decompilation remain largely out of reach; most benchmarks and models operate at single-function scope.
  • Real-world application in malware reverse engineering, CI systems, or firmware analysis is limited by the need for human-in-the-loop verification and tooling around LLM pipelines (Pordanesh et al., 2024).

Future research aims at explicit integration of program analysis, retrieval-augmented and rule-informed prompting, and multi-modal data sources (CFGs, IRs, symbolic traces), as well as systematic extension to broader language/ISA coverage and dynamic validation frameworks.

7. Data, Open-Source Assets, and Reproducibility

LLM4Decompile artifacts—pre-trained and fine-tuned models, datasets, and evaluation harnesses—are released openly for community adoption and further research (Tan et al., 2024, Tan et al., 19 May 2025). Public repositories include:

  • GitHub/HuggingFace: source code, full Decompile-Bench dataset (2M pairs), LLM checkpoints, evaluation scripts (Tan et al., 19 May 2025, Tan et al., 2024)
  • Integration tools: CodableLLM for dataset mapping and export (Manuel et al., 2 Jul 2025)
  • Open-source LLMs (DeepSeek-Coder, CodeLlama) as starting points for replication or further fine-tuning

Procedures for reproduction follow standard ML conventions: script-driven fine-tuning (e.g., batch size, learning rate, epochs), well-specified evaluation protocols, fixed random seeds for generation, and reference to canonical data splits.
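A hypothetical sketch of such a script-driven configuration follows; all values are placeholders, and the actual hyperparameters are documented in the released repositories:

```python
# Illustrative fine-tuning configuration fragment (values are placeholders,
# not the settings used by LLM4Decompile).
FINETUNE_CONFIG = {
    "base_model": "deepseek-coder-6.7b",  # open-source starting checkpoint
    "batch_size": 32,
    "learning_rate": 2e-5,
    "epochs": 2,
    "max_seq_len": 4096,
    "seed": 42,                 # fixed seed for reproducible generation
    "data_split": "canonical",  # reference to the published train/eval split
}
```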

LLM4Decompile thus defines a comprehensive, extensible foundation for research and development in binary decompilation with LLMs, demonstrating measurable advances in correctness and readability and setting a baseline for future innovations in neural reverse engineering.
