Decompile-Bench-Eval: Decompilation Benchmarks

Updated 21 January 2026
  • Decompile-Bench-Eval is a set of standardized benchmark suites and protocols that rigorously measure decompiler performance across various languages and binary formats.
  • It evaluates systems through metrics like re-compilability, re-executability, and structural similarity to compare neural, LLM-based, and commercial decompilers.
  • The framework highlights challenges such as semantic loss from optimizations and identifier recovery while guiding future research in robust decompilation methods.

Decompile-Bench-Eval is a set of formal benchmark suites and evaluation protocols designed to rigorously measure the effectiveness of decompilers—systems that recover high-level source code from low-level binaries, bytecode, or stripped executable programs. Widely adopted in academia and industry, Decompile-Bench-Eval frameworks span multiple programming languages (C/C++, Java, Rust, WebAssembly, EVM, quantum circuits), support a diverse range of binary formats and compilation scenarios, and provide standardized, reproducible methodologies for functional, structural, and human-centric assessment of decompilation systems. These benchmarks anchor the comparison of commercial decompilers, end-to-end neural models, LLM-based tools, and hybrid symbolic-neural pipelines in reverse engineering, malware analysis, binary translation, and code generation contexts.

1. Benchmark Suite Design and Composition

Decompile-Bench-Eval encompasses both contest-style micro-benchmarks and large-scale, real-world corpora. The suite covers:

  • HumanEval and MBPP: 164–1000 hand-written algorithmic problems (originally Python, ported to C/C++), each with unit tests. Binaries are compiled at multiple GCC/Clang optimization levels (O0–O3), producing paired (assembly, original-source) data (Feng et al., 2024, Liu et al., 10 Mar 2025, Tan et al., 19 May 2025).
  • ExeBench and GitHub2025: Thousands to hundreds of thousands of functions extracted from OSS-Fuzz projects or post-2025 GitHub repos. Functions are isolated with coverage sanitization and compiled with various options, emphasizing leakage-resistance in dataset splits (Gao et al., 16 May 2025, Tan et al., 19 May 2025).
  • Java Bytecode: 14 open-source Java projects (N=2041 classes), compiled using multiple JVM compilers, with coverage and structural diversity for bytecode decompilation tools (Harrand et al., 2019).
  • Rust/EVM/WebAssembly: Rust suite exercises generics, traits, concurrency, and error-handling at both debug and release build modes (Zhou, 24 Jul 2025). EVM benchmarks use thousands of Ethereum smart contracts compiled across mainnet and Yul/viaIR pipelines (Lagouvardos et al., 2024). WASM benchmarks synthesize functions (DecFuzzer, PolyBenchC, CHStone) and decompile to C for comparative assessment (Wu et al., 2024).
  • Quantum Circuits: DeQompile targets canonical quantum functions (GHZ, QFT, QPE) in parametric gate representation for reverse-engineering interpretable Qiskit programs (Xie et al., 11 Apr 2025).

Dataset construction typically ensures disjointness between training and evaluation splits and includes features such as DWARF debug info for source-binary alignment.
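The paired-data construction described above can be sketched as follows. The helper name, file names, and flag choices are illustrative assumptions, not any benchmark's released tooling; the key points are one compile per optimization level and `-g` to retain DWARF info for source-binary alignment.

```python
# Sketch: enumerate the gcc/objdump invocations that would yield one
# (assembly, original-source) pair per optimization level O0-O3.
# `pair_commands` is a hypothetical helper for illustration.

def pair_commands(src: str, levels=("O0", "O1", "O2", "O3")):
    """Return (compile_cmd, disassemble_cmd) pairs for each level."""
    cmds = []
    for lvl in levels:
        obj = f"{src}.{lvl}.o"
        cmds.append((
            ["gcc", f"-{lvl}", "-g", "-c", src, "-o", obj],  # -g keeps DWARF for alignment
            ["objdump", "-d", obj],                          # disassembly pairs with source
        ))
    return cmds

for compile_cmd, disasm_cmd in pair_commands("func.c"):
    print(" ".join(compile_cmd), "&&", " ".join(disasm_cmd))
```

In a real pipeline these commands would be executed (e.g., via `subprocess.run`) and the disassembly stored alongside the original source, with train/eval splits kept disjoint at the project level.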

2. Evaluation Methodologies and Metrics

Decompile-Bench-Eval standardizes the assessment of decompilers using several rigorous metrics:

  • Re-Compilability: Fraction of functions whose decompiled output successfully compiles (e.g., with gcc -std=c17 -O2 -c or Clang) (Feng et al., 2024, Harrand et al., 2019).
  • Re-Executability: Measures functional correctness by executing recompiled output against the original unit tests; only code passing all test cases is credited (Feng et al., 2024, Tan et al., 19 May 2025). Formula:

$\text{Re-Executability} = \frac{\#\{\text{functions that compile and pass all tests}\}}{\#\{\text{total functions}\}} \times 100\%$

  • Test Case Pass Rate (TCP): Fraction of individual test cases passed across all functions:

$\mathrm{TCP} = \frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \sum_{j=1}^{T_i} \mathbf{1}\left[ \mathrm{out}_{\mathrm{gen}}^{(i,j)} = \mathrm{out}_{\mathrm{ref}}^{(i,j)} \right] \times 100\%$

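The two metrics above can be computed directly from per-function results. The record layout below is an assumption chosen for illustration; the formulas match the definitions given.

```python
# Illustrative computation of Re-Executability and Test Case Pass rate (TCP).
# Each record: did the decompiled output recompile, and how many of its
# reference test cases did the recompiled binary pass?

def re_executability(results):
    """Percent of functions that recompile AND pass all their tests."""
    ok = sum(1 for r in results if r["compiles"] and r["passed"] == r["total"])
    return 100.0 * ok / len(results)

def tcp(results):
    """Percent of individual test cases passed across all functions."""
    total_tests = sum(r["total"] for r in results)
    passed = sum(r["passed"] for r in results if r["compiles"])
    return 100.0 * passed / total_tests

results = [
    {"compiles": True,  "passed": 3, "total": 3},  # fully correct
    {"compiles": True,  "passed": 1, "total": 3},  # partially correct
    {"compiles": False, "passed": 0, "total": 2},  # fails to recompile
]
print(re_executability(results))  # 1 of 3 functions fully pass
print(tcp(results))               # 4 of 8 test cases pass -> 50.0
```

Note that TCP credits partial correctness (the second function contributes one passing test), whereas Re-Executability is all-or-nothing per function.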
Evaluations frequently incorporate ablations, e.g., end-to-end vs. step-by-step alignment, CFG/data-mapping prompt augmentation, or reward schemes for structure and identifier phases (Feng et al., 2024, Liu et al., 10 Mar 2025, Tan et al., 26 Sep 2025).

3. Notable Benchmark Outcomes and Comparative Analysis

Benchmarking reveals pronounced advances in LLM-driven decompilation, especially with structured context, fine-grained alignment, and quality-aware fine-tuning:

| Model/Method | Re-Exec (HumanEval, O0–O3) | Notable Metric/Comment |
|---|---|---|
| GPT-4o (prompt) | 14.6% | Baseline general-purpose LLM (Feng et al., 2024) |
| DeepSeek-chat | 6.6% | LLM baseline |
| llm4decompile-6.7B | 47.68% | Prior SOTA |
| +FAE | 52.28% | Statement-level alignment (Feng et al., 2024) |
| +sc²dec | 51.52% | Self-constructed context (Feng et al., 2024) |
| +FAE+sc²dec | 55.03% | SOTA with both alignment/context (Feng et al., 2024) |
| ReF Decompile | 61.43% (SOTA) | Relabeling + function-call strategies (Feng et al., 17 Feb 2025) |
| SK2Decompile | 69.00% (SOTA) | Two-phase RL with structure + identifier rewards (Tan et al., 26 Sep 2025) |
| SALT4Decompile | TCP = 70.4% (SOTA) | Source-level logic tree abstraction, robust to obfuscation (Wang et al., 18 Sep 2025) |
| SLaDe (x86/O3) | 66% | Transformer + type inference (ExeBench) (Armengol-Estapé et al., 2023) |
| CodeInverter Suite | 41.5% (CIM-1.3B) | CFG/data-mapping augmented prompts (HumanEval-64) (Liu et al., 10 Mar 2025) |
| Ghidra (rule-based) | 20.12% | Industry decompiler |

LLM-based tools that incorporate explicit control-flow, source-binary alignment, or dual-phase RL pipelines consistently outperform vanilla LLMs and commercial decompilers in functionality and readability. However, commercial engines (Hex-Rays, Ghidra) still lead in coverage equivalence and recompilation robustness for production binaries (Gao et al., 16 May 2025).

4. Technical Innovations in Benchmark-Driven Decompilation

Benchmarked research has introduced several algorithmic and workflow innovations:

  • Self-Constructed Context Decompilation (sc²dec): In-context exemplars generated from first-pass model outputs and recompiled/disassembled code to bridge compiler/version gaps (Feng et al., 2024).
  • Fine-grained Alignment Enhancement (FAE): DWARF-driven pairing of assembly and source code statements, enabling stepwise alignment objectives in neural fine-tuning (Feng et al., 2024).
  • Relabeling and Function Call Preprocessing: Explicit label mapping for control-flow and variable inference, recovering jump targets and data constants (Feng et al., 17 Feb 2025).
  • Two-phase RL Pipelines: Decoupling structure recovery (generic placeholders, compiler-driven rewards) and identifier naming (semantic embedding similarity rewards) (Tan et al., 26 Sep 2025).
  • Source-level Abstract Logic Tree (SALT): Hierarchical abstraction of stable control-flow features, including loop back-edges, providing LLM-guidance for semantic recovery and resilience to obfuscation (Wang et al., 18 Sep 2025).
  • CFG/Data Mapping Prompt Engineering: Augmenting input with control-flow graphs and variable-table mappings for improved function recovery and readability (Liu et al., 10 Mar 2025).
  • D-SCORE and CodeAlign Metrics: Integrated symbolic, semantic, and readability scoring enforcing accuracy before readability feedback, and fine-grained SSA instruction-level alignment (Zou et al., 11 Jun 2025, Dramko et al., 8 Jan 2025).

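The sc²dec idea above can be sketched as a two-pass loop: decompile once, recompile and disassemble the draft, then feed that (assembly, source) pair back as an in-context exemplar. The function below is a hedged sketch, not the authors' implementation; `model` and `compile_to_asm` are placeholders for whatever LLM and toolchain are used.

```python
# Sketch of self-constructed context decompilation (sc2dec).
# `model`: callable prompt -> source text (placeholder for an LLM).
# `compile_to_asm`: callable source -> disassembly, or None if it
# fails to compile (placeholder for a gcc/objdump round trip).

def sc2dec(model, compile_to_asm, asm_input: str) -> str:
    draft = model(f"Decompile:\n{asm_input}")   # first-pass decompilation
    draft_asm = compile_to_asm(draft)           # recompile + disassemble draft
    if draft_asm is None:                       # draft did not compile; keep it
        return draft
    # The (draft_asm, draft) pair is an exemplar generated by the same
    # compiler/version, bridging gaps between training and test binaries.
    prompt = (
        f"Example assembly:\n{draft_asm}\n"
        f"Example source:\n{draft}\n"
        f"Decompile:\n{asm_input}"
    )
    return model(prompt)
```

The second pass sees an exemplar produced by the exact compiler configuration of the target binary, which is what lets the approach bridge compiler and version gaps without retraining.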
5. Language and Domain Coverage

Decompile-Bench-Eval spans traditional binary formats and domains:

  • C/C++ and Assembly (x86, ARM, MIPS, x86-64/32): Extensive coverage including ExeBench, HumanEval, MBPP, and GitHub2025 functions with source-binary mapping across optimization levels (Armengol-EstapĆ© et al., 2023, Tan et al., 19 May 2025).
  • Java Bytecode: Syntactic correctness, distortion, and semantic equivalence measured across a large multi-project corpus with multiple JVM compilers (Harrand et al., 2019).
  • Rust: Feature-oriented benchmarks (generics, traits, concurrency, error semantics) at debug and release builds, showing marked fidelity drops under optimization (Zhou, 24 Jul 2025).
  • WebAssembly: DecFuzzer synthetic programs, PolyBenchC/CHStone, with comparison to wasm2c, w2c2, wasm-decompile, Ghidra, RetDec; metrics include Halstead effort and AST similarity (Wu et al., 2024).
  • Ethereum Smart Contracts: EVM and Yul pipelines, emphasizing block completeness, imprecision, and external-call/event coverage; advanced static-analysis context (Shrnkr) significantly outperforms prior symbolic/execution engines (Lagouvardos et al., 2024).
  • Quantum Circuits: Genetic-programming decompilation of OpenQASM to high-level Qiskit with functional, structural, and interpretability scoring (Xie et al., 11 Apr 2025).

6. Limitations, Open Challenges, and Recommendations

Benchmark-driven methodology exposes remaining challenges:

  • Scale and Context: Most benchmarks focus on per-function recovery; multi-file and large-scale whole-program decompilation is prohibitively costly for current models (Tan et al., 19 May 2025).
  • Optimization and Semantic Loss: Aggressive compiler optimizations (inlining, flattening, dead-code elimination) systematically obscure high-level semantics, particularly in Rust and C binaries (Zhou, 24 Jul 2025).
  • Type and Name Recovery: Nontrivial loss of identifier fidelity and type structure persists, especially in large, complex, or deeply optimized code; explicit type inference and data-mapping augmentation are ongoing research (Armengol-EstapĆ© et al., 2023, Liu et al., 10 Mar 2025).
  • Human Judgment and Metric Correlation: While metrics such as R2I and LLM-as-Judge show high agreement with expert evaluations, approximation and domain-specific proxies remain—further user studies and hybrid metrics are recommended (Gao et al., 16 May 2025, Tan et al., 26 Sep 2025, Wang et al., 18 Sep 2025).
  • Obfuscation and Robustness: Control-flow obfuscation (bogus CF, flattening, instruction substitution) dramatically degrades standard methods; explicit logic tree modeling and control-flow abstraction mitigate effects (Wang et al., 18 Sep 2025).
  • Legal and Licensing Constraints: Only permissively licensed code included; closed/proprietary binaries excluded, potentially limiting relevance for some applications (Tan et al., 19 May 2025).

Prominent recommendations for future Decompile-Bench-Eval suites include richer feature categories (cross-language, concurrency, obfuscation), broader metric sets (semantic symbolic equivalence, panic-path recovery, trait preservation), and standardized artifact releases for reproducibility and broad adoption (Harrand et al., 2019, Wu et al., 2024, Zhou, 24 Jul 2025, Tan et al., 19 May 2025).

7. Impact and Future Directions

Decompile-Bench-Eval has become the bedrock of comparative decompiler research, enabling reproducible, multi-dimensional assessment of symbolic, analytic, neural, and LLM-based systems. Modern findings indicate that context-driven, structurally aligned, and RL-optimized pipelines push the frontier of decompilation, yielding dramatic advances in both functional fidelity and code readability. The modularity and extensibility of these frameworks continue to support emerging domains (quantum, Rust, WASM, EVM), algorithmic improvements (two-phase RL, hybrid neural-symbolic), and standardization of evaluation protocols.

Ongoing research will likely address multi-function/project recovery, richer type/name/semantic abstraction, deeper metric instrumentation, enhanced obfuscation resilience, and improved user-centric assessments to further advance the reliability, accuracy, and utility of decompilation science.
