Decompile-Eval: Decompiler Benchmark
- Decompile-Eval is a unifying evaluation methodology that rigorously measures the functional correctness, readability, and usability of decompilers using standardized protocols.
- It defines core metrics such as re-executability rate and coverage equivalence, enabling quantitative comparisons among LLM-driven, neurosymbolic, and classical approaches.
- The benchmark suite integrates diverse datasets and best practices to ensure reproducible research and drive advancements in automated reverse engineering.
Decompile-Eval is a unifying evaluation methodology and benchmark suite formulated to rigorously assess the functional correctness, readability, and practical usability of automated decompilation systems across diverse programming languages, instruction sets, compiler optimizations, and decompilation paradigms. It is foundational in modern research on LLM-powered, neurosymbolic, and classical static-analysis-based decompilers, providing standardized metrics, datasets, and protocols that enable reproducible and quantitative comparison between tools and algorithms in academic and applied reverse engineering.
1. Definition and Scope
Decompile-Eval specifies both the evaluation task (reconstructing high-level source code from compiled binaries or assembly such that the result is syntactically correct and semantically equivalent) and the protocol by which success is measured. Its influence extends to native binaries (C/C++, Rust), virtual machine code (WASM), DNN executables, and quantum circuits, and enables comparison of neural, rule-based, hybrid, and human-oriented decompiler workflows (Tan et al., 2024, Gao et al., 16 May 2025, Feng et al., 17 Feb 2025, Zhou, 24 Jul 2025, Tan et al., 26 Sep 2025, Wang et al., 18 Sep 2025).
2. Core Metrics and Evaluation Protocols
Decompile-Eval introduces and standardizes a family of metrics that directly quantify decompilation quality.
- Re-Executability Rate ($R_{exec}$):
$R_{exec} = \frac{\#\;\mathrm{outputs\ that\ compile\ and\ pass\ all\ unit\ tests}}{\mathrm{total\;samples}}$
This is the primary functional metric: correct decompilation requires not just recompilation but behavioral equivalence (see the metric-computation sketch after this list).
- Re-Compilability Rate ($R_{comp}$):
The fraction of outputs that compile without error; used to separate syntactic from semantic errors.
- Coverage Equivalence Rate (CER):
Runtime side-effect comparison: the proportion of decompiled functions whose execution under the test corpus matches the branch coverage profile of the original (Gao et al., 16 May 2025).
- Readability Metrics:
- Relative Readability Index (R2I): Weighted function of AST-derived features normalized to $[0, 1]$, indicating idiomatic similarity to human-written code (Tan et al., 26 Sep 2025, Tan et al., 19 May 2025).
- Edit Similarity: $ES = 1 - \frac{d_{\mathrm{Lev}}(\hat{c},\, c)}{\max(|\hat{c}|,\, |c|)}$, with $d_{\mathrm{Lev}}$ the Levenshtein distance between decompiled code $\hat{c}$ and reference code $c$.
- Human or LLM-Judge Elo Ratings: Pairwise code comparison on understandability/structure, aggregated into rankings (Gao et al., 16 May 2025).
- Advanced Measures (context-specific):
- AST Edit Distance and Cyclomatic Complexity Similarity for functional/structural preservation (notably in the WASM domain) (She et al., 2024).
- Quantitative program recovery (token/program accuracy) in early neural models (Fu et al., 2019).
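As a concrete illustration of how these scores aggregate, the minimal Python sketch below computes $R_{exec}$, $R_{comp}$, and mean edit similarity from per-sample outcomes; the `SampleResult` structure and its field names are illustrative assumptions, not part of any released Decompile-Eval tooling.

```python
from dataclasses import dataclass

@dataclass
class SampleResult:
    """Outcome of one decompiled function under the harness (illustrative)."""
    recompiled: bool       # syntactic success: the output compiles
    passed_tests: bool     # semantic success: all unit tests pass
    decompiled_src: str    # decompiler output
    reference_src: str     # ground-truth source

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(candidate: str, reference: str) -> float:
    """ES = 1 - d_Lev / max(|candidate|, |reference|), in [0, 1]."""
    longest = max(len(candidate), len(reference)) or 1
    return 1.0 - levenshtein(candidate, reference) / longest

def summarize(results: list) -> dict:
    """Aggregate per-sample outcomes into benchmark-level metrics."""
    n = len(results)
    return {
        "R_comp": sum(r.recompiled for r in results) / n,
        "R_exec": sum(r.passed_tests for r in results) / n,
        "mean_edit_similarity": sum(
            edit_similarity(r.decompiled_src, r.reference_src)
            for r in results) / n,
    }
```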
3. Datasets, Task Construction, Benchmarks
Decompile-Eval systematically incorporates a range of test data varying in complexity and realism:
| Dataset | Domain(s) | Function Count | Characteristics |
|---|---|---|---|
| HumanEval | C/C++, Rust | 164 | Algorithmic, unit-tested, handpicked |
| MBPP | C/C++ | 923 | Algorithmic, multi-case test harness |
| ExeBench | x86, ARM, etc. | 4000+ | System-level, diverse optimization |
| GitHub2025 | C, C++ | 60,000 | Real-world OSS, post-2025, unseen |
| Real-world OSS | C, C++ | 23,400 | Fuzz-extracted, multi-O-level |
Each source is compiled at multiple optimization levels (O0-O3), often with debug info for alignment, and evaluated in both single-function and multi-file settings. For WASM, HumanEval-X and MBXP adapt the methodology to the C++ and web domains (Fang et al., 2024).
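To make the task-construction step concrete, the hedged Python sketch below compiles a single C source at each optimization level with debug info and emits the disassembly a decompiler under test would consume; the exact flags, tool choices (gcc, objdump), and file layout are assumptions rather than a prescribed pipeline.

```python
import subprocess
from pathlib import Path

OPT_LEVELS = ["O0", "O1", "O2", "O3"]

def build_variants(src_file: Path, out_dir: Path) -> list:
    """Compile one C source at every optimization level, keeping DWARF debug
    info for source alignment, then dump the disassembly that serves as the
    decompiler's input. Returns (opt_level, object_path, asm_path) tuples."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifacts = []
    for opt in OPT_LEVELS:
        obj = out_dir / f"{src_file.stem}_{opt}.o"
        asm = out_dir / f"{src_file.stem}_{opt}.s"
        # -g retains debug info so decompiled output can be aligned with source.
        subprocess.run(
            ["gcc", f"-{opt}", "-g", "-c", str(src_file), "-o", str(obj)],
            check=True)
        # Disassemble the object file; the .text listing is what most
        # assembly-to-source decompilers ingest.
        with open(asm, "w") as f:
            subprocess.run(["objdump", "-d", str(obj)], stdout=f, check=True)
        artifacts.append((opt, obj, asm))
    return artifacts
```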
4. Methodological Best Practices
Best practices for Decompile-Eval, as established in leading studies, include:
- End-to-End Harnessing: Automated routines compile, disassemble, decompile, recompile the result, and run all reference test cases under sandboxing (Tan et al., 2024, Gao et al., 16 May 2025); see the sketch after this list.
- Strict Partitioning: Rigidly separate training, retrieval, and evaluation sets to mitigate data leakage.
- Optimized Compilation Matching: Always apply the same toolchain, flags, and optimization level as the ground-truth binary in both training and evaluation.
- Robust Error Taxonomies: Catalog failure types (syntax, assertion, type, runtime, coverage misses) for actionable diagnosis.
- Human Study Integration: Employ controlled user studies or LLM-based judges for subjective readability when appropriate.
- Configurable Pipelines: Expose fine-grained toggles (e.g., control-flow handling, cast elision) for analyst-in-the-loop experiments (Enders et al., 2022).
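The end-to-end harnessing and error-taxonomy practices above can be combined in a single evaluation step. The Python sketch below is a simplified illustration under stated assumptions: it recompiles one decompiled candidate against its reference test harness, runs it with a coarse timeout standing in for real sandboxing, and maps the outcome onto an illustrative failure taxonomy (the exit-code convention is hypothetical).

```python
import subprocess
from enum import Enum, auto

class Failure(Enum):
    """Coarse failure taxonomy for diagnosis; the categories mirror the list
    above, but the labels and exit-code convention here are illustrative."""
    NONE = auto()       # recompiles and passes all reference tests
    SYNTAX = auto()     # recompilation fails
    ASSERTION = auto()  # a unit-test assertion fails
    RUNTIME = auto()    # crash, signal, or timeout

def evaluate_candidate(decompiled_c: str, test_main_c: str, workdir: str) -> Failure:
    """Recompile one decompiled candidate against its reference test harness,
    run it with a timeout as a stand-in for real sandboxing, and classify."""
    src, exe = f"{workdir}/candidate.c", f"{workdir}/candidate"
    with open(src, "w") as f:
        f.write(decompiled_c + "\n" + test_main_c)
    compile_res = subprocess.run(["gcc", src, "-o", exe], capture_output=True)
    if compile_res.returncode != 0:
        return Failure.SYNTAX
    run = subprocess.run(["timeout", "10", exe], capture_output=True)
    if run.returncode == 0:
        return Failure.NONE
    # Assumed convention: the harness exits with code 1 on an assertion
    # mismatch; any other non-zero status (signals, timeout's 124) is runtime.
    return Failure.ASSERTION if run.returncode == 1 else Failure.RUNTIME
```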
5. Impact on System Design and Research Progress
Decompile-Eval has catalyzed the development and objective assessment of fundamentally new decompiler designs:
- LLM-Driven and Hybrid Systems: LLM4Decompile, SK²Decompile, ReF Decompile, SALT4Decompile, and context-augmented approaches (ICL4Decomp, sc²dec/FAE) have shown relative gains of 16-70% in $R_{exec}$ vs. previous SOTA (Tan et al., 2024, Tan et al., 26 Sep 2025, Feng et al., 17 Feb 2025, Wang et al., 18 Sep 2025, Feng et al., 2024, Wang et al., 3 Nov 2025).
- Neurosymbolic and Two-Stage Pipelines: Integration of static analysis with CoT prompting or explicit logic-tree extraction (StackSight, SALT4Decompile) improves both correctness and code interpretability, especially for WASM and obfuscated binaries (Fang et al., 2024, Wang et al., 18 Sep 2025).
- Robustness Evaluation: Obfuscation, cross-architecture, and quantized/optimized code scenarios inform generalizability and are incorporated in Decompile-Eval (Wang et al., 18 Sep 2025, Li et al., 8 Sep 2025, She et al., 2024).
- Human and Automated Readability Measures: Multi-dimensional metrics align tool development with both functional and analyst-centric needs (Gao et al., 16 May 2025, Tan et al., 19 May 2025, Enders et al., 2022).
6. Limitations, Extensions, and Controversies
Key constraints and open issues in current Decompile-Eval practice include:
- Context Sensitivity: Scaling to global compilation, link-time optimizations, or highly inlined code remains limited; current toolchains focus on function-level equivalence (Wang et al., 3 Nov 2025, Tan et al., 19 May 2025).
- Human Studies vs. Automated Metrics: Some studies suggest subjective readability diverges from semantic correctness, motivating hybrid evaluation schemes (Gao et al., 16 May 2025, Enders et al., 2022).
- Architecture and ABI Coverage: While x86 and C predominate, ongoing work expands Decompile-Eval to ARM, AArch64, RISC-V, Rust, WASM, DNN executables, and quantum circuits (Zhou, 24 Jul 2025, She et al., 2024, Li et al., 8 Sep 2025, Xie et al., 11 Apr 2025).
- Optimizations and Obfuscation Handling: Aggressive compiler optimizations and adversarial obfuscation introduce semantic drift and evaluation ambiguity; advanced symbolic/static analysis and hybrid dynamic/LLM strategies are under active exploration (Wang et al., 18 Sep 2025, Zhou, 24 Jul 2025).
- Potential for Hallucinations: Pure LLM-based approaches can invent logic not present in the binary; alignment strategies (debug info, logic trees, explicit context) are considered best practice (Feng et al., 17 Feb 2025, Tan et al., 26 Sep 2025, Fang et al., 2024).
- Reproducibility and Data Leakage: Careful curation of evaluation sets (post-date training, debug info, no overlap) is essential to valid comparison (Tan et al., 19 May 2025).
7. Outlook and Future Directions
Decompile-Eval is continually evolving to address the needs of emerging domains (e.g., transformer-based DNNs, quantum circuits), new programming languages, and more realistic attack/defense scenarios. Anticipated directions include:
- Closed-Loop Self-Debugging and Repair Loops: Integrating LLM error correction via re-compilation feedback (Wang et al., 18 Sep 2025, Fu et al., 2019); see the sketch after this list.
- Semantic-Aware Metric Development: Moving beyond token or AST similarity to richer models of behavioral and dataflow equivalence (Zhou, 24 Jul 2025, Li et al., 8 Sep 2025).
- Multi-Function and Project-Scale Decompilation: Handling global program properties, linking, and cross-file optimization artifacts in the evaluation loop.
- Human-in-the-Loop and Analyst-Tunable Workflows: Configurability and human-centric evaluation to bridge practical reverse engineering with automated pipelines (Enders et al., 2022, Gao et al., 16 May 2025).
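One plausible shape for such a closed-loop repair pipeline is sketched below; `llm_decompile`, `llm_repair`, and `evaluate` are hypothetical callables standing in for a model endpoint and the Decompile-Eval harness, and the retry budget is an arbitrary choice.

```python
def decompile_with_repair(asm: str, test_harness: str,
                          llm_decompile, llm_repair, evaluate,
                          max_rounds: int = 3) -> str:
    """Closed-loop decompilation sketch: generate a candidate, evaluate it end
    to end, and feed the diagnostics back to the model for another attempt.

    Hypothetical interfaces:
      llm_decompile(asm) -> candidate source
      llm_repair(asm, candidate, diagnostics) -> revised candidate
      evaluate(candidate, test_harness) -> (passed: bool, diagnostics: str),
        e.g. compiler errors or failing unit-test output from the harness.
    """
    candidate = llm_decompile(asm)
    for _ in range(max_rounds):
        passed, diagnostics = evaluate(candidate, test_harness)
        if passed:
            return candidate
        candidate = llm_repair(asm, candidate, diagnostics)
    return candidate  # best effort once the retry budget is exhausted
```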
By anchoring evaluation in rigorous, quantitative, and reproducible protocols, Decompile-Eval has become a de facto standard and reference point for the assessment and advancement of modern decompilation research and tooling.