Decompile-Eval: Decompiler Benchmark

Updated 21 January 2026
  • Decompile-Eval is a unifying evaluation methodology that rigorously measures the functional correctness, readability, and usability of decompilers using standardized protocols.
  • It defines core metrics such as re-executability rate and coverage equivalence, enabling quantitative comparisons among LLM-driven, neurosymbolic, and classical approaches.
  • The benchmark suite integrates diverse datasets and best practices to ensure reproducible research and drive advancements in automated reverse engineering.

Decompile-Eval is a unifying evaluation methodology and benchmark suite formulated to rigorously assess the functional correctness, readability, and practical usability of automated decompilation systems across diverse programming languages, instruction sets, compiler optimizations, and decompilation paradigms. It is foundational in modern research on LLM-powered, neurosymbolic, and classical static-analysis-based decompilers, providing standardized metrics, datasets, and protocols that enable reproducible and quantitative comparison between tools and algorithms in academic and applied reverse engineering.

1. Definition and Scope

Decompile-Eval specifies both the evaluation task—reconstructing high-level source code from compiled binaries or assembly such that the result is syntactically correct and semantically equivalent—and the protocol by which success is measured. Its scope extends to native binaries (C/C++, Rust), virtual machine code (WASM), DNN executables, and quantum circuits, and it enables comparison of neural, rule-based, hybrid, and human-oriented decompiler workflows (Tan et al., 2024, Gao et al., 16 May 2025, Feng et al., 17 Feb 2025, Zhou, 24 Jul 2025, Tan et al., 26 Sep 2025, Wang et al., 18 Sep 2025).

2. Core Metrics and Evaluation Protocols

Decompile-Eval introduces and standardizes a family of metrics that directly quantify decompilation quality; a minimal computation sketch follows the list below.

  • Re-Executability Rate ($R_{exec}$):

$R_{exec} = \frac{\#\;\mathrm{outputs\ that\ compile\ and\ pass\ all\ unit\ tests}}{\mathrm{total\;samples}}$

This is the primary functional metric: correct decompilation requires not just recompilation but behavioral equivalence.

  • Re-Compilability Rate ($R_{comp}$):

$R_{comp} = \frac{\#\;\mathrm{outputs\ that\ compile}}{\mathrm{total\;samples}}$

Used to separate syntactic from semantic errors.

  • Coverage Equivalence Rate (CER):

Runtime side-effect comparison: the proportion of decompiled functions whose execution under the test corpus matches the branch coverage profile of the original (Gao et al., 16 May 2025).

  • Readability Metrics:
    • Relative Readability Index (R2I): Weighted function of AST-derived features normalized to $[0,1]$, indicating idiomatic similarity to human-written code (Tan et al., 26 Sep 2025, Tan et al., 19 May 2025).
    • Edit Similarity: $1 - \mathrm{Lev}(d, s)/\max(|d|, |s|)$, with $\mathrm{Lev}$ the Levenshtein distance between decompiled and reference code.
    • Human or LLM-Judge Elo Ratings: Pairwise code comparison on understandability/structure, aggregated into rankings (Gao et al., 16 May 2025).
  • Advanced Measures (context-specific):
    • AST Edit Distance and Cyclomatic Complexity Similarity for functional/structural preservation (notably in the WASM domain) (She et al., 2024).
    • Quantitative program recovery (token/program accuracy) in early neural models (Fu et al., 2019).
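
The re-compilability, re-executability, and edit-similarity metrics defined above can be computed with a straightforward harness. The sketch below is a minimal, illustrative implementation assuming each decompiled output is a C source file paired with a reference source and a unit-test driver that exits with status 0 when all assertions pass; the file layout, sample dictionary keys, and `gcc` invocations are assumptions rather than part of the benchmark specification.

```python
import subprocess
import tempfile
from pathlib import Path


def compiles(c_source: str) -> bool:
    """Re-compilability check: does the decompiled C source compile on its own?"""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        src.write_text(c_source)
        result = subprocess.run(
            ["gcc", "-O0", "-o", str(Path(tmp) / "candidate"), str(src)],
            capture_output=True,
        )
        return result.returncode == 0


def passes_tests(c_source: str, test_driver: str) -> bool:
    """Re-executability check: link against a unit-test driver and run it.

    The driver is assumed to provide main() and exit 0 iff all assertions pass.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src, drv, exe = Path(tmp) / "candidate.c", Path(tmp) / "driver.c", Path(tmp) / "test"
        src.write_text(c_source)
        drv.write_text(test_driver)
        build = subprocess.run(
            ["gcc", "-O0", "-o", str(exe), str(src), str(drv)], capture_output=True
        )
        if build.returncode != 0:
            return False
        try:
            run = subprocess.run([str(exe)], capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(decompiled: str, reference: str) -> float:
    """1 - Lev(d, s) / max(|d|, |s|), as defined in the readability metrics."""
    denom = max(len(decompiled), len(reference)) or 1
    return 1.0 - levenshtein(decompiled, reference) / denom


def evaluate(samples: list[dict]) -> dict:
    """Aggregate R_comp, R_exec, and mean edit similarity over a sample list.

    Each sample is assumed to carry 'decompiled', 'reference', and 'tests' keys.
    """
    n = len(samples)
    r_comp = sum(compiles(s["decompiled"]) for s in samples) / n
    r_exec = sum(passes_tests(s["decompiled"], s["tests"]) for s in samples) / n
    sim = sum(edit_similarity(s["decompiled"], s["reference"]) for s in samples) / n
    return {"R_comp": r_comp, "R_exec": r_exec, "edit_similarity": sim}
```

A production harness would additionally enforce sandboxing and catalog the failure type of each sample (syntax, assertion, runtime), as described under methodological best practices below.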

3. Datasets, Task Construction, Benchmarks

Decompile-Eval systematically incorporates a range of test data varying in complexity and realism:

| Dataset | Domain(s) | Function Count | Characteristics |
|---|---|---|---|
| HumanEval | C/C++, Rust | 164 | Algorithmic, unit-tested, handpicked |
| MBPP | C/C++ | 923 | Algorithmic, multi-case test harness |
| ExeBench | x86, ARM, etc. | 4000+ | System-level, diverse optimization |
| GitHub2025 | C, C++ | 60,000 | Real-world OSS, post-2025, unseen |
| Real-world OSS | C, C++ | 23,400 | Fuzz-extracted, multi-O-level |

Each source is compiled at multiple optimization levels (O0–O3), often with debug info for alignment, and evaluated in both single-function and multi-file settings. For WASM, HumanEval-X and MBXP adapt the methodology to the C++ and web domains (Fang et al., 2024).
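
As a concrete illustration of the multi-optimization-level construction described above, the snippet below compiles one source file at O0 through O3 with debug information and disassembles each binary to produce the decompiler's input. The paths and tool invocations (`gcc`, `objdump`) are illustrative assumptions, not prescribed by the benchmark.

```python
import subprocess
from pathlib import Path

# Hypothetical input/output locations; not prescribed by Decompile-Eval.
SOURCE = Path("samples/sample.c")
OUT_DIR = Path("build")
OUT_DIR.mkdir(exist_ok=True)

for opt in ["O0", "O1", "O2", "O3"]:
    binary = OUT_DIR / f"{SOURCE.stem}_{opt}"
    # Compile with debug info (-g) so functions can later be aligned
    # with their source-level counterparts.
    subprocess.run(["gcc", f"-{opt}", "-g", "-o", str(binary), str(SOURCE)], check=True)
    # Disassemble the binary; the assembly listing is what a decompiler
    # under evaluation receives as input.
    asm = subprocess.run(
        ["objdump", "-d", "--no-show-raw-insn", str(binary)],
        capture_output=True, text=True, check=True,
    ).stdout
    (OUT_DIR / f"{SOURCE.stem}_{opt}.s").write_text(asm)
```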

4. Methodological Best Practices

Best practices for Decompile-Eval, as established in leading studies, include:

  • End-to-End Harnessing: Automated routines compile, disassemble, decompile, recompile the result, and run all reference test cases inside a sandbox (Tan et al., 2024, Gao et al., 16 May 2025); a pipeline sketch follows this list.
  • Strict Partitioning: Rigidly separate training, retrieval, and evaluation sets to mitigate data leakage.
  • Optimized Compilation Matching: Always apply the same toolchain, flags, and optimization level as the ground-truth binary in both training and evaluation.
  • Robust Error Taxonomies: Catalog failure types (syntax, assertion, type, runtime, coverage misses) for actionable diagnosis.
  • Human Study Integration: Employ controlled user studies or LLM-based judges for subjective readability when appropriate.
  • Configurable Pipelines: Expose fine-grained toggles (e.g., control-flow handling, cast elision) for analyst-in-the-loop experiments (Enders et al., 2022).
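
A minimal sketch of the end-to-end harness referenced in the first bullet above: compile the reference source, disassemble it, invoke the decompiler under test, recompile its output against the test driver, and run the tests with a timeout as a lightweight sandbox. The `decompiler_cmd` interface and all paths are assumptions for illustration; real harnesses add stricter isolation (containers, seccomp) and richer error taxonomies.

```python
import subprocess
import tempfile
from pathlib import Path


def run_pipeline(source: Path, test_driver: Path, decompiler_cmd: list[str],
                 opt: str = "O2") -> str:
    """Return a coarse failure/success label for one benchmark sample.

    decompiler_cmd is assumed to read assembly on stdin and emit C on stdout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        binary, recompiled = tmp / "ref", tmp / "recompiled"

        # 1. Compile the ground-truth source at the requested optimization level.
        subprocess.run(["gcc", f"-{opt}", "-o", str(binary), str(source)], check=True)

        # 2. Disassemble to obtain the decompiler's input.
        asm = subprocess.run(["objdump", "-d", str(binary)],
                             capture_output=True, text=True, check=True).stdout

        # 3. Invoke the decompiler under test.
        dec = subprocess.run(decompiler_cmd, input=asm,
                             capture_output=True, text=True)
        if dec.returncode != 0:
            return "decompiler_error"
        (tmp / "out.c").write_text(dec.stdout)

        # 4. Recompile the decompiled code against the reference test driver.
        build = subprocess.run(
            ["gcc", "-o", str(recompiled), str(tmp / "out.c"), str(test_driver)],
            capture_output=True)
        if build.returncode != 0:
            return "syntax_error"          # counts against R_comp

        # 5. Run the unit tests; the timeout serves as a minimal sandbox.
        try:
            test = subprocess.run([str(recompiled)], capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return "timeout"
        return "pass" if test.returncode == 0 else "test_failure"  # drives R_exec
```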

5. Impact on System Design and Research Progress

Decompile-Eval has catalyzed the development and objective assessment of fundamentally new decompiler designs.

6. Limitations, Extensions, and Controversies

Key constraints and open issues persist in current Decompile-Eval practice.

7. Outlook and Future Directions

Decompile-Eval is continually evolving to address the needs of emerging domains (e.g., transformer-based DNNs, quantum circuits), new programming languages, and more realistic attack/defense scenarios. Anticipated directions include:

  • Closed-Loop Self-Debugging and Repair Loops: Integrating LLM error correction via re-compilation feedback (Wang et al., 18 Sep 2025, Fu et al., 2019); a minimal sketch of such a loop follows this list.
  • Semantic-Aware Metric Development: Moving beyond token or AST similarity to richer models of behavioral and dataflow equivalence (Zhou, 24 Jul 2025, Li et al., 8 Sep 2025).
  • Multi-Function and Project-Scale Decompilation: Handling global program properties, linking, and cross-file optimization artifacts in the evaluation loop.
  • Human-in-the-Loop and Analyst-Tunable Workflows: Configurability and human-centric evaluation to bridge practical reverse engineering with automated pipelines (Enders et al., 2022, Gao et al., 16 May 2025).
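
As an illustration of the closed-loop self-debugging direction in the first bullet above, the sketch below feeds compiler and test-failure diagnostics back to a generic language-model client until the candidate passes or a retry budget is exhausted. The `llm_refine` and `compile_and_test` callables are hypothetical placeholders, not interfaces defined by any cited system.

```python
from typing import Callable

# Hypothetical helper types: compile_and_test returns (ok, diagnostics);
# llm_refine returns a revised C candidate given the code plus feedback.
CompileAndTest = Callable[[str], tuple[bool, str]]
LlmRefine = Callable[[str, str], str]


def repair_loop(candidate: str,
                compile_and_test: CompileAndTest,
                llm_refine: LlmRefine,
                max_rounds: int = 3) -> tuple[str, bool]:
    """Iteratively repair a decompiled candidate using re-compilation feedback.

    Each round, compiler errors or failing-test output are handed back to the
    model so it can correct the specific failure it introduced.
    """
    for _ in range(max_rounds):
        ok, diagnostics = compile_and_test(candidate)
        if ok:
            return candidate, True
        candidate = llm_refine(candidate, diagnostics)
    ok, _ = compile_and_test(candidate)
    return candidate, ok
```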

By anchoring evaluation in rigorous, quantitative, and reproducible protocols, Decompile-Eval has become a de facto standard and reference point for the assessment and advancement of modern decompilation research and tooling.
