- The paper presents a three-stage Compile-Trace-Filter (CTF) pipeline that filters 100 million raw binary functions down to two million high-quality binary-source function pairs.
- Fine-tuning on the benchmark significantly boosts LLM decompilation performance, yielding over a 20% average improvement in re-executability along with higher code-readability scores.
- The dataset also supports broader binary-analysis tasks, and the paper discusses open challenges in data quality, ethical use, and computational cost.
This paper introduces Decompile-Bench, a novel, large-scale benchmark designed to advance the field of LLM-based binary decompilation. The core problem addressed is the lack of comprehensive datasets providing accurate binary-source function pairs derived from real-world, release-level binaries. Existing datasets are often limited in scale, synthetic, or provide only partial (fragment-level) mappings, hindering the training and evaluation of modern LLM decompilers.
Decompile-Bench comprises two million high-quality function-level binary-source pairs. This dataset was condensed from a raw collection of 100 million binary functions compiled from 450GB of permissively licensed GitHub projects. To create this benchmark, the authors developed a three-stage pipeline called the Compile-Trace-Filter (CTF) framework:
- Automatic Compilation: The authors forked the Clang compiler to enforce consistent optimization levels (-O0 through -O3) and to emit DWARF debug information (-g), addressing the common case where projects ignore standard environment flags. The pipeline also includes automated dependency parsing and installation. (The first sketch after this list illustrates the idea.)
- Trace Binary-Source Correspondence: A "Source-Trace" algorithm is proposed to accurately map compiled binary functions back to their complete original source functions. It leverages DWARF information to find the source lines corresponding to a binary function's code segment, then uses Tree-sitter to parse the source project and identify the full function bodies containing those lines. The candidate source function with the largest overlap with the lines from the binary's DWARF segment is selected as the match. This method helps overcome the fragmentation and reordering caused by compiler optimizations and inlining. (Approximated in the second sketch after this list.)
- Filter Data: A rigorous three-stage filtering process is applied to the initial 100 million raw pairs:
- Project-scope filter: Removes source functions defined in system or dependency headers.
- In-binary deduplicator: For template instantiations or other cases where multiple binary functions map to the same source function within a single binary, only the best match (largest DWARF-segment overlap) is kept.
- Cross-binary deduplicator: Uses MinHash-LSH to eliminate near-duplicate source functions and assembly code across different binaries. (Illustrated in the third sketch after this list.)
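
To make the compilation stage concrete, here is a minimal Python sketch of flag enforcement written as a compiler shim. The authors modify Clang itself; the shim approach, the `FORCE_OPT` environment variable, and the flag handling below are illustrative assumptions, not the paper's implementation.

```python
#!/usr/bin/env python3
"""Hypothetical compiler shim: pointed at by CC, it strips whatever
optimization flags the project's build system passes and appends the
enforced settings (-g for DWARF, one fixed -O level)."""
import os
import subprocess
import sys

# Hypothetical knob selecting the optimization level for this build pass.
OPT_LEVEL = os.environ.get("FORCE_OPT", "-O0")

def main() -> int:
    # Drop project-supplied optimization flags (-O0, -O2, -Os, ...).
    args = [a for a in sys.argv[1:] if not a.startswith("-O")]
    # Re-add the enforced flags so every translation unit is built with
    # debug info and a single, consistent optimization level.
    return subprocess.call(["clang", *args, "-g", OPT_LEVEL])

if __name__ == "__main__":
    sys.exit(main())
```

A build would then be invoked along the lines of `CC=/path/to/shim.py FORCE_OPT=-O2 make`, once per optimization level.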
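The Source-Trace step can be approximated with off-the-shelf tooling. The sketch below uses pyelftools for the DWARF side and the Tree-sitter C grammar for parsing; the paper confirms DWARF and Tree-sitter but not these exact libraries, and for brevity the sketch ignores per-file line disambiguation and inlining records.

```python
"""Minimal sketch of the Source-Trace idea: gather the DWARF source
lines emitted for a binary function's address range, then select the
parsed source function whose body overlaps those lines the most."""
from elftools.elf.elffile import ELFFile   # pip install pyelftools
from tree_sitter import Language, Parser   # pip install tree-sitter (>= 0.22 API)
import tree_sitter_c                       # pip install tree-sitter-c

def dwarf_lines(binary_path: str, lo: int, hi: int) -> set[int]:
    """Source line numbers whose machine code lies in [lo, hi)."""
    lines = set()
    with open(binary_path, "rb") as f:
        dwarf = ELFFile(f).get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lp = dwarf.line_program_for_CU(cu)
            if lp is None:
                continue
            for entry in lp.get_entries():
                st = entry.state
                if st and not st.end_sequence and lo <= st.address < hi:
                    lines.add(st.line)
    return lines

def best_source_function(source: bytes, target_lines: set[int]):
    """Return (function_text, overlap) for the function definition
    covering the most DWARF-reported lines."""
    parser = Parser(Language(tree_sitter_c.language()))
    tree = parser.parse(source)
    best, best_overlap = None, 0
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            # 1-indexed, inclusive line span of this definition.
            span = set(range(node.start_point[0] + 1, node.end_point[0] + 2))
            overlap = len(span & target_lines)
            if overlap > best_overlap:
                best, best_overlap = node.text.decode(), overlap
        stack.extend(node.children)
    return best, best_overlap
```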
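Cross-binary deduplication maps naturally onto a MinHash-LSH index. The sketch below assumes the `datasketch` library, whitespace tokenization, and a 0.85 similarity threshold; all three are illustrative choices, not the paper's settings.

```python
"""Keep one representative per cluster of near-duplicate pairs."""
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():   # crude tokenization, for illustration
        m.update(token.encode("utf-8"))
    return m

def dedup(pairs: list[tuple[str, str]], threshold: float = 0.85):
    """pairs: (source_function, assembly) tuples; first-seen wins."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, (src, asm) in enumerate(pairs):
        m = minhash(src + "\n" + asm)
        if lsh.query(m):         # a near-duplicate was already kept
            continue
        lsh.insert(str(i), m)
        kept.append((src, asm))
    return kept
```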
This filtering process proves critical, reducing the 100 million raw pairs to two million high-quality ones. Analysis shows that the filtered data (Decompile-Bench) has significantly higher Cyclomatic Complexity and Halstead Difficulty than both the unfiltered raw data and the executable subset of the earlier ExeBench benchmark, indicating that it is more representative of real-world code complexity.
For evaluation, the paper introduces Decompile-Bench-Eval. This evaluation suite is designed to be leakage-resistant and includes:
- Manually translated C/C++ versions of the widely used HumanEval and MBPP code-completion benchmarks.
- A new dataset, GitHub2025, consisting of binaries compiled from permissively licensed GitHub repositories published after 2025, ensuring a fresh, unseen dataset for testing.
The evaluation uses several metrics common in decompilation research:
- Re-Executability: Measures whether the decompiled code, when recompiled, produces the same output as the original source code on a given test set; this assesses functional correctness. (A simplified version of this check, together with Edit Similarity, is sketched after this list.)
- R2I (Relative Readability Index): A metric for quantitatively evaluating the readability of decompiled C code based on AST features.
- Edit Similarity: Measures the Levenshtein distance-based similarity between the decompiled output and the original source code.
- Additional metrics explored in the appendix include Embedding Similarity (using CodeSage) and CodeBLEU, which incorporates syntactic and semantic matching.
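
For intuition, here are simplified versions of two of these metrics: a re-executability check that recompiles the decompiled function together with a hypothetical test driver and compares program output, and a plain Levenshtein-based edit similarity. Neither is the paper's exact harness.

```python
"""Simplified metric sketches (not the paper's evaluation harness)."""
import os
import subprocess
import tempfile

def re_executable(decompiled_c: str, test_driver_c: str, expected: str) -> bool:
    """Recompile driver + decompiled function; compare stdout."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "main.c"), os.path.join(d, "a.out")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + test_driver_c)
        if subprocess.run(["gcc", src, "-o", exe]).returncode != 0:
            return False          # does not even recompile
        try:
            run = subprocess.run([exe], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0 and run.stdout == expected

def edit_similarity(a: str, b: str) -> float:
    """1 - Levenshtein(a, b) / max(|a|, |b|), via the classic DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)
```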
Experiments fine-tune the state-of-the-art LLM4Decompile-End model (a 1.3B-parameter model pre-trained on ExeBench) on just 10% of the Decompile-Bench data, producing LLM4Decompile-DCBench. Compared to the baseline LLM4Decompile-End, the resulting model shows an average improvement of over 20% in re-executability on the HumanEval and MBPP benchmarks. It also achieves substantially higher R2I, Edit Similarity, Embedding Similarity, and CodeBLEU scores, especially on the real-world GitHub2025 dataset, highlighting the value of training on real-world data. Ablation studies confirm that the quality filtering is essential: training on the unfiltered raw data degrades performance.
Beyond decompilation training, the authors note that Decompile-Bench can also support other binary-analysis tasks. As a preliminary demonstration, an embedding model trained on 10% of the data for binary-source search achieved 27% recall@1 on the GitHub2025 evaluation set, competitive with state-of-the-art methods; a toy recall@1 computation is sketched below.
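
Assuming row-aligned embedding matrices (binary i's true match is source i), recall@1 reduces to a nearest-neighbor check over cosine similarities; this toy sketch is not the authors' evaluation code.

```python
"""Toy recall@1 for binary-to-source retrieval over paired embeddings."""
import numpy as np

def recall_at_1(bin_emb: np.ndarray, src_emb: np.ndarray) -> float:
    # L2-normalize rows so the dot product equals cosine similarity.
    b = bin_emb / np.linalg.norm(bin_emb, axis=1, keepdims=True)
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    # For each binary, the index of its most similar source function.
    top1 = (b @ s.T).argmax(axis=1)
    return float((top1 == np.arange(len(b))).mean())
```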
The paper acknowledges limitations, including resource constraints that prevented training on the full two million pairs or with larger models, the difficulty and cost of whole-project decompilation, and the ethical and legal questions raised by training on non-permissively licensed data to improve real-world effectiveness.
In conclusion, Decompile-Bench is presented as the first large-scale, publicly available benchmark of real-world function-level binary-source pairs, addressing a critical need for advancing LLM-based decompilation. The CTF framework provides a practical pipeline for generating such data. The experimental results strongly suggest that training on Decompile-Bench leads to significantly improved performance in terms of both functional correctness (re-executability) and code readability, underscoring its potential to accelerate future research and development in this domain. The data and code are released publicly.