Decompile-Bench: Million-Scale Binary Functions
- Decompile-Bench is a large-scale dataset of 2 million precisely paired binary and source functions for C/C++, curated from real-world permissively licensed projects.
- Its robust Compile-Trace-Filter pipeline accurately matches functions, removes noise, and deduplicates nearly 100 million raw binary functions to yield high-fidelity pairs.
- The dataset offers comprehensive splits and evaluation benchmarks for LLM decompilation, supporting sequence-to-sequence fine-tuning and contrastive learning applications.
Decompile-Bench is the first open-source, million-scale corpus of precisely paired binary and source functions for C and C++, systematically curated from real-world permissively licensed software with rigorous provenance, deduplication, and function boundary recovery. It is designed for the empirical study and advancement of LLM-based binary decompilation, offering scale, fidelity, and evaluation benchmarks that address the limitations of synthetic or partial datasets in prior art (Tan et al., 19 May 2025).
1. Corpus Scope, Provenance, and Licensing
Decompile-Bench comprises 2,000,000 binary-source function pairs, distilled from an initial collection of approximately 100 million binary functions (~450 GB of compiled artifacts) (Tan et al., 19 May 2025). The underlying source code is drawn from C and C++ repositories in the Stack V2 collection, selected strictly for permissive licensing (MIT, BSD, Apache 2.0, as detected via ScanCode/Blue Oak Council), nontriviality (at least one star and a valid CMakeLists.txt), and public availability.
All binaries are compiled directly from these public repositories under the original licenses, with non-permissive, commercial, or system/external code strictly excluded at both source and header dependency levels. This guarantees both legal clarity and ethical use for academic research. The dataset can be freely used for research under the terms of the original permissive licenses.
2. Data Collection and Compile-Trace-Filter (CTF) Pipeline
The CTF pipeline ensures robust function-level matching and noise suppression across three orchestrated stages:
2.1 Automatic Compilation ("Compile")
Clang is forked and patched to forcibly embed DWARF debug information (with -g) and employ one of four optimization levels (-O0 to -O3) on every invocation. All binaries are built using CMake-driven build systems, with missing dependencies resolved via single-shot LLM queries and recipes cached per-project. This robust environment, applied to 3,961 GitHub repositories, yields ~85,000 binaries and ~100 million raw binary functions.
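The paper achieves forced debug flags by patching Clang itself; the same effect can be approximated at the build-system level. A minimal sketch (helper names and flag-passing strategy are assumptions, not the paper's implementation):

```python
import random
import subprocess

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def cmake_configure_cmd(src_dir, build_dir, opt):
    """Assemble a CMake configure command that forces DWARF debug info (-g)
    alongside one optimization level, approximating the Compile stage."""
    flags = f"-g {opt}"
    return [
        "cmake", "-S", src_dir, "-B", build_dir,
        f"-DCMAKE_C_FLAGS={flags}",
        f"-DCMAKE_CXX_FLAGS={flags}",
    ]

def build_project(src_dir, build_dir, opt=None):
    """Configure and build one repository at a (possibly random) opt level."""
    opt = opt or random.choice(OPT_LEVELS)
    subprocess.run(cmake_configure_cmd(src_dir, build_dir, opt), check=True)
    subprocess.run(["cmake", "--build", build_dir, "-j"], check=True)
```

Patching the compiler, as the authors do, is more robust than injecting flags through CMake, since project build scripts cannot silently override it.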
2.2 Binary-Source Function Matching ("Trace")
DWARF debugging data provides line-level mappings, but inlined and optimized code typically fragments or reorders these source-line links. Decompile-Bench's "Source-Trace" algorithm collects, for each binary function, the full set of DWARF-mapped source locations (func_segment). Using Tree-sitter, it retrieves all enclosing source functions for every location in func_segment and selects the candidate with maximal overlap in line numbers, restoring canonical function boundaries and grouping inlined elements.
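The overlap-maximization step can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the data shapes and names are assumptions:

```python
def source_trace(func_segment, candidates):
    """Select the enclosing source function with maximal line overlap.

    func_segment: set of (file, line) locations that DWARF maps to one
    binary function. candidates: (name, file, start_line, end_line) tuples
    for enclosing source functions, as recovered by a parser such as
    Tree-sitter. Shapes here are illustrative.
    """
    def overlap(cand):
        _, path, start, end = cand
        return sum(1 for f, line in func_segment
                   if f == path and start <= line <= end)
    return max(candidates, key=overlap)
```

Taking the maximum-overlap candidate groups inlined fragments back under the function that contributed most of the mapped lines.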
2.3 Noise Removal ("Filter")
Precise filtering is applied in three tiers:
- Project-scope: discard any source not defined in the target repo (eliminates system/dependency headers, trivial getters/setters).
- In-binary deduplication: for multiple binary functions mapping to the same source (e.g., template instantiations), retain only the function with largest DWARF overlap.
- Cross-binary deduplication: apply MinHash-LSH over disassembled binary and corresponding source to remove near-duplicates globally.
This yields a final dataset of 2 million high-quality, project-rooted function pairs, retaining only approximately 2% of the initial function pool (Tan et al., 19 May 2025).
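The cross-binary deduplication tier rests on MinHash signatures, whose agreement rate estimates Jaccard similarity between token-shingle sets. A toy sketch of the idea (production pipelines band signatures into an LSH index, e.g. via the `datasketch` library, to avoid all-pairs comparison):

```python
import hashlib

def shingles(tokens, k=5):
    """k-token shingles of a disassembled function or its source."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(shingle_set, num_perm=64):
    """Toy MinHash signature: per-seed minimum of salted hashes."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs whose estimated similarity exceeds a threshold are treated as near-duplicates and collapsed.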
3. Dataset Splits and Evaluation Suite (Decompile-Bench-Eval)
Decompile-Bench provides canonical splits for training, validation, and test, based on explicit repository provenance and creation chronology.
- Experimental protocol: In published results, 10% (200,000 pairs) are allocated for training, with all repositories published post-2025 (designated "GitHub2025": 121 repos, ~60,000 functions) strictly held out from training and used exclusively for the final test to preclude data leakage.
- Decompile-Bench-Eval: The companion benchmark suite is constructed for rigorous, non-leaky, and interpretable evaluation. It comprises three disjoint sets, each compiled at all optimization levels:
- HumanEval (C/C++): 164 C/C++ problems manually adapted from the Python HumanEval benchmark plus harnesses.
- MBPP (C/C++): 200 C/C++ problems likewise hand-translated from the Python MBPP suite.
- GitHub2025: 60,000 functions extracted via the CTF pipeline (with identical filtering), from repositories introduced after 2025.
4. Metrics and Performance Results
Multiple metrics are employed to measure decompilation quality: correctness, readability, and textual similarity. Key definitions and absolute results are as follows.
4.1 Re-Executability Rate (Functional Correctness)
Given a decompiled function f', the function is "re-executable" if

    f'(t) = f(t) for all t ∈ T,

where f is the original source function and T is the test input set supplied by HumanEval or MBPP. The re-executability rate over N functions is

    Rate = (1/N) · Σ_{i=1}^{N} 1[f'_i is re-executable].
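In the benchmark this check is performed by compiling and running the decompiled code against the HumanEval/MBPP harnesses; the sketch below abstracts both functions as Python callables to show only the metric's logic:

```python
def re_executable(decompiled, original, tests):
    """f' is re-executable iff it agrees with the original on every test input."""
    return all(decompiled(t) == original(t) for t in tests)

def re_executability_rate(candidates, original, tests):
    """Fraction of decompiled candidates that pass all tests."""
    return sum(re_executable(c, original, tests)
               for c in candidates) / len(candidates)
```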
Main results (averaged over optimization levels):
| Model | HumanEval | MBPP |
|---|---|---|
| LLM4Decompile-End | 16.22% | 20.54% |
| +Fine-tune on DCBench | 20.89% | 24.93% |
| Relative improvement | +28.8% | +21.4% |
4.2 Relative Readability Index (R2I)
R2I ∈ [0,1] quantifies readability via AST-derived features and learned weights; higher scores reflect superior structure, indentation, and identifiers. On GitHub2025 (averaged over -O0 to -O3):
- LLM4Decompile-End: 60.47
- LLM4Decompile-DCBench: 73.18 (+21% relative)
4.3 Edit Similarity
Defined as 1 − EditDist(f', f) / max(|f'|, |f|), measuring normalized edit proximity between the decompiled and original source. On GitHub2025 (average): LLM4Decompile-End yields 21.57%; LLM4Decompile-DCBench, 29.51% (+36.8% relative).
4.4 Additional Metrics
Embedding similarity (CodeSage embeddings + cosine) and CodeBLEU (hybrid n-gram BLEU, AST subtree match, data-flow match) consistently show absolute improvements of ~15-20% when LLM decompilers are fine-tuned on Decompile-Bench.
A plausible implication is that representational quality and functional recoverability of LLM decompilers benefit more from real-world, large-scale pairing than from synthetic or line-level benchmarks.
5. Data Format, Availability, and Applications
Each example in Decompile-Bench encodes:
- asm: the disassembled binary function (with DWARF-resolved symbols removed),
- src: the original C/C++ function (full signature and body),
- project: repository name, optimization level, and relevant build metadata.
Public access is provided via HuggingFace (https://huggingface.co/datasets/LLM4Binary/decompile-bench) and source/metadata/Eval suite via GitHub (https://github.com/albertan017/LLM4Decompile).
Recommended research uses include:
- Sequence-to-sequence fine-tuning of any LLM, encoder-decoder, or transformer-based model for binary-to-source translation or source retrieval,
- Contrastive learning for embedding alignment and retrieval tasks,
- Method development for function boundary recovery, inlining analysis, or noise-robust source matching.
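For the sequence-to-sequence use case, turning one record into a training pair is straightforward. The record below is fabricated to match the schema in Section 5, and the prompt template is purely illustrative, not the paper's:

```python
# A hypothetical record following the asm/src/project schema, and a
# formatter producing (input, target) pairs for seq2seq fine-tuning.
record = {
    "asm": "push rbp\nmov rbp, rsp\nmov eax, edi\nimul eax, edi\npop rbp\nret",
    "src": "int square(int x) { return x * x; }",
    "project": {"repo": "example/repo", "opt": "-O2"},
}

def to_seq2seq(rec):
    """Turn one asm/src pair into a prompt/target pair (template illustrative)."""
    prompt = (f"# Decompile the following {rec['project']['opt']} "
              f"x86 assembly to C:\n{rec['asm']}\n")
    return {"input": prompt, "target": rec["src"]}
```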
6. Ethics and Licensing Considerations
All content is governed by its originating permissive license. No non-permissive or commercial code is included; commercial binaries, often obfuscated, are excluded by design and are not suitable for standard decompilation research. The dataset is strictly intended and recommended for academic research in fields including binary decompilation, reverse engineering, and program understanding (Tan et al., 19 May 2025).
7. Relation to Prior Decompilation Benchmarks and Datasets
Decompile-Bench is distinguished from earlier datasets by both its scale and methodology. Prior datasets, such as those used for Java bytecode decompiler evaluation (Harrand et al., 2019), typically cover orders of magnitude fewer code units (e.g., ~2,000-25,000 classes or functions) and are often constrained to syntactic correctness or partial semantic equivalence. Recent benchmarks like DecompileBench (Gao et al., 16 May 2025) for C/C++ focus on runtime-aware validation (Coverage Equivalence Rate) and LLM-based code understanding assessment over ~23,400 functions. In contrast, Decompile-Bench provides comprehensive binary-source alignment using real-world C/C++ code, incorporates robust inlining and optimization handling, and spans two million function pairs. This suggests that Decompile-Bench is currently the largest resource enabling both large-scale model training and practical, leakage-resistant evaluation for LLM-based decompilation research.
References:
(Tan et al., 19 May 2025): Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation
(Gao et al., 16 May 2025): DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
(Harrand et al., 2019): The Strengths and Behavioral Quirks of Java Bytecode Decompilers