
Decompile-Bench: Million-Scale Binary Functions

Updated 20 March 2026
  • Decompile-Bench is a large-scale dataset of 2 million precisely paired binary and source functions for C/C++, curated from real-world permissively licensed projects.
  • Its robust Compile-Trace-Filter pipeline accurately matches functions, removes noise, and deduplicates nearly 100 million raw binary functions to yield high-fidelity pairs.
  • The dataset offers comprehensive splits and evaluation benchmarks for LLM decompilation, supporting sequence-to-sequence fine-tuning and contrastive learning applications.

Decompile-Bench is the first open-source, million-scale corpus of precisely paired binary and source functions for C and C++, systematically curated from real-world permissively licensed software with rigorous provenance, deduplication, and function boundary recovery. It is designed for the empirical study and advancement of LLM-based binary decompilation, offering scale, fidelity, and evaluation benchmarks that address the limitations of synthetic or partial datasets in prior art (Tan et al., 19 May 2025).

1. Corpus Scope, Provenance, and Licensing

Decompile-Bench comprises 2,000,000 binary–source function pairs, distilled from an initial collection of approximately 100 million binary functions (≈450 GB of compiled artifacts) (Tan et al., 19 May 2025). The underlying source code is drawn from C and C++ repositories in the "Stack V2" collection, selected strictly for permissive licensing (MIT, BSD, Apache 2.0 as detected via ScanCode/Blue Oak Council), nontriviality (at least one star and a valid CMakeLists.txt), and public availability.

All binaries are compiled directly from these public repositories under the original licenses, with non-permissive, commercial, or system/external code strictly excluded at both source and header dependency levels. This guarantees both legal clarity and ethical use for academic research. The dataset can be freely used for research under the terms of the original permissive licenses.

2. Data Collection and Compile-Trace-Filter (CTF) Pipeline

The CTF pipeline ensures robust function-level matching and noise suppression across three orchestrated stages:

2.1 Automatic Compilation ("Compile")

Clang is forked and patched to forcibly embed DWARF debug information (with -g) and to apply one of four optimization levels (-O0 through -O3) on every invocation. All binaries are built using CMake-driven build systems, with missing dependencies resolved via single-shot LLM queries and build recipes cached per project. Applied to 3,961 GitHub repositories, this environment yields ≈85,000 binaries and ≈100 million raw binary functions.
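A minimal Python sketch of such a build matrix is shown below; it assumes a stock Clang and plain CMake flags, and omits the paper's compiler patching and LLM-assisted dependency resolution. All paths and helper names are illustrative.

```python
import subprocess
from pathlib import Path

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def build_repo(repo_dir: Path, out_root: Path) -> None:
    """Build one CMake project once per optimization level, forcing
    DWARF debug info (-g) so each binary function can later be traced
    back to its source lines."""
    for opt in OPT_LEVELS:
        build_dir = out_root / repo_dir.name / opt.lstrip("-")
        build_dir.mkdir(parents=True, exist_ok=True)
        flags = f"-g {opt}"  # -g embeds the DWARF line tables
        subprocess.run(
            ["cmake", "-S", str(repo_dir), "-B", str(build_dir),
             "-DCMAKE_C_COMPILER=clang",
             "-DCMAKE_CXX_COMPILER=clang++",
             f"-DCMAKE_C_FLAGS={flags}",
             f"-DCMAKE_CXX_FLAGS={flags}"],
            check=True)
        subprocess.run(["cmake", "--build", str(build_dir), "--parallel"],
                       check=True)
```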

2.2 Binary–Source Function Matching ("Trace")

DWARF debugging data provides line-level mappings, but inlining and optimization typically fragment or reorder these source-line links. Decompile-Bench's "Source-Trace" algorithm collects, for each binary function f_b, the full set of DWARF-mapped source locations (func_segment). Using Tree-sitter, it retrieves every enclosing source function f_s(ℓ) for each location ℓ in func_segment and selects the candidate with maximal overlap in line numbers, restoring canonical function boundaries and grouping inlined elements.
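A minimal sketch of the overlap-based selection step, assuming the DWARF locations and Tree-sitter enclosing-function lookups have already been extracted (both input structures below are hypothetical simplifications):

```python
from collections import Counter

def match_source_function(func_segment, enclosing_funcs):
    """Select the source function that best explains a binary function.

    func_segment: DWARF-mapped (file, line) locations of one binary function.
    enclosing_funcs: hypothetical lookup from (file, line) to the source
        function (e.g. a Tree-sitter node) whose body contains that line.

    Returns the candidate covering the most DWARF-mapped lines, i.e. the
    maximal-line-overlap selection rule described above.
    """
    votes = Counter()
    for loc in func_segment:
        candidate = enclosing_funcs.get(loc)
        if candidate is not None:
            votes[candidate] += 1
    if not votes:
        return None  # no project-local source function covers these lines
    best_candidate, _ = votes.most_common(1)[0]
    return best_candidate
```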

2.3 Noise Removal ("Filter")

Precise filtering is applied in three tiers:

  • Project-scope: discard any source not defined in the target repo (eliminates system/dependency headers, trivial getters/setters).
  • In-binary deduplication: for multiple binary functions mapping to the same source (e.g., template instantiations), retain only the function with largest DWARF overlap.
  • Cross-binary deduplication: apply MinHash-LSH over the disassembled binary and corresponding source to remove near-duplicates globally (see the sketch below).

This yields a final dataset of 2 million high-quality, project-rooted function pairs—retaining only approximately 2% of the initial function pool (Tan et al., 19 May 2025).
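The cross-binary deduplication tier can be sketched with the datasketch library as follows; the shingle size, Jaccard threshold, and permutation count are illustrative assumptions, as the paper does not publish its exact MinHash parameters.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams over the concatenated asm + source text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def deduplicate(pairs: list, threshold: float = 0.8) -> list:
    """Keep one representative per near-duplicate cluster.

    pairs: dicts with 'asm' and 'src' fields. The threshold, shingle
    size, and permutation count are illustrative, not the paper's.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, pair in enumerate(pairs):
        m = MinHash(num_perm=128)
        for s in shingles(pair["asm"] + pair["src"]):
            m.update(s.encode("utf-8"))
        if lsh.query(m):  # a near-duplicate was already retained
            continue
        lsh.insert(str(idx), m)
        kept.append(pair)
    return kept
```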

3. Dataset Splits and Evaluation Suite (Decompile-Bench-Eval)

Decompile-Bench provides canonical splits for training, validation, and test, based on explicit repository provenance and creation chronology.

  • Experimental protocol: In published results, 10% of the pairs (200,000) are allocated for training, with all repositories published after 2025 (designated "GitHub2025"; 121 repos, ~60,000 functions) strictly held out from training and used exclusively for final testing to preclude data leakage.
  • Decompile-Bench-Eval: The companion benchmark suite is constructed for rigorous, non-leaky, and interpretable evaluation. It comprises three disjoint sets, each compiled at all optimization levels:

    1. HumanEval (C/C++): 164 C/C++ problems manually adapted from the Python HumanEval benchmark, together with test harnesses.
    2. MBPP (C/C++): 200 C/C++ problems likewise hand-translated from the Python MBPP suite.
    3. GitHub2025: 60,000 functions extracted via the CTF pipeline (with identical filtering), from repositories introduced after 2025.

4. Metrics and Performance Results

Multiple metrics are employed to measure decompilation quality: correctness, readability, and textual similarity. Key definitions and absolute results are as follows.

4.1 Re-Executability Rate (Functional Correctness)

Given a decompiled function d with original source s, the function is "re-executable" if

\forall x \in T, \quad s(x) = d(x)

where T is the test input set supplied by HumanEval or MBPP. The re-executability rate is

\text{Rate} = \frac{\text{number of functions passing all tests}}{\text{total functions}}
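A minimal harness for estimating this rate might compile each decompiled candidate against the benchmark's tests and count clean runs; the sketch below assumes the harness is a C file whose main() exits non-zero on any mismatch (a hypothetical convention, not the paper's exact tooling).

```python
import subprocess
import tempfile
from pathlib import Path

def is_reexecutable(decompiled_fn: str, harness: str) -> bool:
    """Compile a decompiled C function against a test harness and
    report whether every test passes.

    harness: C source whose main() runs the HumanEval/MBPP-style tests
    and exits non-zero on any mismatch.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        exe = Path(tmp) / "candidate"
        src.write_text(decompiled_fn + "\n" + harness)
        build = subprocess.run(["gcc", str(src), "-o", str(exe)],
                               capture_output=True)
        if build.returncode != 0:
            return False  # decompiled code does not even compile
        try:
            run = subprocess.run([str(exe)], timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0

# Rate = (number passing all tests) / (total functions) over the suite.
```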

Main results (averaged over optimization levels):

Model                      HumanEval   MBPP
LLM4Decompile-End          16.22%      20.54%
+ Fine-tuned on DCBench    20.89%      24.93%
Relative improvement       +28.8%      +21.4%

4.2 Relative Readability Index (R2I)

R2I ∈ [0, 1] quantifies readability via AST-derived features and learned weights (scores below are scaled to 0–100); higher values reflect superior structure, indentation, and identifiers:

  • LLM4Decompile-End: 60.47
  • LLM4Decompile-DCBench: 73.18 (+21% relative), averaged over GitHub2025 at O0–O3

4.3 Edit Similarity

Edit similarity is defined as

1 - \frac{\text{LevenshteinDistance}(d, s)}{\max(\text{len}(d), \text{len}(s))}

measuring normalized edit proximity. On GitHub2025 (averaged over optimization levels), LLM4Decompile-End yields 21.57% and LLM4Decompile-DCBench 29.51% (+36.8% relative).
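The metric is straightforward to compute; a self-contained Python sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def edit_similarity(d: str, s: str) -> float:
    """1 - normalized edit distance, per the definition above."""
    if not d and not s:
        return 1.0
    return 1.0 - levenshtein(d, s) / max(len(d), len(s))
```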

4.4 Additional Metrics

Embedding similarity (CodeSage embeddings with cosine similarity) and CodeBLEU (hybrid n-gram BLEU, AST subtree match, data-flow match) consistently show absolute improvements of ≈15–20% when LLM decompilers are fine-tuned on Decompile-Bench.
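A sketch of the embedding-similarity computation; the specific CodeSage checkpoint and mean pooling are assumptions for illustration, since the paper states only that CodeSage embeddings and cosine similarity are used.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint id and mean pooling are assumptions, not the paper's setup.
MODEL_ID = "codesage/codesage-small"
tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

def embed(code: str) -> torch.Tensor:
    inputs = tok(code, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model(**inputs)[0]       # (1, seq_len, dim) hidden states
    return hidden.mean(dim=1).squeeze(0)  # mean-pool over tokens

def embedding_similarity(decompiled: str, source: str) -> float:
    return torch.cosine_similarity(embed(decompiled), embed(source), dim=0).item()
```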

A plausible implication is that representational quality and functional recoverability of LLM decompilers benefit more from real-world, large-scale pairing than from synthetic or line-level benchmarks.

5. Data Format, Availability, and Applications

Each example in Decompile-Bench encodes:

  • asm: disassembled binary function (with DWARF-resolved symbols removed),
  • src: original C/C++ function (full signature and body),
  • project: repository name, optimization, and relevant build metadata.

The dataset is publicly available on HuggingFace (https://huggingface.co/datasets/LLM4Binary/decompile-bench); source code, metadata, and the evaluation suite are on GitHub (https://github.com/albertan017/LLM4Decompile).
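Loading a few examples with the HuggingFace datasets library might look like this (the split name is an assumption; the field names follow the format above):

```python
from datasets import load_dataset

# Streaming avoids downloading the full two-million-pair corpus at once.
ds = load_dataset("LLM4Binary/decompile-bench", split="train", streaming=True)

for example in ds.take(3):
    print(example["project"])
    print(example["asm"][:200])  # disassembled binary function
    print(example["src"][:200])  # original C/C++ source function
```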

Recommended research uses include:

  • Sequence-to-sequence fine-tuning of LLMs, encoder–decoder, or other transformer-based models for binary→source translation or source retrieval (see the sketch after this list),
  • Contrastive learning for embedding alignment and retrieval tasks,
  • Method development for function boundary recovery, inlining analysis, or noise-robust source matching.
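For the first of these uses, a record can be flattened into a prompt/completion pair, as sketched below; the instruction template is illustrative, not the one used by LLM4Decompile or the paper.

```python
def to_training_example(record: dict) -> dict:
    """Flatten one Decompile-Bench record into a prompt/completion pair.

    The instruction wording is illustrative only; decompilation models
    such as LLM4Decompile define their own prompt templates.
    """
    prompt = ("# Reconstruct the original C/C++ function "
              "from this disassembly:\n"
              f"{record['asm']}\n# Source:\n")
    return {"prompt": prompt, "completion": record["src"]}
```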

6. Ethics and Licensing Considerations

All content is governed by its originating permissive license. No non-permissive or commercial code is included; commercial binaries, which are often obfuscated, are excluded by design and are in any case unsuitable for standard decompilation research. The dataset is strictly intended for academic research in fields including binary decompilation, reverse engineering, and program understanding (Tan et al., 19 May 2025).

7. Relation to Prior Decompilation Benchmarks and Datasets

Decompile-Bench is distinguished from earlier datasets by both its scale and its methodology. Prior datasets, such as those used for Java bytecode decompiler evaluation (Harrand et al., 2019), typically cover orders of magnitude fewer code units (≈2,000–25,000 classes or functions) and are often constrained to syntactic correctness or partial semantic equivalence. Recent benchmarks like DecompileBench (Gao et al., 16 May 2025) for C/C++ focus on runtime-aware validation (Coverage Equivalence Rate) and LLM-based code-understanding assessment over ~23,400 functions. In contrast, Decompile-Bench provides comprehensive binary–source alignment using real-world C/C++ code, incorporates robust inlining and optimization handling, and spans two million function pairs. This suggests that Decompile-Bench is currently the largest resource enabling both large-scale model training and practical, leakage-resistant evaluation for LLM-based decompilation research.


References:

(Tan et al., 19 May 2025): Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation.
(Gao et al., 16 May 2025): DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios.
(Harrand et al., 2019): The Strengths and Behavioral Quirks of Java Bytecode Decompilers.
