
Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation (2505.12668v1)

Published 19 May 2025 in cs.SE

Abstract: Recent advances in LLM-based decompilers have shown them to be effective at converting low-level binaries into human-readable source code. However, the field still lacks a comprehensive benchmark providing large-scale binary-source function pairs, which are critical for advancing LLM decompilation technology. Creating accurate binary-source mappings is difficult because complex compilation settings and widespread function inlining obscure the correspondence between binaries and their original source code. Previous efforts have relied on contest-style benchmarks, on synthetic binary-source mappings that diverge significantly from real-world mappings, or on partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For evaluation purposes, we also developed a benchmark, Decompile-Bench-Eval, including manually crafted binaries from the well-established HumanEval and MBPP, alongside compiled GitHub repositories released after 2025 to mitigate data leakage. We further explore commonly used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench yields a 20% improvement over previous benchmarks in re-executability rate. Our code and data have been released on Hugging Face and GitHub. https://github.com/albertan017/LLM4Decompile


Summary

  • The paper presents the Compile-Trace-Filter (CTF) pipeline, a three-stage framework that condenses 100 million raw binary-source function pairs into two million high-quality pairs.
  • Fine-tuning on the benchmark significantly boosts LLM decompilation performance, yielding over a 20% improvement in re-executability along with higher readability scores.
  • The dataset also supports other binary analysis tasks, and the paper discusses challenges around data quality, ethical use, and computational resources.

This paper introduces Decompile-Bench, a novel, large-scale benchmark designed to advance the field of LLM-based binary decompilation. The core problem addressed is the lack of comprehensive datasets providing accurate binary-source function pairs derived from real-world, release-level binaries. Existing datasets are often limited in scale, synthetic, or only provide partial (fragment-level) mappings, hindering the training and evaluation of modern LLM decompilers.

Decompile-Bench comprises two million high-quality function-level binary-source pairs. This dataset was condensed from a raw collection of 100 million binary functions compiled from 450GB of permissively licensed GitHub projects. To create this benchmark, the authors developed a three-stage pipeline called the Compile-Trace-Filter (CTF) framework:

  1. Automatic Compilation: They forked the Clang compiler to enforce consistent optimization levels (-O0 to -O3) and include DWARF debug information (-g), addressing issues where projects ignore standard environment flags. The pipeline also includes automated dependency parsing and installation.
  2. Trace Binary-Source Correspondence: A "Source-Trace" algorithm is proposed to accurately map compiled binary functions back to their complete original source functions. It leverages DWARF information to find the source lines corresponding to a binary function's code segment, and then uses Tree-sitter to parse the source project and identify the full function body containing those lines. The candidate source function with the largest overlap with the binary's DWARF line records is selected as the match. This method helps overcome the fragmentation and reordering caused by compiler optimizations and inlining; a minimal sketch of the matching step appears after this list.
  3. Filter Data: A rigorous three-stage filtering process is applied to the initial 100 million raw pairs:
    • Project-scope filter: Removes source functions defined in system or dependency headers.
    • In-binary deduplicator: For template instantiations or other cases where multiple binary functions map to the same source function within a single binary, only the best match (largest DWARF-segment overlap) is kept.
    • Cross-binary deduplicator: Uses MinHash-LSH to eliminate near-duplicate source functions and assemblies across different binaries.
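
As a rough illustration of the selection rule at the heart of the Source-Trace step (step 2 above), the following Python sketch picks, for one binary function, the candidate source function whose line span overlaps the most DWARF line records. The data structures and field names here are illustrative assumptions, not the authors' implementation:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceFunction:
    file: str          # source file path
    start_line: int    # first line of the function body (e.g., from Tree-sitter)
    end_line: int      # last line of the function body

def match_source_function(
    dwarf_lines: list[tuple[str, int]],   # (file, line) records for one binary function
    candidates: list[SourceFunction],
) -> Optional[SourceFunction]:
    """Select the candidate source function with the largest overlap between
    its line span and the binary function's DWARF line records."""
    best, best_overlap = None, 0
    for fn in candidates:
        overlap = sum(
            1 for file, line in dwarf_lines
            if file == fn.file and fn.start_line <= line <= fn.end_line
        )
        if overlap > best_overlap:
            best, best_overlap = fn, overlap
    return best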

This filtering process is shown to be critical, reducing the raw 100 million pairs to two million high-quality ones. Analysis shows that the filtered data (Decompile-Bench) has significantly higher Cyclomatic Complexity and Halstead Difficulty compared to the unfiltered raw data and also compared to the executable subset of a previous benchmark, ExeBench, indicating it is more representative of real-world code complexity.
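
To illustrate the cross-binary deduplication step described above, the sketch below drops near-duplicate source functions using MinHash-LSH via the datasketch library; the whitespace tokenization and the 0.85 similarity threshold are assumptions rather than the paper's exact configuration:

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from the whitespace tokens of a source function."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def dedup(functions: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Keep one representative per cluster of near-duplicate source functions.
    `functions` maps a unique key (e.g., binary + function name) to source text."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for key, src in functions.items():
        sig = minhash(src)
        if not lsh.query(sig):        # no near-duplicate indexed so far
            lsh.insert(key, sig)
            kept.append(key)
    return kept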

For evaluation, the paper introduces Decompile-Bench-Eval. This evaluation suite is designed to be leakage-resistant and includes:

  • Manually translated C/C++ versions of the widely used HumanEval and MBPP code-completion benchmarks.
  • A new dataset, GitHub2025, consisting of binaries compiled from permissively licensed GitHub repositories published after 2025, ensuring a fresh, unseen dataset for testing.

The evaluation uses several metrics common in decompilation research:

  • Re-Executability: Measures whether the decompiled code, when recompiled, produces the same output as the original source code on a given test set. This assesses functional correctness; a minimal check is sketched after this list.
  • R2I (Relative Readability Index): A metric for quantitatively evaluating the readability of decompiled C code based on AST features.
  • Edit Similarity: Measures the Levenshtein distance-based similarity between the decompiled output and the original source code.
  • Additional metrics explored in the appendix include Embedding Similarity (using CodeSage) and CodeBLEU, which incorporates syntactic and semantic matching.
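
To make the re-executability metric concrete, a minimal check can recompile the decompiled function together with a test harness and compare program output; the file layout, compiler invocation, and timeout below are assumptions for illustration, not the benchmark's exact harness:

import os
import subprocess
import tempfile

def re_executable(decompiled_c: str, test_harness_c: str, expected_output: str) -> bool:
    """Compile the decompiled function with a test harness, run the binary,
    and compare its stdout to the expected output of the original source."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + test_harness_c)
        build = subprocess.run(["gcc", src, "-o", exe], capture_output=True)
        if build.returncode != 0:      # fails to recompile
            return False
        try:
            run = subprocess.run([exe], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0 and run.stdout.strip() == expected_output.strip()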

Experiments fine-tuning the state-of-the-art LLM4Decompile-End model (a 1.3B parameter model pre-trained on ExeBench) on just 10% of the Decompile-Bench data (resulting in LLM4Decompile-DCBench) demonstrate significant improvements. Compared to the baseline LLM4Decompile-End, the model trained on Decompile-Bench shows an average improvement of over 20% in re-executability on the HumanEval and MBPP benchmarks. It also achieves substantially higher scores on R2I, Edit Similarity, Embedding Similarity, and CodeBLEU, especially on the real-world GitHub2025 dataset, highlighting the value of training on real-world data. Ablation studies confirm that the quality filtering is essential, as training on the unfiltered raw data leads to degraded performance.
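
A fine-tuning run of this kind could be sketched with the Hugging Face Trainer roughly as follows. The model and data identifiers, prompt format, and hyperparameters are placeholders, not the paper's exact setup:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "llm4decompile-1.3b"                 # placeholder: base decompilation model
DATA = "decompile_bench_subset.jsonl"        # placeholder: ~10% of the function pairs

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def format_pair(example):
    # Assumed fields: "asm" (disassembled function) and "source" (original C function).
    text = f"# Assembly:\n{example['asm']}\n# Source:\n{example['source']}"
    return tokenizer(text, truncation=True, max_length=2048)

train_set = load_dataset("json", data_files=DATA)["train"].map(format_pair)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm4decompile-dcbench",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-5,
                           bf16=True),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()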

Beyond decompilation training, the authors note that Decompile-Bench can also support other binary analysis tasks. As a preliminary demonstration, an embedding model trained on 10% of the data for binary-source search achieved a 27% recall@1 on the GitHub2025 evaluation set, competitive with state-of-the-art methods.
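
Recall@1 for this binary-source search task can be computed directly from paired embeddings; the embedding model itself is assumed to be given, and row i of each matrix is assumed to correspond to the same ground-truth pair:

import numpy as np

def recall_at_1(binary_embs: np.ndarray, source_embs: np.ndarray) -> float:
    """Fraction of binary-function embeddings whose most similar source embedding
    (by cosine similarity) is the paired ground-truth source function."""
    b = binary_embs / np.linalg.norm(binary_embs, axis=1, keepdims=True)
    s = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    sims = b @ s.T                             # pairwise cosine similarity matrix
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(b))))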

The paper acknowledges limitations, including resource constraints that prevented training on the full two million pairs or with larger models for this paper, the difficulty and cost of decompiling at full project scope, and the ethical and legal constraints on training with non-permissively licensed code, which might otherwise further improve real-world effectiveness.

In conclusion, Decompile-Bench is presented as the first large-scale, publicly available benchmark of real-world function-level binary-source pairs, addressing a critical need for advancing LLM-based decompilation. The CTF framework provides a practical pipeline for generating such data. The experimental results strongly suggest that training on Decompile-Bench leads to significantly improved performance in terms of both functional correctness (re-executability) and code readability, underscoring its potential to accelerate future research and development in this domain. The data and code are released publicly.
