
GitHub2025 Benchmark

Updated 14 May 2026
  • GitHub2025 is a temporally curated evaluation dataset that assesses LLM-based binary-to-source decompilers using modern open-source C/C++ projects.
  • It employs a systematic preprocessing pipeline with DWARF-enhanced source-trace algorithms and deduplication to ensure accurate binary–source mappings.
  • The benchmark provides key metrics such as R2I and GPT-Judge scores to differentiate decompiler performance and support reproducible research.

GitHub2025 is a temporally curated evaluation dataset designed to assess the generalization and robustness of decompilers, specifically LLM-based binary-to-source decompilers, on real-world software projects. The benchmark is built from permissively licensed, open-source C/C++ repositories created after 2025-01-01, which keeps it disjoint from the pre-2025 corpora used to train existing decompilers and mitigates data leakage. GitHub2025 is distributed as part of Decompile-Bench-Eval and is used in the LLM decompilation literature to evaluate the structural correctness and identifier recovery of decompilation systems (Tan et al., 19 May 2025, Tan et al., 26 Sep 2025).

1. Dataset Composition and Metadata Schema

GitHub2025 contains a systematically selected and processed subset of GitHub repositories representing recent real-world software. Each entry is encoded as a JSON object with comprehensive metadata, including repository name, URL, creation timestamp (post-2025-01-01), SPDX license (e.g., MIT, BSD, Apache 2.0), star count (≄1), primary language(s) (C and/or C++), CMake project indicator, uncompressed source size, commit hash, optimized binaries (per opt-level, with full DWARF debugging enabled), and function-pair mappings. The dataset explicitly excludes system headers, third-party submodules, and non-permissive or low-quality projects.

| Field | Type | Description |
|---|---|---|
| repo_name | string | GitHub owner/repository (e.g., "owner/project") |
| repo_url | string | HTTPS Git URL |
| created_at | string | ISO 8601 creation date (> 2025-01-01T00:00:00Z) |
| license | string | SPDX license (permissive) |
| stars | integer | Number of GitHub stars (≄1) |
| languages | list | Subset of ["C", "C++"] |
| has_cmakelists | boolean | Always true |
| size_bytes | integer | Uncompressed source-tree size |
| commit_hash | string | Exact commit used for build |
| binaries | list | Objects per opt-level (O0–O3, all DWARF-included) |
| functions_extracted | integer | Post-deduplication function-pair count |
| filters | object | Counts removed by each filter stage |

This schema is designed to support exhaustive traceability and reproducibility of all contained binary–source mappings (Tan et al., 19 May 2025).
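As a rough illustration of the schema, the sketch below checks one record for the presence and JSON type of each field in the table. The record values are invented, and `validate_entry` is a hypothetical helper, not part of the released tooling.

```python
# Expected Python type for each field in the schema table.
SCHEMA = {
    "repo_name": str, "repo_url": str, "created_at": str, "license": str,
    "stars": int, "languages": list, "has_cmakelists": bool,
    "size_bytes": int, "commit_hash": str, "binaries": list,
    "functions_extracted": int, "filters": dict,
}

def validate_entry(entry):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}")
    return problems

# Invented example record; every value here is a placeholder.
example = {
    "repo_name": "owner/project",
    "repo_url": "https://github.com/owner/project.git",
    "created_at": "2025-03-02T10:00:00Z",
    "license": "MIT",
    "stars": 12,
    "languages": ["C"],
    "has_cmakelists": True,
    "size_bytes": 482113,
    "commit_hash": "0123456789abcdef0123456789abcdef01234567",
    "binaries": [{"opt_level": "O0"}, {"opt_level": "O2"}],
    "functions_extracted": 310,
    "filters": {"out_of_project": 41, "duplicates": 7},
}

print(validate_entry(example))
```

The inner layout of `binaries` and `filters` is not specified in the table, so the keys used above are assumptions.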

2. Selection and Preprocessing Pipeline

Repositories in GitHub2025 are required to satisfy the following criteria: (a) creation date strictly after 2025-01-01 (verified via GitHub API), (b) permissive license verified against accepted lists (e.g., Blue Oak Council, ScanCode), (c) C/C++ as primary language with at least one CMakeLists.txt file, and (d) minimum quality threshold (≄1 star). No restriction is imposed on application domain or project size.
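The four criteria can be read as a sequence of filters. The sketch below applies them in order to a few invented repository records and tallies how many candidates each stage removes. The `PERMISSIVE` set is a small illustrative subset rather than the accepted lists named above, and the record layout and stage labels are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)
# Illustrative subset only; the benchmark checks licenses against curated lists.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}

def first_failed_criterion(repo):
    """Return a label for the first criterion the repo fails, or None if it passes."""
    created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
    if created <= CUTOFF:
        return "a_creation_date"
    if repo["license"] not in PERMISSIVE:
        return "b_license"
    if repo["primary_language"] not in {"C", "C++"} or not repo["has_cmakelists"]:
        return "c_language_or_cmake"
    if repo["stars"] < 1:
        return "d_stars"
    return None

# Invented candidates for illustration.
candidates = [
    {"name": "a/new-tool", "created_at": "2025-03-02T10:00:00Z", "license": "MIT",
     "primary_language": "C", "has_cmakelists": True, "stars": 4},
    {"name": "b/old-lib", "created_at": "2023-06-01T00:00:00Z", "license": "MIT",
     "primary_language": "C++", "has_cmakelists": True, "stars": 90},
    {"name": "c/gpl-app", "created_at": "2025-02-01T00:00:00Z", "license": "GPL-3.0-only",
     "primary_language": "C", "has_cmakelists": True, "stars": 2},
    {"name": "d/unstarred", "created_at": "2025-04-10T00:00:00Z", "license": "Apache-2.0",
     "primary_language": "C++", "has_cmakelists": True, "stars": 0},
]

results = [(repo, first_failed_criterion(repo)) for repo in candidates]
kept = [repo["name"] for repo, why in results if why is None]
removed = Counter(why for _, why in results if why is not None)
print(kept, dict(removed))
```

Recording the first failing stage per repository is one way to produce per-stage removal counts of the kind stored in the `filters` field.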

Each project is compiled for x86_64 Linux using a forked Clang version (2025 commit), forcibly including DWARF debug symbols and cycling through all optimization levels (O0, O1, O2, O3) with the driver overriding any user flags. Dependencies are auto-resolved by parsing CMakeLists.txt find_package(...) directives and mapping them using a cached GPT-derived list, then re-invoking the build system (Tan et al., 19 May 2025).
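The dependency step can be pictured as follows: collect the package names from `find_package(...)` calls, then look each one up in a cached map. This is a minimal sketch under stated assumptions. `PACKAGE_MAP` is a hypothetical stand-in for the cached GPT-derived list, the Debian package names are only examples, and the regular expression handles simple single-line calls rather than the full CMake grammar.

```python
import re

# CMake command names are case-insensitive, so match accordingly.
FIND_PACKAGE = re.compile(r"find_package\s*\(\s*([A-Za-z0-9_+.\-]+)", re.IGNORECASE)

# Hypothetical cache: CMake package name -> system package to install.
PACKAGE_MAP = {"ZLIB": "zlib1g-dev", "OpenSSL": "libssl-dev"}

def required_packages(cmake_text):
    """Return package names requested via find_package, in file order."""
    names = []
    for line in cmake_text.splitlines():
        code = line.split("#", 1)[0]  # ignore CMake line comments
        names.extend(FIND_PACKAGE.findall(code))
    return names

def resolve(names):
    """Split names into those found in the cache and those left unresolved."""
    resolved = {n: PACKAGE_MAP[n] for n in names if n in PACKAGE_MAP}
    unresolved = [n for n in names if n not in PACKAGE_MAP]
    return resolved, unresolved

sample = """\
cmake_minimum_required(VERSION 3.16)
find_package(ZLIB REQUIRED)
# find_package(Ignored)
find_package( OpenSSL 1.1 REQUIRED )
find_package(Foo)
"""

names = required_packages(sample)
resolved, unresolved = resolve(names)
print(names, resolved, unresolved)
```

Names that stay unresolved would need another lookup before the build system is re-invoked. How the actual pipeline handles them is not described in the source.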

The critical "Source-Trace Algorithm" associates binary functions to their true source using DWARF line references and Tree-sitter parsing to maximize line-segment overlap. The mapping $f_b \rightarrow f_s^*$ is selected by scoring source candidates $f_s$ via overlap with DWARF line segments:

$$\mathrm{score}(f_s) = \lvert \mathrm{Lines}(f_s) \cap \mathrm{func\_segment} \rvert$$

$$f_s^* = \arg\max_{f_s} \mathrm{score}(f_s)$$
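The scoring rule above can be sketched in a few lines. The sketch assumes the source candidates have already been extracted (for example by a Tree-sitter pass) as inclusive line ranges, and that the DWARF line table has been reduced to a set of line numbers for the binary function. Breaking ties by taking the first candidate and returning `None` when nothing overlaps are choices made here, not details confirmed by the source.

```python
def score(span, func_segment):
    """Number of source lines in the inclusive span that appear in the DWARF segment."""
    start, end = span
    return len(set(range(start, end + 1)) & func_segment)

def best_source_match(func_segment, candidates):
    """Pick the candidate with maximal line overlap.

    func_segment: set of source line numbers attributed to the binary function.
    candidates:   dict mapping function name -> (start_line, end_line), inclusive.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda name: score(candidates[name], func_segment))
    # Assumption: a candidate with zero overlap is not a match.
    return best if score(candidates[best], func_segment) > 0 else None

# Invented example: the segment mostly falls inside parse_header.
segment = {10, 11, 12, 13, 20}
spans = {"parse_header": (8, 14), "main": (18, 30)}
print(best_source_match(segment, spans))
```

Inlined callees are one reason a segment can touch several source functions; taking the maximum overlap favors the function that contributes the most lines.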

Subsequently, functions not defined within the repository, in-binary duplicates, and near-duplicates across repositories are removed using project-scope filtering, overlap pruning, and MinHash-LSH deduplication (Tan et al., 19 May 2025). This pipeline ensures high-quality, one-to-one binary–source function correspondence.
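To make the deduplication step concrete, here is a small, dependency-free sketch of MinHash signatures with LSH banding. The number of hash functions, the band layout, the token shingle size, and the 0.8 similarity threshold are all assumptions chosen for illustration; the source does not state the parameters used.

```python
import hashlib
import re

NUM_HASHES = 64
BANDS = 16                      # 16 bands of 4 rows each
ROWS = NUM_HASHES // BANDS

def shingles(code, k=3):
    """Set of overlapping k-token windows over a crude tokenization."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash(items):
    """MinHash signature: the minimum salted hash value per hash function."""
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in items))
    return tuple(signature)

def deduplicate(functions, threshold=0.8):
    """Keep the first function of each near-duplicate group.

    functions: dict mapping an id to source text, in priority order.
    """
    buckets, signatures, kept = {}, {}, []
    for fid, code in functions.items():
        sig = minhash(shingles(code))
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # LSH: only functions sharing at least one band are compared.
        candidates = {other for key in keys for other in buckets.get(key, [])}
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, signatures[other])) / NUM_HASHES >= threshold
            for other in candidates)
        if is_duplicate:
            continue
        kept.append(fid)
        signatures[fid] = sig
        for key in keys:
            buckets.setdefault(key, []).append(fid)
    return kept

funcs = {
    "repo_a/add": "int add(int a, int b) { return a + b; }",
    "repo_b/add": "int add(int a, int b) { return a + b; }",
    "repo_c/len": "size_t len(const char *s) { size_t n = 0; while (s[n]) n++; return n; }",
}
print(deduplicate(funcs))
```

Exact copies always share every band and are collapsed. Near-duplicates are caught with a probability that depends on their Jaccard similarity and on the band layout, which is the usual trade-off when tuning LSH.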

3. Benchmark Scale and Structural Statistics

GitHub2025 comprises 121 repositories and approximately 60,000 function pairs after all selection and deduplication steps. Binaries are emitted for each optimization level per repository, yielding up to 484 unique binaries. The function pool exhibits the following language distribution: C ā‰ˆ 65%, C++ ā‰ˆ 35%; function lengths are typically 5–50 lines, as depicted in the length histogram (see Decompile-Bench Figure 3b).

| Opt Level | # Binaries | # Functions |
|---|---|---|
| O0 | 121 | 15,300 |
| O1 | 121 | 15,450 |
| O2 | 121 | 15,620 |
| O3 | 121 | 13,630 |
| Total | 484 | ā‰ˆ60,000 |
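The totals in the table follow directly from the per-level rows, as this short check shows. The numbers are copied from the table; only the variable names are introduced here.

```python
REPOSITORIES = 121
functions_per_level = {"O0": 15_300, "O1": 15_450, "O2": 15_620, "O3": 13_630}

# One binary per repository per optimization level.
total_binaries = REPOSITORIES * len(functions_per_level)
total_functions = sum(functions_per_level.values())

print(total_binaries, total_functions)  # 484 60000
```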

A plausible implication is that this systematic design captures a broad spectrum of real-world idioms, control flows, and data-structure patterns, spanning utility functions, data manipulations, string/file I/O, and numerical routines (Tan et al., 26 Sep 2025).

4. Evaluation Protocol and Metrics

GitHub2025 is positioned as the principal real-world, temporally held-out evaluation suite for LLM-based decompilers. The evaluation protocol requires producing C code from stripped binaries (using IDA Pro pseudocode as an intermediate), ensuring that no decompiler has seen any constituent project or its dependencies in the pre-2025 training datasets. Data-leakage mitigation comprises both the temporal split $\mathcal{R}_{\mathrm{train}} \cap \mathcal{R}_{2025} = \emptyset$ and subdirectory stripping to eliminate vendor, external, or submodule code (Tan et al., 19 May 2025).
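Both leakage controls are simple to express. The sketch below checks that two repository sets are disjoint and drops source files that sit under vendored directories. The directory names in `EXCLUDED_DIRS` are assumptions, since the source names only the categories (vendor, external, submodule), and the repository names are invented.

```python
from pathlib import PurePosixPath

# Assumed directory names for vendored or third-party code.
EXCLUDED_DIRS = {"vendor", "external", "third_party", "submodules"}

def is_temporally_disjoint(train_repos, eval_repos):
    """True when no repository appears in both the training and evaluation sets."""
    return set(train_repos).isdisjoint(eval_repos)

def strip_vendored(paths):
    """Keep only files with no excluded directory among their parent components."""
    return [p for p in paths
            if not EXCLUDED_DIRS & set(PurePosixPath(p).parts[:-1])]

paths = ["src/main.c", "vendor/zlib/inflate.c", "lib/external/foo.c", "external.c"]
print(strip_vendored(paths))
print(is_temporally_disjoint({"old/a", "old/b"}, {"new/c"}))
```

Only parent directories are inspected, so a file that merely happens to be named `external.c` is kept.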

Core evaluation metrics, as implemented in recent benchmarks (Tan et al., 26 Sep 2025), include:

  • R2I (Relative Readability Index): Quantifies structural and syntactic readability on a 0–1 scale using 31 AST features. The implementation is customized to assign a score of 0 to unparsable outputs and uses Psyche-C for header generation.
  • GPT-Judge: Evaluates identifier recovery quality on a 1–5 scale, using the GPT-5-mini model prompted with a comparative, form-filling prompt.
  • Re-executability: Measures whether the decompiled code can be recompiled and executed successfully. It is computed on other benchmarks but not on GitHub2025.

These metrics enable comparison across decompiler architectures and optimization levels, facilitating fine-grained diagnosis of limitations in structure recovery and naming.
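One consequence of the scoring convention for R2I is that unparsable outputs pull the average down instead of being skipped. A minimal sketch, assuming per-function scores are already available and that `None` marks an output that failed to parse (both the representation and the helper name are hypothetical):

```python
from statistics import mean

def average_r2i(scores):
    """Mean R2I over functions, counting unparsable outputs (None) as 0."""
    return mean(0.0 if s is None else s for s in scores)

# Invented scores: the second output did not parse.
print(average_r2i([0.8, None, 0.4]))
```

Dropping the failed output instead would give 0.6 for the same data, so the convention matters when comparing systems with different parse-failure rates.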

5. Baselines and Empirical Results

GitHub2025 separates state-of-the-art decompilers more clearly than contest-style or synthetic benchmarks do. All systems are evaluated under an identical pipeline.

On GitHub2025, SK²Decompile achieves the highest readability and naming scores among the compared systems:

| Model | R2I (AVG) | GPT-Judge (AVG) |
|---|---|---|
| GPT-5-mini | 28.95 | 2.87 |
| LLM4Decompile | 47.95 | 2.62 |
| Idioms | 57.97 | 2.18 |
| SK²Decompile | 74.99 | 3.06 |

SK²Decompile's two-phase process yields a 29.4% relative improvement in average R2I over Idioms (74.99 vs. 57.97), which supports the case for decoupling structure recovery from identifier recovery, particularly on the diverse, real-world function pool in GitHub2025 (Tan et al., 26 Sep 2025).
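The 29.4% figure is a relative gain, which can be reproduced from the table values:

```python
# Average R2I scores copied from the results table.
r2i = {"GPT-5-mini": 28.95, "LLM4Decompile": 47.95, "Idioms": 57.97, "SK²Decompile": 74.99}

absolute_gain = r2i["SK²Decompile"] - r2i["Idioms"]
relative_gain = absolute_gain / r2i["Idioms"] * 100

print(round(absolute_gain, 2), round(relative_gain, 1))  # 17.02 29.4
```

The absolute gap is about 17 R2I points; dividing by the Idioms baseline gives the relative improvement quoted in the text.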

6. Access, Distribution, and Reproducibility

The full GitHub2025 dataset is publicly available via the Decompile-Bench HuggingFace release and associated source/build scripts on GitHub. Access involves:

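# Shell: install the Hugging Face client library
pip install datasets

# Python: download the evaluation suite from the Hugging Face Hub
from datasets import load_dataset
ds = load_dataset("LLM4Binary/decompile-bench-eval")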

On-disk structure includes github2025/repos.jsonl for repository metadata, github2025/bin/<opt_level>/*.so for binaries, and github2025/mappings/ for extracted pairings. The precise compilation, line-tracing, and deduplication procedure can be exactly replicated using the releases and documented scripts (Tan et al., 19 May 2025).
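Since `repos.jsonl` is a JSON Lines file, the repository metadata can be read with the standard library alone. The sketch below parses records from any iterable of lines and totals the extracted functions per license. It runs on an in-memory sample with invented values; the field names follow the schema table, and the helper names are hypothetical.

```python
import io
import json

def read_repos(lines):
    """Parse JSON Lines input: one repository record per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

def functions_by_license(records):
    """Sum functions_extracted per SPDX license."""
    totals = {}
    for rec in records:
        totals[rec["license"]] = totals.get(rec["license"], 0) + rec["functions_extracted"]
    return totals

# In-memory stand-in for open("github2025/repos.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"repo_name": "a/x", "license": "MIT", "functions_extracted": 310}\n'
    '\n'
    '{"repo_name": "b/y", "license": "Apache-2.0", "functions_extracted": 95}\n'
    '{"repo_name": "c/z", "license": "MIT", "functions_extracted": 40}\n'
)

records = read_repos(sample)
print(functions_by_license(records))
```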

A plausible implication is that such distribution standards, coupled with strict temporal exclusion, make GitHub2025 a strong candidate for a standard benchmark of LLM decompiler generalization, since they reduce the confounding effects of source-code overlap and synthetic pipeline artifacts.


References:

  • [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation, (Tan et al., 19 May 2025)]
  • [SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin, (Tan et al., 26 Sep 2025)]
