GitHub2025 Benchmark
- GitHub2025 is a temporally curated evaluation dataset that assesses LLM-based binary-to-source decompilers using modern open-source C/C++ projects.
- It employs a systematic preprocessing pipeline with DWARF-enhanced source-trace algorithms and deduplication to ensure accurate binary-source mappings.
- The benchmark provides key metrics such as R2I and GPT-Judge scores to differentiate decompiler performance and support reproducible research.
GitHub2025 is a temporally curated evaluation dataset designed to assess the generalization and robustness of decompilers, specifically LLM-based binary-to-source decompilers, on real-world software projects. The benchmark is constructed from permissively licensed, open-source C/C++ repositories created after 2025-01-01, which keeps it strictly separate from the training corpora of earlier datasets and mitigates data leakage. GitHub2025 is distributed as part of Decompile-Bench-Eval and is referenced in the LLM decompilation literature for evaluating the structural correctness and identifier recovery capabilities of advanced decompilation systems (Tan et al., 19 May 2025, Tan et al., 26 Sep 2025).
1. Dataset Composition and Metadata Schema
GitHub2025 contains a systematically selected and processed subset of GitHub repositories representing recent real-world software. Each entry is encoded as a JSON object with comprehensive metadata, including repository name, URL, creation timestamp (post-2025-01-01), SPDX license (e.g., MIT, BSD, Apache 2.0), star count (≥1), primary language(s) (C and/or C++), CMake project indicator, uncompressed source size, commit hash, optimized binaries (one per opt-level, with full DWARF debugging enabled), and function-pair mappings. The dataset explicitly excludes system headers, third-party submodules, and non-permissive or low-quality projects.
| Field | Type | Description |
|---|---|---|
| repo_name | string | GitHub owner/repository (e.g., "owner/project") |
| repo_url | string | HTTPS Git URL |
| created_at | string | ISO 8601 creation date (>2025-01-01T00:00:00Z) |
| license | string | SPDX license (permissive) |
| stars | integer | Number of GitHub stars (≥1) |
| languages | list | Subset of ["C", "C++"] |
| has_cmakelists | boolean | Always true |
| size_bytes | integer | Uncompressed source-tree size |
| commit_hash | string | Exact commit used for build |
| binaries | list | Objects per opt-level (O0–O3, all DWARF-included) |
| functions_extracted | integer | Post-deduplication function-pair count |
| filters | object | Counts removed by each filter stage |
This schema is designed to support exhaustive traceability and reproducibility of all contained binary-source mappings (Tan et al., 19 May 2025).
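As a rough illustration of how a consumer might sanity-check one metadata entry against the table above, the following sketch verifies field presence and types. The field names come from the schema; the helper `check_record` and the sample values are hypothetical and are not part of the released tooling.

```python
# Field names follow the schema table; the checker itself is an illustrative assumption.
EXPECTED_TYPES = {
    "repo_name": str, "repo_url": str, "created_at": str, "license": str,
    "stars": int, "languages": list, "has_cmakelists": bool, "size_bytes": int,
    "commit_hash": str, "binaries": list, "functions_extracted": int, "filters": dict,
}

def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry matches the schema."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    # The schema restricts languages to a subset of ["C", "C++"].
    if isinstance(record.get("languages"), list) and not set(record["languages"]) <= {"C", "C++"}:
        problems.append("languages must be a subset of ['C', 'C++']")
    return problems

# Hypothetical entry, used only to exercise the checker.
sample = {
    "repo_name": "owner/project", "repo_url": "https://github.com/owner/project.git",
    "created_at": "2025-03-01T12:00:00Z", "license": "MIT", "stars": 3,
    "languages": ["C"], "has_cmakelists": True, "size_bytes": 123456,
    "commit_hash": "0" * 40, "binaries": [], "functions_extracted": 0, "filters": {},
}
print(check_record(sample))  # []
```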
2. Selection and Preprocessing Pipeline
Repositories in GitHub2025 are required to satisfy the following criteria: (a) creation date strictly after 2025-01-01 (verified via GitHub API), (b) permissive license verified against accepted lists (e.g., Blue Oak Council, ScanCode), (c) C/C++ as primary language with at least one CMakeLists.txt file, and (d) minimum quality threshold (≥1 star). No restriction is imposed on application domain or project size.
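The four criteria can be read as a sequence of filters that also records how many repositories each stage removes, similar in spirit to the `filters` field of the schema. The sketch below is a minimal version under stated assumptions: the `PERMISSIVE` set is a tiny stand-in for the accepted license lists, the reason labels are invented, and `created_at` is assumed to be a UTC timestamp ending in `Z`.

```python
from collections import Counter
from datetime import datetime, timezone

CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)
# Assumption: a small stand-in for the accepted permissive-license lists.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}

def rejection_reason(repo: dict):
    """Return the first failed criterion (a hypothetical label), or None if the repo passes."""
    created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
    if created <= CUTOFF:                       # (a) strictly after 2025-01-01
        return "created_before_cutoff"
    if repo["license"] not in PERMISSIVE:       # (b) permissive license
        return "non_permissive_license"
    if not set(repo["languages"]) & {"C", "C++"} or not repo["has_cmakelists"]:
        return "not_cmake_c_cpp"                # (c) C/C++ with CMakeLists.txt
    if repo["stars"] < 1:                       # (d) at least one star
        return "too_few_stars"
    return None

def select(repos: list[dict]):
    """Split repos into kept entries and per-stage removal counts."""
    kept, removed = [], Counter()
    for repo in repos:
        reason = rejection_reason(repo)
        if reason is None:
            kept.append(repo)
        else:
            removed[reason] += 1
    return kept, dict(removed)
```

Calling `select` on a list of metadata dictionaries returns the surviving entries together with a dictionary of removal counts keyed by reason.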
Each project is compiled for x86_64 Linux using a forked Clang version (2025 commit), forcibly including DWARF debug symbols and cycling through all optimization levels (O0, O1, O2, O3) with the driver overriding any user flags. Dependencies are auto-resolved by parsing CMakeLists.txt find_package(...) directives and mapping them using a cached GPT-derived list, then re-invoking the build system (Tan et al., 19 May 2025).
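The dependency-resolution step can be illustrated with a small parser for `find_package(...)` directives. This is only a sketch: the regular expression handles the common single-line form, and `PACKAGE_MAP` is a hypothetical two-entry stand-in for the cached GPT-derived mapping, whose real contents are not given here.

```python
import re

# Matches the package name in a single-line find_package(...) call (case-insensitive command name).
FIND_PACKAGE = re.compile(r"\s*find_package\s*\(\s*([A-Za-z0-9_.+-]+)", re.IGNORECASE)

# Assumption: stand-in for the cached mapping from CMake package names to system packages.
PACKAGE_MAP = {"ZLIB": "zlib1g-dev", "OpenSSL": "libssl-dev"}

def find_packages(cmake_text: str) -> list[str]:
    """Collect package names from find_package directives, ignoring line comments."""
    names = []
    for line in cmake_text.splitlines():
        code = line.split("#", 1)[0]  # drop CMake line comments
        match = FIND_PACKAGE.match(code)
        if match:
            names.append(match.group(1))
    return names

def resolve(names: list[str], mapping: dict[str, str]):
    """Split names into installable system packages and names with no known mapping."""
    resolved = [mapping[n] for n in names if n in mapping]
    unresolved = [n for n in names if n not in mapping]
    return resolved, unresolved

example = """
cmake_minimum_required(VERSION 3.16)
find_package(ZLIB REQUIRED)
# find_package(Ignored)
  find_package(OpenSSL 1.1 REQUIRED)
find_package(SomethingRare)
"""
print(resolve(find_packages(example), PACKAGE_MAP))
```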
The critical "Source-Trace Algorithm" associates each binary function with its true source by combining DWARF line references with Tree-sitter parsing. Candidate source functions are scored by how much they overlap the DWARF line segments attributed to the binary function, and the candidate with the greatest overlap is selected as the mapping.
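The overlap-based selection can be sketched as follows. The exact scoring formula is not reproduced in this text, so the score used here (the number of DWARF-attributed lines that fall inside a candidate's span, with ties broken in favor of the shorter candidate) is an assumption, as are the names `SourceFunction`, `overlap`, and `best_match`. Candidate spans are assumed to have been extracted beforehand, for example with Tree-sitter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceFunction:
    file: str
    name: str
    start_line: int
    end_line: int  # inclusive

def overlap(candidate: SourceFunction, dwarf_lines) -> int:
    """Count DWARF (file, line) entries that fall inside the candidate's span."""
    return sum(
        1 for file, line in dwarf_lines
        if file == candidate.file and candidate.start_line <= line <= candidate.end_line
    )

def best_match(candidates, dwarf_lines):
    """Pick the candidate with the largest overlap; return None if nothing overlaps."""
    scored = [(overlap(c, dwarf_lines), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score > 0]
    if not scored:
        return None
    # Assumed tie-break: prefer the shorter span when overlap counts are equal.
    return max(scored, key=lambda sc: (sc[0], -(sc[1].end_line - sc[1].start_line)))[1]

candidates = [
    SourceFunction("a.c", "foo", 10, 20),
    SourceFunction("a.c", "bar", 30, 40),
]
# Lines attributed to one binary function; the header line could come from an inlined helper.
dwarf_lines = [("a.c", 12), ("a.c", 13), ("a.c", 31), ("util.h", 5)]
print(best_match(candidates, dwarf_lines).name)  # foo
```

Lines attributed to other files, such as inlined header code, simply contribute nothing to candidates in the primary file.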
Subsequently, functions not defined within the repository, in-binary duplicates, and near-duplicates across repositories are removed using project-scope filtering, overlap pruning, and MinHash-LSH deduplication (Tan et al., 19 May 2025). This pipeline ensures high-quality, one-to-one binary-source function correspondence.
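To make the MinHash-LSH step concrete, here is a small pure-Python sketch of near-duplicate removal. It is not the pipeline's implementation: the token shingling, the 64 hash functions, the 16-band split, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import hashlib

def shingles(code: str, k: int = 3) -> set[str]:
    """Whitespace-token k-grams of a function body."""
    tokens = code.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; equal positions estimate Jaccard similarity."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def deduplicate(functions: dict[str, str], threshold: float = 0.8,
                num_perm: int = 64, bands: int = 16) -> list[str]:
    """Keep the first function of each near-duplicate group, using LSH bands to find candidates."""
    rows = num_perm // bands
    buckets: dict[tuple, list[str]] = {}
    kept: dict[str, list[int]] = {}
    for name, code in functions.items():
        sig = minhash(shingles(code), num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        candidates = {other for key in keys for other in buckets.get(key, [])}
        if any(estimated_jaccard(sig, kept[o]) >= threshold for o in candidates):
            continue  # near-duplicate of an already kept function
        kept[name] = sig
        for key in keys:
            buckets.setdefault(key, []).append(name)
    return list(kept)

pool = {
    "repo1::add": "int add ( int x , int y ) { return x + y ; }",
    "repo2::add": "int add ( int x , int y ) { return x + y ; }",
    "repo1::log": "void log_msg ( const char * m ) { puts ( m ) ; }",
}
print(deduplicate(pool))
```

Banding means only functions that agree on at least one band of the signature are compared directly, which is what keeps the comparison tractable on a large function pool.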
3. Benchmark Scale and Structural Statistics
GitHub2025 comprises 121 repositories and approximately 60,000 function pairs after all selection and deduplication steps. Binaries are emitted for each optimization level per repository, yielding up to 484 unique binaries. The function pool has the following language distribution: C ≈ 65%, C++ ≈ 35%. Function lengths are typically 5–50 lines, as depicted in the length histogram (see Decompile-Bench Figure 3b).
| Opt Level | # Binaries | # Functions |
|---|---|---|
| O0 | 121 | 15,300 |
| O1 | 121 | 15,450 |
| O2 | 121 | 15,620 |
| O3 | 121 | 13,630 |
| Total | 484 | ≈60,000 |
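The totals in the table can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table; nothing else is assumed.

```python
# (binaries, functions) per optimization level, as listed in the table.
per_level = {
    "O0": (121, 15_300),
    "O1": (121, 15_450),
    "O2": (121, 15_620),
    "O3": (121, 13_630),
}

total_binaries = sum(binaries for binaries, _ in per_level.values())
total_functions = sum(functions for _, functions in per_level.values())
print(total_binaries, total_functions)  # 484 60000
```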
A plausible implication is that this systematic design captures a broad spectrum of real-world idioms, control flows, and data-structure patterns, spanning utility functions, data manipulations, string/file I/O, and numerical routines (Tan et al., 26 Sep 2025).
4. Evaluation Protocol and Metrics
GitHub2025 is positioned as the principal real-world, temporally held-out evaluation suite for LLM-based decompilers. The evaluation protocol requires producing C code from stripped binaries (using IDA Pro pseudocode as an intermediate), ensuring that no decompiler has seen any constituent project or its dependencies in the pre-2025 training datasets. Data-leakage mitigation comprises both the temporal split and subdirectory stripping to eliminate vendor, external, or submodule code (Tan et al., 19 May 2025).
Core evaluation metrics, as implemented in recent benchmarks (Tan et al., 26 Sep 2025), include:
- R2I (Relative Readability Index): Quantifies structural and syntactic readability (0–1) using 31 AST features; the customized version assigns 0 to unparsable outputs and uses Psyche-C for header generation.
- GPT-Judge: Evaluates identifier recovery quality (scale 1–5) via the GPT-5-mini model, prompted with a comparative, form-filling prompt.
- Re-executability: Not computed on GitHub2025 but on other benchmarks; measures whether synthesized code can be successfully executed.
These metrics enable comparison across decompiler architectures and optimization levels, facilitating fine-grained diagnosis of limitations in structure recovery and naming.
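One consequence of the customized R2I is that parse failures lower the average rather than being dropped from it. The sketch below shows that aggregation rule only; it does not compute R2I itself. Per-function scores are assumed to be precomputed, and the convention of marking an unparsable output with `None` is hypothetical.

```python
from statistics import mean
from typing import Optional

def average_r2i(scores: list[Optional[float]]) -> float:
    """Average per-function R2I scores, counting unparsable outputs (None) as 0."""
    return mean(0.0 if s is None else s for s in scores)

# Two parsable outputs and one parse failure.
print(average_r2i([0.8, None, 0.4]))
```

Dropping the failed output instead would give an average of 0.6 for the same inputs, so the zero-assignment rule penalizes systems that frequently emit unparsable code.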
5. Baselines and Empirical Results
GitHub2025 serves as a critical differentiator for state-of-the-art decompilers, especially when compared to contest- or synthetic-style benchmarks. All major systems are evaluated under identical pipelines:
- GPT-5-mini: Prompted on IDA pseudocode, no task-specific fine-tuning.
- LLM4Decompile (6.7B): Single-phase supervised fine-tuning checkpoint.
- Idioms: Joint structure and type recovery with call-graph context (Dramko et al., 6 Feb 2025).
- SK²Decompile: Two-phase RL-optimized skeleton (structure) → skin (identifier) model.
On GitHub2025, SK²Decompile achieves the highest average score on both the readability and naming metrics (R2I is shown scaled to 0–100):
| Model | R2I (AVG) | GPT-Judge (AVG) |
|---|---|---|
| GPT-5-mini | 28.95 | 2.87 |
| LLM4Decompile | 47.95 | 2.62 |
| Idioms | 57.97 | 2.18 |
| SK²Decompile | 74.99 | 3.06 |
SK²Decompile's two-phase process yields a 29.4% relative improvement in average R2I over Idioms (74.99 vs. 57.97), substantiating the advantage of decoupling structure and identifier recovery, particularly on the diverse, real-world function pool in GitHub2025 (Tan et al., 26 Sep 2025).
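The 29.4% figure is a relative gain, which can be reproduced directly from the table values:

```python
# Average R2I values copied from the results table.
r2i = {"GPT-5-mini": 28.95, "LLM4Decompile": 47.95, "Idioms": 57.97, "SK²Decompile": 74.99}

def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

print(round(relative_gain(r2i["SK²Decompile"], r2i["Idioms"]), 1))  # 29.4
```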
6. Access, Distribution, and Reproducibility
The full GitHub2025 dataset is publicly available via the Decompile-Bench HuggingFace release and associated source/build scripts on GitHub. Access involves:
```python
# Requires the Hugging Face `datasets` package: pip install datasets
from datasets import load_dataset

ds = load_dataset("LLM4Binary/decompile-bench-eval")
```
On-disk structure includes github2025/repos.jsonl for repository metadata, github2025/bin/<opt_level>/*.so for binaries, and github2025/mappings/ for extracted pairings. The precise compilation, line-tracing, and deduplication procedure can be exactly replicated using the releases and documented scripts (Tan et al., 19 May 2025).
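Given the layout described above, a local copy could be inspected roughly as follows. This is a sketch under assumptions: `root` is wherever the release was unpacked, the helper names are invented, and the names of the `<opt_level>` subdirectories are discovered from disk rather than assumed.

```python
import json
from pathlib import Path

def load_repos(root) -> list[dict]:
    """Read repository metadata, one JSON object per non-empty line."""
    path = Path(root) / "github2025" / "repos.jsonl"
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def list_binaries(root) -> dict[str, list[str]]:
    """Map each opt-level directory name to the sorted .so file names it contains."""
    bin_dir = Path(root) / "github2025" / "bin"
    return {
        level.name: sorted(p.name for p in level.glob("*.so"))
        for level in sorted(bin_dir.iterdir())
        if level.is_dir()
    }
```

For example, `list_binaries(root)` makes it straightforward to confirm that each optimization level has the expected number of binaries before running an evaluation.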
A plausible implication is that such distribution standards, coupled with strict temporal exclusion, establish GitHub2025 as the canonical benchmark for future LLM decompiler generalization, eliminating prior confounding effects from source-code overlap or synthetic pipeline artifacts.
References:
- Tan et al., 19 May 2025. Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation.
- Tan et al., 26 Sep 2025. SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin.