
SK²Decompile: Modular Neural Binary Decompilation

Updated 7 February 2026
  • SK²Decompile is a neural binary decompilation system that leverages a modular two-phase pipeline for structure recovery and identifier naming.
  • It improves functional correctness by decoupling control-flow recovery from semantic name regeneration, achieving significant gains in re-executability and identifier accuracy as validated on key benchmarks.
  • The framework employs reinforcement learning with compiler feedback and fine-grained alignment to bridge the gap between traditional decompilers and human-readable source code.

SK²Decompile refers to a class of neural binary decompilers that employ a modular, multi-phase strategy based on recent advances in LLMs, with the explicit goal of maximizing both the functional correctness and the human readability of decompiled C source from stripped binaries. The central design splits decompilation into sequential phases—usually structure recovery followed by identifier regeneration (the "skeleton-to-skin" paradigm)—or leverages fine-grained alignment and contextual reconstruction to overcome limitations of previous end-to-end or single-phase approaches. SK²Decompile systems, such as those presented in (Tan et al., 26 Sep 2025) and (Feng et al., 2024), represent the state-of-the-art in LLM-based decompilation by achieving substantial gains in decompiled code re-executability, identifier accuracy, and readability compared to both traditional rule-based tools and single-stage LLM methods.

1. Motivation and Conceptual Framework

Historically, binary decompilation has entailed a trade-off between functional fidelity (i.e., correct control/data-flow reconstruction) and human interpretability (i.e., succinct, semantically meaningful identifiers and source structures). Rule-driven tools like Ghidra and IDA Pro focus on reconstructing functional logic but produce obfuscated, hard-to-read output, while monolithic LLM-based decompilers yield more human-like code at the expense of accurate low-level semantics, often emitting code that fails to recompile or re-execute for nearly half of benchmark functions (Tan et al., 26 Sep 2025, Feng et al., 2024).

SK²Decompile introduces a principled division of the task based on the Information-Bottleneck principle. By extracting an obfuscated, structurally correct intermediate representation ("skeleton") and subsequently inferring identifiers ("skin"), SK²Decompile enables orthogonal optimization of correctness and readability (Tan et al., 26 Sep 2025). Variants exploit in-context demonstration (sc²dec), statement-level alignment, and multi-objective optimization to address decompilation’s inherent ambiguities and combinatorial complexity (Feng et al., 2024).

2. Two-Phase Decompilation Pipeline: Structure Recovery and Identifier Naming

SK²Decompile systems implement a two-phase workflow:

  1. Structure Recovery Stage: The first module translates stripped, IDA-style pseudocode into an obfuscated C-like IR in which all identifiers are replaced by placeholders (e.g., FUN_1, VAR_1). This IR preserves all control and data structures, expression nesting, and the general syntax needed for compilability, but omits semantic naming (Tan et al., 26 Sep 2025). The underlying model is a 6.7B-parameter transformer (LLM4Decompile) fine-tuned for sequence-to-sequence translation, with call-graph context interleaved into the prompt.

Training proceeds via supervised cross-entropy at the token level,

\mathcal{L}_{CE}(\theta) = -\sum_{i=1}^{N} \log P_\theta(y_i \mid y_{<i}, x),

then switches to policy-gradient reinforcement learning. The RL reward $r_{\text{structure}}$ combines a binary compilability check on the IR (via Psyche-C) with the Jaccard similarity between emitted and ground-truth placeholders:

r_{\text{structure}} = \begin{cases} 0, & \text{if compilation fails} \\ 1 + r_{\text{placeholder}}, & \text{if compilation succeeds} \end{cases}

where $r_{\text{placeholder}} = \frac{|I_{\text{gen}} \cap I_{\text{IR}}|}{|I_{\text{gen}} \cup I_{\text{IR}}|}$ (Tan et al., 26 Sep 2025).
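A minimal sketch of this structure reward, assuming a gcc syntax check as a stand-in for the Psyche-C frontend used in the paper, and regex extraction of FUN_n/VAR_n placeholders; all helper names are illustrative, not the authors' implementation:

```python
import os
import re
import subprocess
import tempfile

PLACEHOLDER = re.compile(r"\b(?:FUN|VAR)_\d+\b")

def check_compiles(ir_code: str) -> bool:
    """Syntax-check the obfuscated IR. The paper uses Psyche-C;
    `gcc -fsyntax-only` stands in for it here."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(ir_code)
        path = f.name
    try:
        return subprocess.run(["gcc", "-fsyntax-only", path],
                              capture_output=True).returncode == 0
    finally:
        os.unlink(path)

def structure_reward(generated_ir: str, reference_ir: str) -> float:
    """r_structure = 0 on compilation failure, else 1 + Jaccard
    similarity of placeholder sets in generated vs. reference IR."""
    if not check_compiles(generated_ir):
        return 0.0
    gen = set(PLACEHOLDER.findall(generated_ir))
    ref = set(PLACEHOLDER.findall(reference_ir))
    jaccard = len(gen & ref) / len(gen | ref) if gen | ref else 1.0
    return 1.0 + jaccard
```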

  2. Identifier Naming Stage: The second module receives the recovered IR and predicts meaningful names for all placeholders, producing human-readable C code. This is performed by a second transformer LLM, again initialized from LLM4Decompile-6.7B, trained with cross-entropy and further reinforced on the semantic similarity of its predictions to reference code. The naming reward leverages dense learned embeddings (e.g., CodeBERT) such that

r_{\text{naming}} = \cos(\mathbf{e}_{\text{gen}}, \mathbf{e}_{\text{src}}) - \delta \cdot \text{LengthPenalty}(\hat{I}),

which encourages semantic faithfulness without verbosity (Tan et al., 26 Sep 2025).
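A hedged sketch of this naming reward; the embedding model below is a generic stand-in (the paper reports CodeBERT-style and Qwen embedding models), and the exact form of the length penalty, excess average identifier length, is an assumption since the paper only specifies its role:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Generic stand-in embedder; not the model used in the paper.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def naming_reward(generated_src: str, reference_src: str,
                  identifiers: list[str],
                  delta: float = 0.01, target_len: int = 12) -> float:
    """r_naming = cos(e_gen, e_src) - delta * LengthPenalty(I_hat).
    Penalty form (characters beyond a target average identifier
    length) is assumed."""
    e_gen, e_src = embedder.encode([generated_src, reference_src])
    cos = float(np.dot(e_gen, e_src) /
                (np.linalg.norm(e_gen) * np.linalg.norm(e_src)))
    avg_len = sum(map(len, identifiers)) / max(len(identifiers), 1)
    penalty = max(0.0, avg_len - target_len)
    return cos - delta * penalty
```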

This decoupled design avoids combinatorial explosion in search space, supports independent RL-driven improvement of structural fidelity and name recovery, and is empirically validated to close the gap between compiler-grade correctness and human-grade nomenclature.
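The resulting two-phase inference flow amounts to two chained generation passes; a minimal sketch using Hugging Face transformers, with hypothetical checkpoint names and greedy decoding (the paper's reported decoding strategy):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint names; both stages are fine-tuned
# from LLM4Decompile-6.7B in the paper.
STRUCT_CKPT = "sk2decompile/structure-recovery"
NAMING_CKPT = "sk2decompile/identifier-naming"

def run_stage(ckpt: str, prompt: str, max_new_tokens: int = 2048) -> str:
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=False)  # greedy decoding
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)

def decompile(pseudocode: str) -> str:
    skeleton = run_stage(STRUCT_CKPT, pseudocode)   # stage 1: skeleton IR
    return run_stage(NAMING_CKPT, skeleton)         # stage 2: named C code
```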

3. Self-Constructed Context and Fine-Grained Alignment (sc²dec + FAE)

Alternative SK²Decompile approaches use in-context demonstration (sc²dec) and fine-grained alignment enhancement (FAE) (Feng et al., 2024):

  • sc²dec: At inference time, the model decompiles the assembly, recompiles and re-disassembles its own output, and then constructs a single tailored assembly–source demonstration. This demonstration is prepended as context to a new decompilation run on the original input, aligning the LLM with the style and idiosyncrasies of the target compiler/optimizer (see the sketch after this list). Empirically, sc²dec confers a 3.84% average re-executability gain on the Decompile-Eval benchmark.
  • FAE: During fine-tuning, compilers with DWARF debug info are used to synthesize step-aligned assembly/source training pairs, enabling the LLM to map segments of assembly to corresponding C statements. The loss combines end-to-end and alignment objectives:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{decomp}} + \lambda\, \mathcal{L}_{\text{align}}

where $\lambda \approx 1$, with $\mathcal{L}_{\text{align}}$ enforcing statement-level consistency and improving recoverability (Feng et al., 2024).
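The sc²dec loop referenced above can be sketched as follows, assuming caller-supplied compile/disassemble helpers and a model callable; the prompt format is illustrative:

```python
def sc2dec(asm: str, model, compile_fn, disassemble_fn) -> str:
    """Self-constructed context decompilation (sc²dec) sketch.
    Assumptions: compile_fn returns a binary path or None on failure,
    disassemble_fn returns assembly text, model(prompt) returns C."""
    # Pass 1: plain decompilation of the input assembly.
    draft_src = model(f"Decompile:\n{asm}")

    # Recompile the draft; if it fails, fall back to the raw output
    # (contextual refinement requires compilable code).
    binary = compile_fn(draft_src)
    if binary is None:
        return draft_src

    # Re-disassemble to get an assembly/source pair matching the
    # target compiler's style and optimization level.
    demo_asm = disassemble_fn(binary)

    # Pass 2: decompile again with the self-constructed demonstration
    # prepended as in-context guidance.
    prompt = (f"Example assembly:\n{demo_asm}\n"
              f"Example source:\n{draft_src}\n"
              f"Decompile:\n{asm}")
    return model(prompt)
```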

This two-stage recipe achieves a 7.35% absolute re-executability gain over baseline on Decompile-Eval, with both components contributing additive improvements.

4. Benchmarking Methodologies and Empirical Results

SK²Decompile models are systematically evaluated using large-scale benchmarks with paired assembly and source, covering a range of optimization levels and real-world idioms:

  • HumanEval: 164 hand-authored Python-to-C algorithmic functions, each validated by unit tests, probing both correctness and name recovery (Tan et al., 26 Sep 2025).
  • GitHub2025: 23,400 stripped C functions from 130 projects, with diverse APIs and coding patterns (Tan et al., 26 Sep 2025).
  • Decompile-Eval: General-purpose decompilation benchmark suite, covering compilation, re-executability, and identifier accuracy at gcc -O0/1/2/3 (Feng et al., 2024).

Key metrics include:

Metric                             Definition
Re-executability rate              Fraction of decompiled functions that compile and pass all provided tests
Relative Readability Index (R2I)   AST-based score in [0, 1] quantifying code structure clarity
Identifier recovery rate           Ratio of recovered identifiers to the reference set, as a percentage
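A minimal sketch of the re-executability metric, assuming each benchmark entry pairs a decompiled function with a self-contained C test harness that exits 0 when all tests pass; the concatenation convention is an assumption:

```python
import os
import subprocess
import tempfile

def re_executability(samples: list[tuple[str, str]], timeout: int = 10) -> float:
    """Fraction of (decompiled_function, test_harness) pairs that
    compile with gcc and exit 0 on execution."""
    passed = 0
    for func_src, harness_src in samples:
        with tempfile.TemporaryDirectory() as d:
            src, exe = os.path.join(d, "prog.c"), os.path.join(d, "prog")
            with open(src, "w") as f:
                f.write(func_src + "\n" + harness_src)
            if subprocess.run(["gcc", src, "-o", exe],
                              capture_output=True).returncode != 0:
                continue  # compilation failure
            try:
                run = subprocess.run([exe], capture_output=True,
                                     timeout=timeout)
            except subprocess.TimeoutExpired:
                continue  # hung executable counts as failure
            passed += run.returncode == 0
    return passed / len(samples) if samples else 0.0
```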

Selected results:

  • On HumanEval, SK²Decompile achieves 69.0% mean re-executability, a 21.6 pp gain over GPT-5-mini.
  • On GitHub2025, SK²Decompile attains 74.99% R2I (vs. Idioms' 57.97%), a 29.4% relative improvement (Tan et al., 26 Sep 2025).
  • sc²dec + FAE yields overall 55.03% average re-executability (vs. 47.68% baseline) on Decompile-Eval (Feng et al., 2024).

Ablative analysis confirms the benefit of each phase and reward: decompositional design yields ~9 points, compiler-checked RL ~10 points, and semantic naming rewards smaller but consistent gains (Tan et al., 26 Sep 2025). Removing FAE step-wise alignment drops re-executability by ≈5% (Feng et al., 2024).

5. Comparative Analysis and Implementation

Relative to both LLM-based and commercial decompilers, SK²Decompile outperforms along two primary axes:

  • Functional correctness: Achieves higher compilability and test-passing rates, verifying the value of RL with structural/compilability rewards.
  • Human readability: Significantly improves identifier naming and code layout metrics, as measured by R2I and identifier recovery.

A summary table of empirical results:

System          HumanEval ReExec   GitHub2025 R2I   Decompile-Eval ReExec (O0–O3)
LLM4Decompile   –                  –                47.68%
GPT-5-mini      56.75%             –                –
Idioms          –                  57.97%           –
SK²Decompile    69.0%              74.99%           55.03%

Implementation specifics include LLM4Decompile-6.7B base models, one epoch of supervised fine-tuning (batch size 128, learning rate $3 \times 10^{-6}$), GRPO-based RL via the veRL framework, compiler feedback from Psyche-C, semantic embeddings from qwen-embedding-0.6B, and training on NVIDIA H800 (80 GB) clusters. Decoding is greedy to minimize output stochasticity (Tan et al., 26 Sep 2025, Feng et al., 2024).
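For concreteness, the reported SFT hyperparameters might map onto a Hugging Face TrainingArguments configuration as below; only the epoch count, the effective batch size of 128, and the learning rate come from the paper, while the per-device/accumulation split and bf16 are assumptions:

```python
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="sk2decompile-sft",       # hypothetical path
    num_train_epochs=1,                  # one SFT epoch (reported)
    per_device_train_batch_size=8,       # assumed split of the
    gradient_accumulation_steps=16,      # reported effective batch of 128
    learning_rate=3e-6,                  # reported
    bf16=True,                           # assumed for H800 GPUs
    logging_steps=10,
)
```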

6. Limitations and Open Directions

SK²Decompile frameworks rely on several assumptions and have specific constraints:

  • Compiler Feedback: Both the RL and sc²dec stages presume the ability to compile and execute decompiler outputs; if the initial LLM output is uncompilable, contextual refinement cannot be applied and the system falls back to the raw output (Feng et al., 2024).
  • Debug Information: Fine-grained alignment (FAE) presumes the availability of DWARF debug info; fully stripped binaries require alternative heuristics (Feng et al., 2024).
  • Scaling Effects: FAE was evaluated on 10k functions and one LLM size; the effect of scaling corpus size or model parameters remains open (Feng et al., 2024).
  • Prompt Flexibility: Fine-tuning with alignment objectives can reduce zero/one-shot generalization on out-of-domain prompts, indicating a trade-off between contextual exploitation and few-shot transfer (Feng et al., 2024).

A plausible implication is that future research may focus on extending alignment-based training to larger corpora, cross-compiler generalization, and symbol-free binaries. There is also interest in investigating how to further integrate control-flow graph reasoning, semantic property decoding, or hybrid symbolic methods into LLM-driven pipelines.

7. Significance and Impact

SK²Decompile’s modular approach closes the historical gap between accurate decompilation and source-level usability by explicitly dividing and optimizing subproblems. Its two-phase (or, in alternative designs, context- and alignment-enhanced) structure enables significantly improved re-executability and readability, setting new state-of-the-art metrics across major benchmarks (Tan et al., 26 Sep 2025, Feng et al., 2024). This framework provides a reproducible, scalable template for future systems seeking to balance correctness and human interpretability in binary analysis, reverse engineering, and software provenance recovery.
