
LLM4Decompile: Neurosymbolic Decompilation

Updated 20 March 2026
  • The paper demonstrates a neurosymbolic pipeline that reconstructs high-level code from low-level representations using static analysis, multi-phase Chain-of-Thought prompting, and test-driven refinement.
  • LLM4Decompile is defined by its integration of explicit symbolic insights, such as stack traces and control-flow graphs, with LLM reasoning to enhance accuracy and readability.
  • The system shows significant performance improvements, with benchmarks indicating up to a 27.5 percentage point increase in functional correctness alongside gains in human-centric readability.

LLM for Decompilation (LLM4Decompile) refers to a class of neurosymbolic frameworks and LLM-based architectures that reconstruct high-level source code from low-level representations such as binaries, assembly, or IR. These systems integrate advanced program analysis with the reasoning, translation, and code synthesis capabilities of state-of-the-art LLMs, explicitly targeting functional correctness, readability, and usability of the decompiled output.

1. Neurosymbolic Decompilation Pipeline

LLM4Decompile characteristically comprises three stages (a minimal end-to-end sketch in Python follows the list):

  1. Static Stack or CFG Analysis
    • The input (e.g., WebAssembly’s .wat or native assembly) is parsed using domain-specific static analysis to extract explicit symbolic representations of program state, such as stack evolutions for stack-based languages (Fang et al., 2024) or full control-flow graphs (CFGs) for register architectures (Liu et al., 10 Mar 2025, Achamyeleh et al., 21 Jan 2026).
    • Example: for WASM/WAT, the stack state at instruction $i$ is represented as $S_i = f_{\mathrm{op}_i}(S_{i-1})$, tracking all value pushes/pops and splits at control-flow constructs (Fang et al., 2024).
  2. Neurosymbolic Chain-of-Thought (CoT) Prompting
    • The static analysis output is embedded as structured context within a multi-step LLM prompt.
    • The LLM is explicitly guided through:
      1. Type prediction (inferring data and parameter types).
      2. Variable semantic labeling (recovering human-meaningful names).
      3. Functional summarization (generating NL description).
      4. Source code emission (rendering compilable, idiomatic C/C++/other code) (Fang et al., 2024).
    • Illustrative system prompt:

      System: Translate the ANNOTATED WAT code step by step:
      Step 1: Type Prediction...
      Step 2: Variable Semantics...
      Step 3: NL Summary...
      Step 4: C++ Generation...
  3. Test-driven Validation and Refinement
    • Output is compiled and unit-tested using a test harness, and results are reported as pass/fail rates.
    • Results can be iteratively refined, with corrective prompts or post-processing toolchains (e.g., error repair, memory sanitizer passes) (Wong et al., 2023).
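
Taken together, the three stages form a loop: analyze, prompt, generate, test, and retry on failure. The following is a minimal sketch of that control flow, assuming hypothetical helpers (symbolic_analyze as a stand-in for the static analyzer, query_llm for model access, and g++ for the test compile); it illustrates the loop, not any paper's implementation:

import os
import subprocess
import tempfile

def symbolic_analyze(wat_text: str) -> str:
    """Stage 1 stub: a real analyzer would emit per-instruction stack states
    or a CFG summary; here the input is passed through unchanged."""
    return wat_text

def build_cot_prompt(annotated: str) -> str:
    """Stage 2: embed the symbolic annotations in a multi-step CoT prompt."""
    return (
        "System: Translate the ANNOTATED WAT code step by step:\n"
        "Step 1: Type Prediction...\n"
        "Step 2: Variable Semantics...\n"
        "Step 3: NL Summary...\n"
        "Step 4: C++ Generation...\n\n"
        + annotated
    )

def compiles_and_passes(cpp_source: str, test_harness: str) -> bool:
    """Stage 3: compile the candidate together with a unit-test harness
    and report whether the tests pass."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "candidate.cpp"), os.path.join(d, "candidate")
        with open(src, "w") as f:
            f.write(cpp_source + "\n" + test_harness)
        if subprocess.run(["g++", "-std=c++17", src, "-o", exe]).returncode != 0:
            return False
        return subprocess.run([exe], timeout=10).returncode == 0

def decompile(wat_text, test_harness, query_llm, max_rounds=3):
    """Run the full loop, appending a corrective note on each failure."""
    prompt = build_cot_prompt(symbolic_analyze(wat_text))
    for _ in range(max_rounds):
        candidate = query_llm(prompt)
        if compiles_and_passes(candidate, test_harness):
            return candidate
        prompt += "\nThe previous attempt failed to compile or pass tests; fix it."
    return None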

2. Integration of Symbolic Analysis and LLM Reasoning

A defining feature of LLM4Decompile is the tight coupling of symbolic program analysis with LLM learning and inference. This process exposes a program’s internal execution semantics in forms amenable to LLM reasoning:

  • Explicit stack traces (for stack machines): Annotated stack after every instruction helps prevent hallucinated types, guides variable renaming, and ensures accurate control-flow reasoning (Fang et al., 2024).
  • Abstract logic trees and CFGs (for register machines): Hierarchical pattern extraction (loops, branches, atomic blocks) produces Source-Level Abstract Logic Trees (SALT) (Wang et al., 18 Sep 2025) or structured graph overviews (Achamyeleh et al., 21 Jan 2026).
  • Alignment with source-level constructs: Fine-grained mappings (e.g., DWARF-based) align assembly or IR blocks with source-level statements, enabling statement-level learning objectives (Feng et al., 2024).

The LLM is furnished with these symbolic summaries through prompt engineering, directing attention to crucial structural and semantic features that would be lost in linear token streams.
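
For stack machines, the annotation step can be made concrete by replaying each instruction's stack effect and interleaving the resulting symbolic state with the disassembly. The sketch below uses a toy opcode table and output format that are illustrative assumptions, not the analyzer from (Fang et al., 2024):

# Toy stack-effect table for a handful of WAT opcodes: (pops, pushes).
STACK_EFFECT = {
    "i32.const": (0, 1),
    "local.get": (0, 1),
    "i32.add":   (2, 1),
    "i32.mul":   (2, 1),
    "drop":      (1, 0),
}

def annotate_stack(instrs):
    """Replay S_i = f_op_i(S_{i-1}) and interleave the symbolic stack state."""
    stack, lines, fresh = [], [], 0
    for ins in instrs:
        op = ins.split()[0]
        pops, pushes = STACK_EFFECT[op]
        args = [stack.pop() for _ in range(pops)]
        for _ in range(pushes):
            expr = f"{op}({', '.join(args)})" if args else f"v{fresh}"
            fresh += 1
            stack.append(expr)
        lines.append(f"{ins:<16};; stack: [{', '.join(stack)}]")
    return "\n".join(lines)

print(annotate_stack(["i32.const 1", "local.get 0", "i32.add"]))
# i32.const 1     ;; stack: [v0]
# local.get 0     ;; stack: [v0, v1]
# i32.add         ;; stack: [i32.add(v1, v0)]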

3. Architectures, Prompting, and Losses

LLM4Decompile systems are generally implemented on top of large decoder-only (transformer-based) LLMs pretrained on code corpora (e.g., DeepSeek-Coder, CodeLlama, GPT-4-class models). Specific enhancements include:

  • Multi-phase CoT prompting: Segregation of reasoning over types, semantics, control flow, and code generation into discrete steps, promoting transparency and debuggability (Fang et al., 2024).
  • Fine-tuning on synthetic and mined code pairs: Supervised and reinforcement-learning fine-tuning on millions of (binary/IR → high-level code) pairs, sometimes leveraging real-world projects with complex constructs (see Decompile-Bench (Tan et al., 19 May 2025), CodableLLM (Manuel et al., 2 Jul 2025)).
  • Auxiliary and joint losses: In addition to standard cross-entropy, some frameworks introduce step-by-step or alignment losses (statement-level, semantic feature-based) (Feng et al., 2024, Wang et al., 18 Sep 2025); a sketch follows this list.
  • Reinforcement learning with task-specific reward design: Two-phase RL in SK2Decompile, focusing first on structure preservation/compilability, then on semantic identifier/naming alignment (Tan et al., 26 Sep 2025).
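
As one concrete reading of the auxiliary-loss idea, a fine-tuning objective can combine token-level cross-entropy with a statement-alignment term. The PyTorch sketch below is illustrative only; the cosine-similarity form and the weight lam are assumptions standing in for the specific losses of (Feng et al., 2024, Wang et al., 18 Sep 2025):

import torch
import torch.nn.functional as F

def joint_loss(logits, target_ids, src_stmt_emb, tgt_stmt_emb, lam=0.1):
    """Token-level cross-entropy plus a statement-level alignment term that
    pulls paired source/target statement embeddings together."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    align = 1.0 - F.cosine_similarity(src_stmt_emb, tgt_stmt_emb, dim=-1).mean()
    return ce + lam * align

# Shapes: logits (B, T, V); target_ids (B, T); statement embeddings (B, S, D).
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
src_emb, tgt_emb = torch.randn(2, 4, 64), torch.randn(2, 4, 64)
print(joint_loss(logits, targets, src_emb, tgt_emb))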

4. Empirical Performance and Benchmarks

Evaluation is rigorous, relying on functional correctness under re-execution and human-centric readability/comprehension scores:

Benchmark           | Metric          | SOTA LLM4Decompile | Relevant Baseline       | Ī” (pp) | Reference
HumanEval-X         | PassRate        | 60.6%              | GPT-4 ICL 1-shot: 46.5% | +14.1  | (Fang et al., 2024)
MBXP                | PassRate        | 85.9%              | GPT-4 ICL 1-shot: 63.6% | +22.3  | (Fang et al., 2024)
Decompile-Eval      | TCP             | 70.4% (SALT4D)     | 59.8% (SccDec)          | +10.6  | (Wang et al., 18 Sep 2025)
HumanEval-Decompile | Re-Exec (%)     | 54.3% (ICL4D-R)    | 26.8% (LLM4Decomp)      | +27.5  | (Wang et al., 3 Nov 2025)
User Study          | Code Similarity | 75% Win Rate       | <50% (baselines)        | +25    | (Fang et al., 2024)

5. Robustness, Limitations, and Best Practices

Known strengths:

  • Large gains in functional correctness and readability over purely neural or purely symbolic baselines (Section 4).
  • Explicit symbolic context (stack traces, CFGs) curbs hallucinated types and supports accurate control-flow reasoning (Fang et al., 2024).
  • Multi-phase CoT output exposes intermediate reasoning steps, aiding transparency and debugging.

Principal limitations:

  • High dependency on large LLMs (GPT-4 or comparable) for optimal results.
  • Need for hand-crafted static analyses for each IR or binary format (Fang et al., 2024).
  • Manual post-processing remains necessary for edge-case artifacts (e.g., pass-by-reference in C++).
  • LLMs still exhibit failures in instruction-level control flow, e.g., incorrect loop/branch reconstruction in some settings (Jiang et al., 7 Feb 2025).

Best practices for extension and deployment:

  1. Construct a static analyzer for explicit symbolic state tracking.
  2. Employ multi-phase CoT prompting tailored to the IR/binary flavor.
  3. Integrate real test oracles for both functional validity and ablation analysis of reasoning components.
  4. Fine-tune sequentially on logic-structured, semantically aligned code/function pairs.
  5. Consider RL with structured, hybrid rewards combining compilability, structure-matching, and embedding-level semantic similarity (Tan et al., 26 Sep 2025); a sketch follows this list.
  6. Apply post-generation error repair and bounded in-prompt compiler feedback (Wong et al., 2023, Fang et al., 2024, Achamyeleh et al., 21 Jan 2026).
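
For practice 5, one way to compose the hybrid reward is to gate everything on compilability and then mix the remaining signals. The weights and the gating below are illustrative assumptions, not the reward design of (Tan et al., 26 Sep 2025):

def hybrid_reward(compiled_ok: bool, structure_score: float,
                  semantic_sim: float, w_struct: float = 0.5,
                  w_sem: float = 0.5) -> float:
    """Gate on compilability, then mix structure matching (e.g., AST/CFG
    overlap in [0, 1]) with embedding-level semantic similarity in [0, 1]."""
    if not compiled_ok:
        return 0.0  # non-compiling candidates earn nothing
    return w_struct * structure_score + w_sem * semantic_sim

# Example: a compiling candidate with decent structural and semantic match.
print(hybrid_reward(True, structure_score=0.8, semantic_sim=0.9))  # 0.85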

6. Emerging Variants and Research Directions

Several frameworks propose variations along the symbolic–neural axis:

  • SALT4Decompile leverages logic-block trees for binary code, achieving robustness to obfuscation and high human-comprehensibility (Wang et al., 18 Sep 2025).
  • SK2Decompile decomposes the task into ā€œskeletonā€ (structure) and ā€œskinā€ (naming) via sequential RL to maximize re-executability and readability (Tan et al., 26 Sep 2025).
  • HELIOS uses hierarchical CFG abstraction with prompt-encoded constraints, raising compilability by over 40 percentage points without LLM retraining (Achamyeleh et al., 21 Jan 2026).
  • ICL4Decomp retrieves semantically aligned in-context exemplars for each binary function, yielding up to 40% improvement in functional success (Wang et al., 3 Nov 2025).
  • D-LiFT introduces quality-driven RL fine-tuning, using cascading reward functions that ensure syntactic, semantic, and readability improvements only accrue for functionally correct rewrites (Zou et al., 11 Jun 2025).

Open problems include:

  • Extension to multi-function and whole-project binaries.
  • Automated discovery of compiler optimization context.
  • Semantic-preserving type and data-structure recovery for arbitrary architectures.
  • Integration with self-verifying symbolic execution and cross-compilation testing.

7. Historical and Conceptual Context

LLM4Decompile represents a convergence of neural machine translation for code (Katz et al., 2019), symbolic program analysis, and human-interpretable code synthesis. This contrasts with both purely symbolic decompilers and black-box neural translation models by enforcing structure-aware reasoning, symbolic alignment, and functional validation feedback. Early systems (Katz et al., 2019) established the feasibility of neural decompilation, but lacked robust handling of real-world assembly idioms, type inference, and human-readable output. Neurosymbolic recipes and chain-of-thought guidance, as formalized in recent work (Fang et al., 2024), have become central to achieving competitive accuracy and code usability.

