
LLM4Decompile: Neurosymbolic Decompilation

Updated 20 March 2026
  • The paper demonstrates a neurosymbolic pipeline that reconstructs high-level code from low-level representations using static analysis, multi-phase Chain-of-Thought prompting, and test-driven refinement.
  • LLM4Decompile is defined by its integration of explicit symbolic insights, such as stack traces and control-flow graphs, with LLM reasoning to enhance accuracy and readability.
  • The system shows significant performance improvements, with benchmarks indicating up to a 27.5 percentage point increase in functional correctness alongside gains in human-centric readability.

LLM for Decompilation (LLM4Decompile) refers to a class of neurosymbolic frameworks and LLM-based architectures that reconstruct high-level source code from low-level representations such as binaries, assembly, or IR. These systems integrate advanced program analysis with the reasoning, translation, and code synthesis capabilities of state-of-the-art LLMs, explicitly targeting functional correctness, readability, and usability of the decompiled output.

1. Neurosymbolic Decompilation Pipeline

LLM4Decompile characteristically comprises three stages (a minimal end-to-end sketch in Python follows the list):

  1. Static Stack or CFG Analysis
    • The input (e.g., WebAssembly’s .wat or native assembly) is parsed using domain-specific static analysis to extract explicit symbolic representations of program state, such as stack evolutions for stack-based languages (Fang et al., 2024) or full control-flow graphs (CFGs) for register architectures (Liu et al., 10 Mar 2025, Achamyeleh et al., 21 Jan 2026).
    • Example: for WASM/WAT, the stack state at instruction $i$ is represented as $S_i = f_{\mathrm{op}_i}(S_{i-1})$, tracking all value pushes/pops and splits at control-flow constructs (Fang et al., 2024).
  2. Neurosymbolic Chain-of-Thought (CoT) Prompting
    • The static analysis output is embedded as structured context within a multi-step LLM prompt.
    • The LLM is explicitly guided through:
      1. Type prediction (inferring data and parameter types).
      2. Variable semantic labeling (recovering human-meaningful names).
      3. Functional summarization (generating NL description).
      4. Source code emission (rendering compilable, idiomatic C/C++/other code) (Fang et al., 2024).
    • Illustrative system prompt:

      System: Translate the ANNOTATED WAT code step by step:
      Step 1: Type Prediction...
      Step 2: Variable Semantics...
      Step 3: NL Summary...
      Step 4: C++ Generation...
  3. Test-driven Validation and Refinement
    • Output is compiled and unit-tested using a test harness, and results are reported as pass/fail rates.
    • Results can be iteratively refined, with corrective prompts or post-processing toolchains (e.g., error repair, memory sanitizer passes) (Wong et al., 2023).
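
Taken together, the three stages form a loop: analyze, prompt, generate, test, and retry on failure. The following is a minimal sketch of that control flow, assuming hypothetical helpers (symbolic_analyze as a stand-in for the static analyzer, query_llm for model access, and g++ for the test compile); it illustrates the loop, not any paper's implementation:

import os
import subprocess
import tempfile

def symbolic_analyze(wat_text: str) -> str:
    """Stage 1 stub: a real analyzer would emit per-instruction stack states
    or a CFG summary; here the input is passed through unchanged."""
    return wat_text

def build_cot_prompt(annotated: str) -> str:
    """Stage 2: embed the symbolic annotations in a multi-step CoT prompt."""
    return (
        "System: Translate the ANNOTATED WAT code step by step:\n"
        "Step 1: Type Prediction...\n"
        "Step 2: Variable Semantics...\n"
        "Step 3: NL Summary...\n"
        "Step 4: C++ Generation...\n\n"
        + annotated
    )

def compiles_and_passes(cpp_source: str, test_harness: str) -> bool:
    """Stage 3: compile the candidate together with a unit-test harness
    and report whether the tests pass."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "candidate.cpp"), os.path.join(d, "candidate")
        with open(src, "w") as f:
            f.write(cpp_source + "\n" + test_harness)
        if subprocess.run(["g++", "-std=c++17", src, "-o", exe]).returncode != 0:
            return False
        return subprocess.run([exe], timeout=10).returncode == 0

def decompile(wat_text, test_harness, query_llm, max_rounds=3):
    """Run the full loop, appending a corrective note on each failure."""
    prompt = build_cot_prompt(symbolic_analyze(wat_text))
    for _ in range(max_rounds):
        candidate = query_llm(prompt)
        if compiles_and_passes(candidate, test_harness):
            return candidate
        prompt += "\nThe previous attempt failed to compile or pass tests; fix it."
    return None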

2. Integration of Symbolic Analysis and LLM Reasoning

A defining feature of LLM4Decompile is the tight coupling of symbolic program analysis with LLM learning and inference. This process exposes a program’s internal execution semantics in forms amenable to LLM reasoning:

  • Explicit stack traces (for stack machines): Annotated stack after every instruction helps prevent hallucinated types, guides variable renaming, and ensures accurate control-flow reasoning (Fang et al., 2024).
  • Abstract logic trees and CFGs (for register machines): Hierarchical pattern extraction (loops, branches, atomic blocks) produces Source-Level Abstract Logic Trees (SALT) (Wang et al., 18 Sep 2025) or structured graph overviews (Achamyeleh et al., 21 Jan 2026).
  • Alignment with source-level constructs: Fine-grained mappings (e.g., DWARF-based) align assembly or IR blocks with source-level statements, enabling statement-level learning objectives (Feng et al., 2024).

The LLM is furnished with these symbolic summaries through prompt engineering, directing attention to crucial structural and semantic features that would be lost in linear token streams.
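
For stack machines, the annotation step can be made concrete by replaying each instruction's stack effect and interleaving the resulting symbolic state with the disassembly. The sketch below uses a toy opcode table and output format that are illustrative assumptions, not the analyzer from (Fang et al., 2024):

# Toy stack-effect table for a handful of WAT opcodes: (pops, pushes).
STACK_EFFECT = {
    "i32.const": (0, 1),
    "local.get": (0, 1),
    "i32.add":   (2, 1),
    "i32.mul":   (2, 1),
    "drop":      (1, 0),
}

def annotate_stack(instrs):
    """Replay S_i = f_op_i(S_{i-1}) and interleave the symbolic stack state."""
    stack, lines, fresh = [], [], 0
    for ins in instrs:
        op = ins.split()[0]
        pops, pushes = STACK_EFFECT[op]
        args = [stack.pop() for _ in range(pops)]
        for _ in range(pushes):
            expr = f"{op}({', '.join(args)})" if args else f"v{fresh}"
            fresh += 1
            stack.append(expr)
        lines.append(f"{ins:<16};; stack: [{', '.join(stack)}]")
    return "\n".join(lines)

print(annotate_stack(["i32.const 1", "local.get 0", "i32.add"]))
# i32.const 1     ;; stack: [v0]
# local.get 0     ;; stack: [v0, v1]
# i32.add         ;; stack: [i32.add(v1, v0)]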

3. Architectures, Prompting, and Losses

LLM4Decompile systems are generally implemented on top of large decoder-only (transformer-based) LLMs pretrained on code corpora (e.g., DeepSeek-Coder, CodeLlama, GPT-4-class models). Specific enhancements include:

  • Multi-phase CoT prompting: Segregation of reasoning over types, semantics, control flow, and code generation into discrete steps, promoting transparency and debuggability (Fang et al., 2024).
  • Fine-tuning on synthetic and mined code pairs: Supervised and reinforcement-learning fine-tuning on millions of (binary/IR → high-level code) pairs, sometimes leveraging real-world projects with complex constructs (see Decompile-Bench (Tan et al., 19 May 2025), CodableLLM (Manuel et al., 2 Jul 2025)).
  • Auxiliary and joint losses: In addition to standard cross-entropy, some frameworks introduce step-by-step or alignment losses (statement-level, semantic feature-based) (Feng et al., 2024, Wang et al., 18 Sep 2025); a sketch follows this list.
  • Reinforcement learning with task-specific reward design: Two-phase RL in SK2Decompile, focusing first on structure preservation/compilability, then on semantic identifier/naming alignment (Tan et al., 26 Sep 2025).
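
As one concrete reading of the auxiliary-loss idea, a fine-tuning objective can combine token-level cross-entropy with a statement-alignment term. The PyTorch sketch below is illustrative only; the cosine-similarity form and the weight lam are assumptions standing in for the specific losses of (Feng et al., 2024, Wang et al., 18 Sep 2025):

import torch
import torch.nn.functional as F

def joint_loss(logits, target_ids, src_stmt_emb, tgt_stmt_emb, lam=0.1):
    """Token-level cross-entropy plus a statement-level alignment term that
    pulls paired source/target statement embeddings together."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    align = 1.0 - F.cosine_similarity(src_stmt_emb, tgt_stmt_emb, dim=-1).mean()
    return ce + lam * align

# Shapes: logits (B, T, V); target_ids (B, T); statement embeddings (B, S, D).
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
src_emb, tgt_emb = torch.randn(2, 4, 64), torch.randn(2, 4, 64)
print(joint_loss(logits, targets, src_emb, tgt_emb))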

4. Empirical Performance and Benchmarks

Evaluation is rigorous, relying on functional correctness under re-execution and human-centric readability/comprehension scores:

Benchmark           | Metric          | SOTA LLM4Decompile | Relevant Baseline       | Ī” (pp) | Reference
HumanEval-X         | PassRate        | 60.6%              | GPT-4 ICL 1-shot: 46.5% | +14.1  | (Fang et al., 2024)
MBXP                | PassRate        | 85.9%              | GPT-4 ICL 1-shot: 63.6% | +22.3  | (Fang et al., 2024)
Decompile-Eval      | TCP             | 70.4% (SALT4D)     | 59.8% (SccDec)          | +10.6  | (Wang et al., 18 Sep 2025)
HumanEval-Decompile | Re-Exec (%)     | 54.3% (ICL4D-R)    | 26.8% (LLM4Decomp)      | +27.5  | (Wang et al., 3 Nov 2025)
User Study          | Code Similarity | 75% Win Rate       | <50% (baselines)        | +25    | (Fang et al., 2024)

5. Robustness, Limitations, and Best Practices

Known strengths:

  • Large gains in functional correctness and readability over purely neural or purely symbolic baselines (Section 4).
  • Explicit symbolic context (stack traces, CFGs) curbs hallucinated types and supports accurate control-flow reasoning (Fang et al., 2024).
  • Multi-phase CoT output exposes intermediate reasoning steps, aiding transparency and debugging.

Principal limitations:

  • High dependency on large LLMs (GPT-4 or comparable) for optimal results.
  • Need for hand-crafted static analyses for each IR or binary format (Fang et al., 2024).
  • Manual post-processing remains necessary for edge-case artifacts (e.g., pass-by-reference in C++).
  • LLMs still exhibit failures in instruction-level control flow, e.g., incorrect loop/branch reconstruction in some settings (Jiang et al., 7 Feb 2025).

Best practices for extension and deployment:

  1. Construct a static analyzer for explicit symbolic state tracking.
  2. Employ multi-phase CoT prompting tailored to the IR/binary flavor.
  3. Integrate real test oracles for both functional validity and ablation analysis of reasoning components.
  4. Fine-tune sequentially on logic-structured, semantically aligned code/function pairs.
  5. Consider RL with structured, hybrid rewards combining compilability, structure-matching, and embedding-level semantic similarity (Tan et al., 26 Sep 2025); a sketch follows this list.
  6. Apply post-generation error repair and bounded in-prompt compiler feedback (Wong et al., 2023, Fang et al., 2024, Achamyeleh et al., 21 Jan 2026).
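
For practice 5, one way to compose the hybrid reward is to gate everything on compilability and then mix the remaining signals. The weights and the gating below are illustrative assumptions, not the reward design of (Tan et al., 26 Sep 2025):

def hybrid_reward(compiled_ok: bool, structure_score: float,
                  semantic_sim: float, w_struct: float = 0.5,
                  w_sem: float = 0.5) -> float:
    """Gate on compilability, then mix structure matching (e.g., AST/CFG
    overlap in [0, 1]) with embedding-level semantic similarity in [0, 1]."""
    if not compiled_ok:
        return 0.0  # non-compiling candidates earn nothing
    return w_struct * structure_score + w_sem * semantic_sim

# Example: a compiling candidate with decent structural and semantic match.
print(hybrid_reward(True, structure_score=0.8, semantic_sim=0.9))  # 0.85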

6. Emerging Variants and Research Directions

Several frameworks propose variations along the symbolic–neural axis:

  • SALT4Decompile leverages logic-block trees for binary code, achieving robustness to obfuscation and high human-comprehensibility (Wang et al., 18 Sep 2025).
  • SK2Decompile decomposes the task into ā€œskeletonā€ (structure) and ā€œskinā€ (naming) via sequential RL to maximize re-executability and readability (Tan et al., 26 Sep 2025).
  • HELIOS uses hierarchical CFG abstraction with prompt-encoded constraints, raising compilability by over 40 percentage points without LLM retraining (Achamyeleh et al., 21 Jan 2026).
  • ICL4Decomp retrieves semantically aligned in-context exemplars for each binary function, yielding up to 40% improvement in functional success (Wang et al., 3 Nov 2025).
  • D-LiFT introduces quality-driven RL fine-tuning, using cascading reward functions that ensure syntactic, semantic, and readability improvements only accrue for functionally correct rewrites (Zou et al., 11 Jun 2025).

Open problems include:

  • Extension to multi-function and whole-project binaries.
  • Automated discovery of compiler optimization context.
  • Semantic-preserving type and data-structure recovery for arbitrary architectures.
  • Integration with self-verifying symbolic execution and cross-compilation testing.

7. Historical and Conceptual Context

LLM4Decompile represents a convergence of neural machine translation for code (Katz et al., 2019), symbolic program analysis, and human-interpretable code synthesis. This contrasts with both purely symbolic decompilers and black-box neural translation models by enforcing structure-aware reasoning, symbolic alignment, and functional validation feedback. Early systems (Katz et al., 2019) established the feasibility of neural decompilation, but lacked robust handling of real-world assembly idioms, type inference, and human-readable output. Neurosymbolic recipes and chain-of-thought guidance, as formalized in recent work (Fang et al., 2024), have become central to achieving competitive accuracy and code usability.

