
CODA Framework: Neural Decompiler

Updated 22 May 2026
  • CODA Framework is a neural-based decompilation architecture that translates binary executables into high-level source code via AST generation and error correction.
  • It employs an instruction type-aware encoder and tree-structured decoder to preserve semantics and syntactic accuracy across diverse ISAs.
  • The framework integrates neural prediction with symbolic correction to iteratively refine code sketches, significantly improving program recovery accuracy.

CODA Framework (A Neural-Based Program Decompiler)

The CODA framework is an end-to-end neural code decompilation architecture that advances the task of translating binary executables to high-level source code. CODA addresses major shortcomings of traditional, heuristic-based decompilers, including their limited applicability to specific language pairs, inadequate functional preservation, and poor semantic interpretability. By leveraging neural program modeling, tree-structured decoding, and iterative symbolic error correction, CODA establishes state-of-the-art performance for program recovery from binaries, with significant advances in generalization, correctness, and robustness (Fu et al., 2019).

1. System Architecture

CODA decomposes the decompilation process into two major, sequential phases:

  • Phase 1: Code Sketch Generation. The input binary assembly is first parsed into a sequence of statements. An instruction type-aware encoder transforms this sequence into a set of hidden states, which condition a tree-structured decoder to generate an abstract syntax tree (AST) as a rough "code sketch."
  • Phase 2: Iterative Error Correction. An ensembled error predictor inspects the sketch for likely errors and proposes edits based on neural signals. A symbolic correction module applies each candidate edit, re-compiles the edited program, and accepts only edits that do not increase the Levenshtein distance (LD) between the re-compiled output and the input binary, iterating until convergence.

This two-phase architecture allows CODA to move rapidly from a plausible high-level candidate to exact, testable source-level reconstructions, even from unseen binaries and diverse ISAs (Fu et al., 2019).

2. Instruction Type-Aware Encoder

Input Representation

  • Each assembly statement is tokenized into opcode and up to three operands.
  • Operands are typed (register, immediate, memory, label) with distinct embeddings.
  • Three n-ary Tree-LSTM modules are allocated for instruction families: memory ("mem"), arithmetic ("art"), and branch ("br").
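The input representation above can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the `FAMILY` table, the regular expressions, and the MIPS-like syntax are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical opcode-to-family table for a few MIPS-like opcodes (illustrative only).
FAMILY = {
    "lw": "mem", "sw": "mem",
    "add": "art", "addi": "art", "sub": "art", "mul": "art",
    "beq": "br", "bne": "br", "j": "br",
}

@dataclass
class Statement:
    opcode: str
    operands: list  # list of (text, operand_type) pairs, at most three
    family: str     # "mem", "art", or "br"

def operand_type(tok: str) -> str:
    """Classify an operand as register, immediate, memory, or label."""
    if re.fullmatch(r"\$\w+", tok):
        return "register"
    if re.fullmatch(r"-?\d+|0x[0-9a-fA-F]+", tok):
        return "immediate"
    if re.fullmatch(r"-?\d*\(\$\w+\)", tok):
        return "memory"
    return "label"

def parse_statement(line: str) -> Statement:
    """Split one assembly statement into an opcode and up to three typed operands."""
    opcode, _, rest = line.strip().partition(" ")
    toks = [t.strip() for t in rest.split(",") if t.strip()]
    if len(toks) > 3:
        raise ValueError(f"more than three operands: {line!r}")
    if opcode not in FAMILY:
        raise ValueError(f"unknown opcode: {opcode!r}")
    return Statement(opcode, [(t, operand_type(t)) for t in toks], FAMILY[opcode])
```

In an encoder of this kind, the `family` field would select which of the three LSTM modules processes the statement, and each operand type would index its own embedding table.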

Neural Encoding

At statement index $n$:

  • $x_n \in \mathbb{R}^d$: embedding for opcode.
  • $(h_n, c_n)$: context hidden/cell state from previous statement.
  • $(h_n^{op_i}, c_n^{op_i})$: embedded state for operand $i \in \{0, 1, 2\}$.

For instruction type $i \in \{\text{mem}, \text{art}, \text{br}\}$,

  • $(h_{n+1}, c_{n+1}) = \text{LSTM}_i\left([h_n; h_n^{op_0}; h_n^{op_1}; h_n^{op_2}], [c_n; c_n^{op_0}; c_n^{op_1}; c_n^{op_2}], x_n\right)$

This structure facilitates family-specific compositional modeling of the assembler's operational semantics.

3. AST Tree Decoder and Attention

AST Representation

  • CODA employs a binary left-child/right-sibling encoding to handle arbitrary AST node arity within tree-structured decoding.
  • Two variant LSTM decoders govern left-child (LSTM_L) and right-sibling (LSTM_R) tree expansions.
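The left-child/right-sibling (LC-RS) encoding is a standard way to represent a tree of arbitrary arity as a binary tree. The sketch below shows the conversion in both directions; the `Node` and `BinNode` classes are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """n-ary AST node."""
    label: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class BinNode:
    """Binary LC-RS node: left is the first child, right is the next sibling."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None

def to_lcrs(node, siblings=()):
    """Encode an n-ary tree as a binary left-child/right-sibling tree."""
    b = BinNode(node.label)
    if node.children:
        b.left = to_lcrs(node.children[0], node.children[1:])
    if siblings:
        b.right = to_lcrs(siblings[0], siblings[1:])
    return b

def from_lcrs(b):
    """Decode a binary LC-RS tree back into the n-ary tree."""
    node = Node(b.label)
    child = b.left
    while child is not None:
        node.children.append(from_lcrs(child))
        child = child.right  # walk the sibling chain
    return node
```

Under this encoding, a decoder only ever has to make two kinds of expansion decisions (first child, next sibling), which matches the two decoder variants listed above.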

Attention and Node Prediction

  • At each expansion step $t$, given decoder state $(h_t, c_t)$ and parent token embedding $Ho_t$:
    • Compute attention over the encoder states to obtain a context vector.
    • Fuse the context vector with the decoder state.
    • Predict the next token from the fused representation.
    • Recursively update the decoder state for the left/right subtrees.

This design enables consistent generation of high-level structured code, aligned with the semantics of the input binary.
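As an illustration of the attention step, the following sketch computes attention weights and a context vector over a list of encoder states. It assumes plain dot-product scoring, which may differ from the scoring function CODA actually uses; `attend` and `softmax` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) using dot-product attention (an assumed scoring rule)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context
```

The context vector would then be combined with the decoder state before the next AST token is predicted.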

4. Iterative Error Correction and Ensembled Prediction

Error Predictor (EP)

  • EP operates on both the sketch AST nodes and ground-truth nodes, generating:

    • An error flag (binary).
    • Error type (misprediction, missing-statement, extra-statement).
    • Correction token (for mispredicted nodes).
  • Sequential processing via GRU over concatenated encoder–decoder hidden pairs, with attention for cross-reference.

Iterative Correction Machine

  • All flagged nodes/errors from the ensemble of error predictors populate a priority queue.
  • Top candidates are edited via symbolic delta procedures (token replace, subtree add/remove).
  • Acceptance gating: only adopt corrections that do not worsen LD between binary and candidate AST re-compilation.
  • Loop halts at perfect match (LD = 0) or after a fixed iteration budget.

This closed-loop system refines candidate programs, balancing neural recognition and symbolic verification for correctness.
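The correction loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: `toy_compile` and `TOY_ASM` stand in for a real compiler, programs are flat token lists rather than ASTs, and the `(confidence, edit)` proposals are supplied by hand instead of by trained predictors.

```python
import heapq

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Hypothetical stand-in for a compiler: each source token emits fixed pseudo-instructions.
TOY_ASM = {"if": ["beq"], "while": ["beq", "j"], "return": ["jr"], "x=x+1": ["addi"]}

def toy_compile(tokens):
    return [ins for t in tokens for ins in TOY_ASM[t]]

def apply_edit(tokens, edit):
    """Apply one symbolic edit: ("replace" | "add" | "remove", position, token)."""
    kind, pos, tok = edit
    out = list(tokens)
    if kind == "replace":
        out[pos] = tok
    elif kind == "add":
        out.insert(pos, tok)
    elif kind == "remove":
        del out[pos]
    return out

def correct(sketch, target_asm, proposals, budget=10):
    """Try edits in order of confidence; keep an edit only if LD does not get worse."""
    queue = [(-conf, i, edit) for i, (conf, edit) in enumerate(proposals)]
    heapq.heapify(queue)
    best = list(sketch)
    best_ld = levenshtein(toy_compile(best), target_asm)
    steps = 0
    while queue and best_ld > 0 and steps < budget:
        _, _, edit = heapq.heappop(queue)
        cand = apply_edit(best, edit)
        ld = levenshtein(toy_compile(cand), target_asm)
        if ld <= best_ld:  # acceptance gating
            best, best_ld = cand, ld
        steps += 1
    return best, best_ld
```

For example, with target assembly compiled from `["while", "x=x+1", "return"]` and the sketch `["if", "x=x+1"]`, the proposals "replace position 0 with `while`" and "add `return` at position 2" are both accepted and drive the distance to zero. Note that this toy version does not re-index edit positions after an insertion or removal, which a real system would have to handle.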

5. Training Objectives and Optimization

  • Sketch Generation Loss: cross-entropy over AST node predictions.

  • Error Predictor Loss: multi-task nodewise loss with flag, type, and token outputs.

  • Optimization: Adam optimizer, with learning rate schedule, gradient clipping (norm 1.0), batch sizes 50 (sketch) and 10 (EP), and dropout 0.5 on recurrent layers.
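One plausible way to write the two objectives is given below. The notation is an assumption made for illustration: $y_t$ is the gold AST token at expansion step $t$, $\text{ctx}_t$ denotes whatever the decoder conditions on at that step, $v$ ranges over sketch nodes, and the weights $\lambda_{\text{type}}$ and $\lambda_{\text{tok}}$ are hypothetical (the source does not state how the three error-predictor terms are combined).

```latex
% Sketch generation: cross-entropy over AST node predictions
\mathcal{L}_{\text{sketch}} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid \text{ctx}_t\right)

% Error predictor: per-node multi-task loss over flag, type, and correction token
\mathcal{L}_{\text{EP}} = \sum_{v} \left[ \mathcal{L}_{\text{flag}}(v)
  + \lambda_{\text{type}}\, \mathcal{L}_{\text{type}}(v)
  + \lambda_{\text{tok}}\, \mathcal{L}_{\text{tok}}(v) \right]
```

Each term in the second loss would itself be a cross-entropy over the corresponding output head.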

To handle the severe label imbalance, "no-error" negatives are sub-sampled to ensure ~35% representation per batch (Fu et al., 2019).
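The sub-sampling step might look like the sketch below. It assumes the ~35% figure refers to the share of "no-error" examples in the resulting batch, which is one reading of the description; `subsample_negatives` and its parameters are hypothetical names.

```python
import random

def subsample_negatives(examples, neg_fraction=0.35, seed=0):
    """Keep all error examples and sub-sample no-error ones to a target batch share.

    examples: list of (features, has_error) pairs.
    neg_fraction: assumed target share of no-error examples, must be in [0, 1).
    """
    pos = [e for e in examples if e[1]]
    neg = [e for e in examples if not e[1]]
    # Solve k / (k + len(pos)) = neg_fraction for k, the number of negatives to keep.
    k = round(neg_fraction * len(pos) / (1.0 - neg_fraction))
    k = min(k, len(neg))
    rng = random.Random(seed)
    batch = pos + rng.sample(neg, k)
    rng.shuffle(batch)
    return batch
```

With 13 error examples and 100 no-error examples, this keeps 7 negatives, giving a batch of 20 in which 35% of the examples are "no-error".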

6. Experimental Results and Comparative Evaluation

Benchmarks and Metrics

  • Synthetic benchmarks: Karel, library math, normal expr., mix.
  • Real programs: PyTorch C++, Hacker's Delight, unseen control/data shapes.
  • Source–target ISAs: MIPS, x86-64.
  • Metrics: token accuracy (fractional), program accuracy (exact AST match).
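The two metrics can be sketched as below. The exact definitions used in the paper are not given here, so this is an assumed formulation: token accuracy is the fraction of aligned positions that match (with any length mismatch counted as errors), and program accuracy is the fraction of programs whose token sequences match exactly.

```python
def token_accuracy(pred, gold):
    """Fraction of positions where the predicted token equals the gold token."""
    if not pred and not gold:
        return 1.0
    matches = sum(p == g for p, g in zip(pred, gold))
    # Dividing by the longer length penalizes missing or extra tokens.
    return matches / max(len(pred), len(gold))

def program_accuracy(preds, golds):
    """Fraction of programs recovered exactly."""
    if not golds:
        raise ValueError("no programs to evaluate")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

A program with one wrong token out of four scores 0.75 on token accuracy but contributes nothing to program accuracy, which is why the two numbers can diverge sharply.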

Performance

  • On MIPS, Inst2AST+Attn achieves ~97% token accuracy and full CODA reaches ~82% program recovery, exceeding the seq2seq+attn baseline by roughly 70 points.
  • On x86-64, ~90% token and ~80% program accuracy.
  • RetDec and byte-seq2seq classical baselines yield 0% exact recovery.
  • Qualitative EC repairs capture missed control (e.g., swapping "if" vs "while", injecting omitted returns).
  • Generalizes to unseen binary/data structures and ISAs with minimal code change (Fu et al., 2019).

7. Impact, Strengths, and Limitations

CODA demonstrates a novel synergy of neural perception (sequence and tree modeling) with symbolic, compiler-level error correction, yielding a platform for ISA-agnostic and high-fidelity decompilation.

Advantages:

  • Preserves AST syntax and original binary semantics (through recompilation and LD verification).
  • Outperforms both rule-based and vanilla neural decompilers across a spectrum of languages and architectures.
  • Minimal per-task hand-engineering due to general encoder/decoder setups.

Limitations and Future Directions:

  • Variable identifiers, types, and richer high-level annotations are not recovered (output is anonymized code).
  • Scalability to extremely large binaries is as yet untested.
  • Deeper integration of dynamic I/O, advanced type/alias inference, and further symbolic methods could increase robustness and practical coverage.

CODA's hybrid two-phase framework provides a foundation for next-generation program decompilation, combining the strengths of neural modeling and formal correctness constraints (Fu et al., 2019).

References (1)
