CODA Framework: Neural Decompiler
- CODA Framework is a neural-based decompilation architecture that translates binary executables into high-level source code via AST generation and error correction.
- It employs an instruction type-aware encoder and tree-structured decoder to preserve semantics and syntactic accuracy across diverse ISAs.
- The framework integrates neural prediction with symbolic correction to iteratively refine code sketches, significantly improving program recovery accuracy.
CODA Framework (A Neural-Based Program Decompiler)
The CODA framework is an end-to-end neural code decompilation architecture that advances the task of translating binary executables to high-level source code. CODA addresses major shortcomings of traditional, heuristic-based decompilers, including their limited applicability to specific language pairs, inadequate functional preservation, and poor semantic interpretability. By leveraging neural program modeling, tree-structured decoding, and iterative symbolic error correction, CODA establishes state-of-the-art performance for program recovery from binaries, with significant advances in generalization, correctness, and robustness (Fu et al., 2019).
1. System Architecture
CODA decomposes the decompilation process into two major, sequential phases:
- Phase 1: Code Sketch Generation. The input assembly is first parsed into a sequence of statements. An instruction type-aware encoder transforms this sequence into a set of hidden states, which condition a tree-structured decoder to generate an abstract syntax tree (AST) as a rough "code sketch."
- Phase 2: Iterative Error Correction. An ensemble of error predictors inspects the sketch for likely errors and proposes edits based on neural signals. A symbolic correction module applies candidate edits, recompiles the result, and keeps only those edits that do not increase the Levenshtein distance (LD) between the recompiled candidate and the input binary, iterating until convergence.
This two-phase architecture allows CODA to move rapidly from a plausible high-level candidate to exact, testable source-level reconstructions, even from unseen binaries and diverse ISAs (Fu et al., 2019).
2. Instruction Type-Aware Encoder
Input Representation
- Each assembly statement is tokenized into opcode and up to three operands.
- Operands are typed (register, immediate, memory, label) with distinct embeddings.
- Three n-ary Tree-LSTM modules are allocated for instruction families: memory ("mem"), arithmetic ("art"), and branch ("br").
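As a rough illustration of the input representation above, the following Python sketch splits one assembly statement into an opcode and typed operands, then routes it to one of the three instruction families. The `FAMILY` table, the regular expressions, and the MIPS-like syntax are assumptions made for this example; they are not taken from the CODA implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical opcode-to-family table covering a few MIPS-like mnemonics.
FAMILY = {
    "lw": "mem", "sw": "mem", "lb": "mem", "sb": "mem",
    "add": "art", "addi": "art", "addiu": "art", "sub": "art", "mul": "art",
    "beq": "br", "bne": "br", "j": "br", "jal": "br",
}


@dataclass
class Statement:
    opcode: str
    operands: list  # list of (text, operand_type) pairs
    family: str     # "mem", "art", or "br"


def operand_type(text):
    """Classify an operand as register, immediate, memory, or label."""
    if text.startswith("$"):
        return "register"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", text):
        return "immediate"
    if re.fullmatch(r"-?\w*\(\$\w+\)", text):  # e.g. 8($sp)
        return "memory"
    return "label"


def tokenize(line):
    """Split one statement into an opcode and at most three typed operands."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    parts = line.split(None, 1)
    opcode = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    operands = [tok.strip() for tok in rest.split(",") if tok.strip()]
    if len(operands) > 3:
        raise ValueError("expected at most three operands: " + line)
    typed = [(op, operand_type(op)) for op in operands]
    return Statement(opcode, typed, FAMILY[opcode])  # KeyError if opcode unknown
```

Each operand type would then index its own embedding table, and the `family` field would select which Tree-LSTM module encodes the statement.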
Neural Encoding
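At statement index $t$:
- $e_t$: embedding for the opcode.
- $(h_{t-1}, c_{t-1})$: context hidden/cell state from the previous statement.
- $(h_t^{(j)}, c_t^{(j)})$: embedded state for operand $j$.
For instruction type $k \in \{\text{mem}, \text{art}, \text{br}\}$,
$$(h_t, c_t) = \text{TreeLSTM}_k\big(e_t,\ (h_{t-1}, c_{t-1}),\ \{(h_t^{(j)}, c_t^{(j)})\}_j\big)$$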
This structure facilitates family-specific compositional modeling of the assembler's operational semantics.
3. AST Tree Decoder and Attention
AST Representation
- CODA employs a binary left-child/right-sibling encoding to handle arbitrary AST node arity within tree-structured decoding.
- Two variant LSTM decoders govern left-child (LSTM_L) and right-sibling (LSTM_R) tree expansions.
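The left-child/right-sibling encoding can be sketched in a few lines of Python. This is a generic illustration of the encoding itself, not CODA's code; the `Node` and `BinNode` classes and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """An ordinary AST node with any number of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class BinNode:
    """Binary node: `left` is the first child, `right` is the next sibling."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


def to_lcrs(node, siblings=()):
    """Encode `node` and the siblings that follow it as a binary tree."""
    b = BinNode(node.label)
    if node.children:
        b.left = to_lcrs(node.children[0], node.children[1:])
    if siblings:
        b.right = to_lcrs(siblings[0], siblings[1:])
    return b


def from_lcrs(b):
    """Invert the encoding: walk the right-sibling chain to rebuild children."""
    node = Node(b.label)
    child = b.left
    while child is not None:
        node.children.append(from_lcrs(child))
        child = child.right
    return node
```

Under this encoding a decoder only ever has to decide two things per node, whether to emit a first child and whether to emit a next sibling, which is what allows two LSTMs to cover nodes of any arity.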
Attention and Node Prediction
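- At each expansion step $t$, given decoder state $s_t$ and parent token embedding $e_p$:
- Compute attention over the $N$ encoder states $h_1, \dots, h_N$:
  $$\alpha_{t,i} = \frac{\exp(s_t^\top W_a h_i)}{\sum_{j=1}^{N} \exp(s_t^\top W_a h_j)}, \qquad v_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$$
- Fuse context and state: $\tilde{s}_t = \tanh(W_c [v_t; s_t])$.
- Predict next token:
  $$y_t = \arg\max\ \mathrm{softmax}(W_o \tilde{s}_t)$$
- Recursively update $s_t$ for left/right subtrees.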
This design enables consistent generation of high-level structured code, aligned with the semantics of the input binary.
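One decoder expansion step (attend, fuse, predict) might look like the following pure-Python sketch. Plain dot-product scoring is used here as a simplifying assumption, and the matrices `W_fuse` and `W_out` are hypothetical parameters; the actual scoring function and shapes in CODA may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in W]


def attention_step(state, encoder_states, W_fuse, W_out):
    """Run one expansion step and return (token_id, weights, fused_state)."""
    # 1. Score every encoder state against the decoder state (assumed dot product).
    weights = softmax([dot(state, h) for h in encoder_states])
    # 2. Context vector: attention-weighted sum of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(len(state))
    ]
    # 3. Fuse context and decoder state through a tanh layer.
    fused = [math.tanh(x) for x in matvec(W_fuse, context + state)]
    # 4. Predict the next AST token as the argmax over the vocabulary.
    probs = softmax(matvec(W_out, fused))
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, weights, fused
```

In a full decoder the fused state and the predicted token's embedding would be passed on to the left-child and right-sibling LSTMs to expand the subtrees.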
4. Iterative Error Correction and Ensembled Prediction
Error Predictor (EP)
EP operates on both the sketch AST nodes and ground-truth nodes, generating:
- An error flag (binary).
- Error type (misprediction, missing-statement, extra-statement).
- Correction token (for mispredicted nodes).
- Sequential processing via GRU over concatenated encoder-decoder hidden pairs, with attention for cross-reference.
Iterative Correction Machine
- All flagged nodes/errors from the ensembled predictors populate a priority queue.
- Top candidates are edited via symbolic delta procedures (token replace, subtree add/remove).
- Acceptance gating: only adopt corrections that do not worsen LD between binary and candidate AST re-compilation.
- Loop halts at perfect match (LD = 0) or after a fixed iteration budget.
This closed-loop system refines candidate programs, balancing neural recognition and symbolic verification for correctness.
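The control flow of the correction loop can be sketched as follows. This is a simplified illustration under several assumptions: only token replacement is modeled (subtree insertion and removal are omitted), the sketch is a flat token list rather than an AST, and `toy_compile` with its `TOY_ISA` table is a made-up stand-in for a real compiler.

```python
import heapq


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


# Toy stand-in for recompilation: one pseudo-instruction per source token.
TOY_ISA = {"if": "beq", "while": "loop", "x": "lw", "ret": "jr"}


def toy_compile(tokens):
    return [TOY_ISA[t] for t in tokens]


def correct(sketch, target_asm, proposals, compile_fn, budget=10):
    """Apply proposed edits in confidence order, keeping non-worsening ones.

    `proposals` is a list of (confidence, position, replacement_token).
    """
    # heapq is a min-heap, so negate confidence to pop the best edit first.
    queue = [(-conf, pos, tok) for conf, pos, tok in proposals]
    heapq.heapify(queue)
    best = list(sketch)
    best_ld = levenshtein(compile_fn(best), target_asm)
    steps = 0
    while queue and best_ld > 0 and steps < budget:
        _, pos, tok = heapq.heappop(queue)
        candidate = best[:pos] + [tok] + best[pos + 1:]
        ld = levenshtein(compile_fn(candidate), target_asm)
        if ld <= best_ld:  # acceptance gate: never let LD get worse
            best, best_ld = candidate, ld
        steps += 1
    return best, best_ld
```

For example, with target `toy_compile(["while", "x", "ret"])`, sketch `["if", "x", "x"]`, and proposals `[(0.9, 0, "while"), (0.8, 1, "ret"), (0.7, 2, "ret")]`, the first edit is accepted, the second is rejected because it raises the distance, and the third brings the distance to zero.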
5. Training Objectives and Optimization
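- Sketch Generation Loss: cross-entropy over AST node predictions.
  $$\mathcal{L}_{\text{sketch}} = -\sum_{t} \log p(y_t \mid y_{<t}, X)$$
  where $y_t$ is the $t$-th AST node token and $X$ is the input assembly.
- Error Predictor Loss: multi-task nodewise loss with flag, type, and token outputs.
  $$\mathcal{L}_{\text{EP}} = \sum_{n} \big( \mathcal{L}_{\text{flag}}(n) + \mathcal{L}_{\text{type}}(n) + \mathcal{L}_{\text{token}}(n) \big)$$
  where $n$ ranges over the nodes of the sketch AST.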
- Optimization: Adam optimizer, with learning rate schedule, gradient clipping (norm 1.0), batch sizes 50 (sketch) and 10 (EP), and dropout 0.5 on recurrent layers.
To counter severe label imbalance, "no-error" negatives are sub-sampled to ensure ~35% representation per batch (Fu et al., 2019).
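Two of the ingredients listed above, token-level cross-entropy and gradient clipping at norm 1.0, can be written out in plain Python to make the arithmetic concrete. This is an illustrative sketch with hypothetical function names; a real training run would rely on a deep learning framework's built-in versions.

```python
import math


def cross_entropy(logits_per_node, targets):
    """Summed negative log-likelihood of the target token ids.

    `logits_per_node[i]` holds unnormalized scores over the vocabulary for
    node i, and `targets[i]` is the index of the correct token.
    """
    total = 0.0
    for logits, target in zip(logits_per_node, targets):
        m = max(logits)
        # log of the softmax normalizer, computed stably
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

With uniform logits over two tokens the per-node loss is $\log 2$, and a gradient of `[3.0, 4.0]` (norm 5) is rescaled to a vector of norm 1.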
6. Experimental Results and Comparative Evaluation
Benchmarks and Metrics
- Synthetic benchmarks: Karel, math library, normal expressions, and a mix.
- Real programs: PyTorch C++, Hacker's Delight, unseen control/data shapes.
- Source-target ISAs: MIPS, x86-64.
- Metrics: token accuracy (fractional), program accuracy (exact AST match).
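The two metrics can be computed as in the sketch below. Programs are represented as flat token lists for simplicity, and normalizing token accuracy by the longer of the two sequences is an assumption of this example; the source does not spell out how length mismatches are scored.

```python
def token_accuracy(pred, ref):
    """Fraction of aligned positions where predicted and reference tokens agree.

    Assumption: the denominator is the longer sequence, so missing or extra
    tokens count as errors.
    """
    denom = max(len(pred), len(ref))
    if denom == 0:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / denom


def program_accuracy(preds, refs):
    """Fraction of programs whose token sequence matches the reference exactly."""
    exact = sum(p == r for p, r in zip(preds, refs))
    return exact / len(refs)
```

Program accuracy is the stricter metric: a single wrong token costs little token accuracy but makes that program count as a failure.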
Performance
- On MIPS, Inst2AST+Attn achieves ~97% token accuracy, and full CODA reaches ~82% program recovery, exceeding the seq2seq+attn baseline by about 70 percentage points.
- On x86-64, ~90% token and ~80% program accuracy.
- The classical decompiler RetDec and the byte-seq2seq baseline both yield 0% exact recovery.
- Qualitatively, error correction (EC) repairs control-flow mistakes in the sketch (e.g., swapping "if" and "while", inserting omitted returns).
- CODA generalizes to unseen binary/data structures and ISAs with minimal code changes (Fu et al., 2019).
7. Impact, Strengths, and Limitations
CODA demonstrates a novel synergy of neural perception (sequence and tree modeling) with symbolic, compiler-level error correction, yielding a platform for ISA-agnostic and high-fidelity decompilation.
Advantages:
- Preserves AST syntax and original binary semantics (through recompilation and LD verification).
- Outperforms both rule-based and vanilla neural decompilers across a spectrum of languages and architectures.
- Minimal per-task hand-engineering due to general encoder/decoder setups.
Limitations and Future Directions:
- Variable identifiers, types, and richer high-level annotations are not recovered (output is anonymized code).
- Scalability to extremely large binaries is as yet untested.
- Deeper integration of dynamic I/O, advanced type/alias inference, and further symbolic methods could increase robustness and practical coverage.
CODA's hybrid two-phase framework provides a foundation for next-generation program decompilation, combining the strengths of neural modeling and formal correctness constraints (Fu et al., 2019).