LLMs as Idiomatic Decompilers: Recovering High-Level Code from x86-64 Assembly for Dart

Published 2 Apr 2026 in cs.SE | (2604.02278v1)

Abstract: Translating machine code into human-readable high-level languages is an open research problem in reverse engineering. Despite recent advancements in LLM-based decompilation to C, modern languages like Dart and Swift are unexplored. In this paper, we study the use of small specialized LLMs as an idiomatic decompiler for such languages. Additionally, we investigate the augmentation of training data using synthetic same-language examples, and compare it against adding human-written examples using related-language (Swift -> Dart). We apply CODEBLEU to evaluate the decompiled code readability and compile@k to measure the syntax correctness. Our experimental results show that on a 73-function Dart test dataset (representing diverse complexity levels), our 4B specialized model achieves 71.3 CODEBLEU (95% CI 65.5-77.1), approximately comparable to a ~480B code model (73.1; 67.4-78.8). On a subset of 34 natural Dart functions, it reaches compile@5 = 79.4% (Wilson 95% CI 63.2-89.7), vs. 64.7% (47.9-78.5) for the base model; the difference is suggestive but not statistically significant at 0.05. Our results indicate that adding Swift training data helps at 8B but not at 4B, suggesting a capacity threshold for effective cross-lingual transfer. Our experimental results show that small specialized models can generate readable, idiomatic Dart with meaningful identifiers while using minimal compute.


Authors (2)

Summary

The paper presents a neural pipeline that decompiles optimized x86-64 binaries into idiomatic Dart code using fine-tuned small LLMs.
It contrasts synthetic augmentation with Swift-based cross-lingual transfer, revealing capacity-dependent effects on CodeBLEU and compile@k metrics.
The work shows that compute-efficient domain-specialized LLMs can rival foundation models while addressing unique decompilation challenges.

LLMs as Idiomatic Decompilers for Dart: High-Level Recovery from x86-64 Assembly

Overview

The paper "LLMs as Idiomatic Decompilers: Recovering High-Level Code from x86-64 Assembly for Dart" (2604.02278) investigates how well small, specialized LLMs can recover readable, idiomatic Dart code from x86-64 binaries. The core contributions are: (1) the first neural pipeline for Dart/Swift decompilation from assembly, (2) an analysis contrasting synthetic augmentation and related-language (Swift) transfer in low-resource contexts, and (3) an evaluation using CodeBLEU and compile@k metrics, including uncertainty quantification. Results show that a 4B model comes close to the CodeBLEU scores of models over two orders of magnitude larger, while the benefit of cross-lingual training depends on model capacity.

Background and Motivation

Decompilation, the translation of machine code into human-readable languages, is critical for reverse engineering. While LLMs have outperformed classical decompilers by generating more semantically faithful and idiomatic C code, adaptation to modern languages such as Dart and Swift has remained unexplored. These languages pose additional challenges: training data is sparse, and their features (e.g., null-safety, closure semantics, higher abstraction levels) lead to more complex idiomatic constructions than in C.

The paper targets this gap by investigating whether fine-tuned small LLMs can recover high-level idiomatic Dart code from optimized binaries, and whether incorporating related-language (Swift) examples benefits small models under data scarcity.

Methodology

The approach employs a multi-stage pipeline: source code in Dart or Swift is compiled into x86-64 binaries (Dart: optimized AOT; Swift: unoptimized, -O0), disassembled to assembly code, and then decompiled back to the high-level language using custom fine-tuned LLMs. Model performance is evaluated with CodeBLEU for code similarity and idiomaticity, and with compile@k for syntactic correctness.

Figure 1: Decompilation pipeline from Dart/Swift source through binary and x86-64 assembly to high-level code, with CodeBLEU and compile@k evaluation metrics.
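
The pairing of assembly input with source-code target can be pictured as a small record type. The following is a minimal sketch, assuming a plain prompt/target text format; the class name, field names, prompt wording, and the toy assembly are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecompilationExample:
    """One (assembly, source) pair produced by compile -> disassemble."""
    language: str   # target high-level language, e.g. "Dart" or "Swift"
    assembly: str   # x86-64 disassembly of a single function
    source: str     # original high-level source of that function


def to_training_pair(example: DecompilationExample) -> dict:
    """Render a supervised fine-tuning pair (hypothetical prompt wording)."""
    prompt = (
        "Decompile the following x86-64 assembly into idiomatic "
        f"{example.language}:\n{example.assembly.strip()}\n"
    )
    return {"prompt": prompt, "target": example.source.strip()}


# Toy data for illustration only; real Dart AOT output looks different.
example = DecompilationExample(
    language="Dart",
    assembly="mov eax, edi\nadd eax, esi\nret\n",
    source="int add(int a, int b) => a + b;\n",
)
pair = to_training_pair(example)
print(pair["prompt"])
print(pair["target"])
```

At inference time only the prompt half would be supplied, and the model's completion would be scored against the held-out source.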

Data Construction

Training Data: Two main sets were constructed: (1) Dart-only (246 natural + 948 synthetic, total 1,194), and (2) Dart+Swift (246 Dart + 754 Swift, total 1,000). Synthetic examples were generated by leading LLMs (GPT, Claude, DeepSeek, Qwen) and validated for syntactic integrity.
Augmentation Methods: Two strategies are compared: (a) expansion via synthetic Dart pairs and (b) inclusion of human-written Swift $\rightarrow$ Dart pairs.
Assembly Generation: Dart uses production-style AOT compilation with boundary-preserving pragmas; Swift is compiled with -O0 for structural alignment. An observed confound is the mismatch in optimization, potentially impacting patterns learned across languages.
“Thinking tokens”: Sparse annotation with chain-of-thought traces is introduced to help guide cross-level semantic mapping.
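
The composition of the two training sets follows directly from the counts listed above. The short sketch below only redoes that arithmetic; the dictionary keys and the helper name are illustrative.

```python
# Counts as reported in the Training Data item above.
DART_NATURAL = 246
DART_SYNTHETIC = 948
SWIFT_NATURAL = 754

datasets = {
    "dart_only": {"dart_natural": DART_NATURAL, "dart_synthetic": DART_SYNTHETIC},
    "dart_plus_swift": {"dart_natural": DART_NATURAL, "swift_natural": SWIFT_NATURAL},
}


def composition(parts):
    """Return the total size and each part's share of that total."""
    total = sum(parts.values())
    shares = {name: count / total for name, count in parts.items()}
    return total, shares


for name, parts in datasets.items():
    total, shares = composition(parts)
    summary = ", ".join(f"{key}={value:.1%}" for key, value in shares.items())
    print(f"{name}: total={total} ({summary})")
```

Natural Dart makes up roughly a fifth to a quarter of either set, and the two sets differ in total size (1,194 vs. 1,000), which is worth keeping in mind when comparing the two augmentation strategies.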

Model Architecture and Training

Base Models: Qwen3-4B and DeepSeek-R1-8B represent the core architectures, both with state-of-the-art attention and activation mechanisms, extended to 32k context windows via distillation.
Fine-Tuning: LoRA with DoRA enhancement targets full transformer blocks, with modest compute (single NVIDIA H200, $\leq$ 3 hours). Label smoothing and aggressive dropout are used to limit overfitting on the small training set.
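
DoRA, as originally proposed, re-parameterizes each adapted weight matrix into a per-column magnitude vector and a direction that is updated by a low-rank LoRA term. The sketch below illustrates that decomposition in plain Python on a toy 2x2 matrix; the shapes, rank, and values are illustrative assumptions and say nothing about the paper's actual hyperparameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, b, a, m):
    """W' = m * (W0 + B A) / ||W0 + B A||, normalized column by column."""
    delta = matmul(b, a)  # low-rank LoRA update
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))] for i in range(len(w0))]
    norms = column_norms(v)
    return [[m[j] * v[i][j] / norms[j] for j in range(len(v[0]))] for i in range(len(v))]


w0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen pretrained weight (toy values)
a = [[0.5, -0.5]]               # rank-1 LoRA factor A
m = column_norms(w0)            # magnitudes start at the column norms of W0

b_init = [[0.0], [0.0]]         # B starts at zero, so the update is zero
w_init = dora_weight(w0, b_init, a, m)

b_trained = [[1.0], [0.0]]      # pretend training has changed B
w_trained = dora_weight(w0, b_trained, a, m)
print(w_init)
print(column_norms(w_trained))
```

With `B` at zero the adapted weight equals the pretrained one, and after `B` changes the column norms still equal `m`: the low-rank term only steers direction, while magnitude is learned separately.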

Evaluation Protocol

CodeBLEU: Measures n-gram overlap, AST, and data-flow similarity, with identical decoding settings across models (temperature 0.2, top-p 0.99, beam 1).
compile@k: Success rate for syntax-valid code, using $k=5$ decodings per input; uncertainty is reported with Wilson intervals, and significance is assessed with two-proportion $z$-tests.
Idiomaticity: Qualitative analysis (single reviewer, preliminary).
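
The compile@k metric and its Wilson interval can be sketched in a few lines. The sketch assumes compile@k counts an input as a success when at least one of its first k samples compiles; that definition, the toy result matrix, and the count 27 of 34 (inferred from the reported 79.4%) are assumptions rather than details confirmed by the paper.

```python
import math


def compile_at_k(results, k):
    """Fraction of inputs for which any of the first k samples compiles."""
    hits = sum(1 for samples in results if any(samples[:k]))
    return hits / len(results)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Toy compile outcomes: one row per input function, one flag per sample.
results = [
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, True, True, True, True],
    [False, False, False, False, True],
]
print(compile_at_k(results, 1))  # only the third function compiles first try
print(compile_at_k(results, 5))  # three of four compile within five samples

# 27 successes out of 34 is inferred from the reported 79.4%.
low, high = wilson_interval(27, 34)
print(f"{27 / 34:.1%} [{low:.1%}, {high:.1%}]")
```

Under these assumptions the interval for 27 of 34 comes out at about 63.2% to 89.7%, matching the reported figures. The width of that interval shows how little a 34-function subset can resolve.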

Results

Code Similarity (CodeBLEU)

The 4B Dart-specialized model achieves 71.3 CodeBLEU (95% CI: 65.5–77.1) on a 73-function Dart test set, close to Qwen3-Coder-Plus ($\sim$480B; CodeBLEU 73.1, 95% CI: 67.4–78.8), with overlapping intervals.
Adding Swift to the 4B model reduces CodeBLEU by 2.1 points; at 8B, adding Swift improves CodeBLEU by 8.7 points, indicating a model-capacity threshold effect.
The 4B specialized model consistently outpaces the base model (66.1).
Larger models (up to $\sim$ 1T) achieve higher scores (Qwen3-Max: 77.4), but small, focused models approach this upper bound at a fraction of the compute budget.
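
For readers unfamiliar with how a CodeBLEU score is assembled: in its original definition it is a weighted sum of four component scores, namely n-gram match, keyword-weighted n-gram match, AST match, and data-flow match. The sketch below shows that combination with equal weights; both the equal weighting and the component values are illustrative assumptions, since the per-component scores and weights used in the paper are not given here.

```python
def codebleu(ngram, weighted_ngram, ast_match, dataflow_match,
             weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four CodeBLEU components, each in [0, 1]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (ngram, weighted_ngram, ast_match, dataflow_match)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores must lie in [0, 1]")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for one decompiled function.
score = codebleu(ngram=0.60, weighted_ngram=0.64, ast_match=0.80, dataflow_match=0.76)
print(f"CodeBLEU = {100 * score:.1f}")
```

Because the structural components (AST and data flow) carry half the weight in this setting, output that renames identifiers but preserves program structure can still score well, which is one reason CodeBLEU is only a proxy for readability.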

Compilability (compile@k)

For 34 natural Dart functions, the specialized 4B model reaches 79.4% compile@5 (95% CI: 63.2–89.7), versus 64.7% (47.9–78.5) for the base model. The gain of 14.7 percentage points is suggestive but not statistically significant at the 0.05 level.
At $k=1$, compile rates are uniformly lower, and CIs across models overlap.
Notably, the specialized models' higher compilation rates come with slightly lower CodeBLEU on the compiled-only samples, which suggests that the additional compiling outputs come from harder functions.
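
The significance statement for compile@5 can be checked with a pooled two-proportion z-test. The sketch below assumes success counts of 27/34 for the specialized model and 22/34 for the base model; these counts are inferred from the reported 79.4% and 64.7% and are not stated explicitly in the text.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Inferred counts: 27/34 (specialized 4B) vs. 22/34 (base model).
z, p_value = two_proportion_z_test(27, 34, 22, 34)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these counts the test gives z of roughly 1.35 and a two-sided p-value near 0.18, consistent with the finding that the improvement is not significant at 0.05 on a sample of this size.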

Qualitative/Idiomatic Output

Preliminary assessment finds the specialized models generate more idiomatic Dart, leveraging null-safety, expressive identifiers, and language-specific control flow, unlike baseline model outputs which often lack semantic structure or meaningful names.

Analysis: Cross-language Transfer vs. Synthetic Augmentation

A central empirical result is the capacity-dependent utility of cross-lingual data:

At 4B scale, augmenting with Swift lowers Dart performance, which the analysis attributes to representational bottlenecks and cross-lingual interference.
At 8B, adding Swift helps, suggesting that the larger model can exploit shared idioms without significant interference.

The likely contributors to this pattern are: (1) capacity saturation in smaller models, (2) negative interference from ambiguous language features, and (3) confounds introduced by mismatched optimization levels in assembly between Dart (AOT) and Swift (-O0).

Limitations

Semantic Faithfulness: Only syntactic correctness (compile@k) is measured; no unit-test-based semantic evaluation (pass@k) is performed.
Optimization Confound: Disparity in binary optimization between Dart and Swift is a confounding factor in transfer experiments.
Human Evaluation: Idiomaticity claims are not formally substantiated (single reviewer, lacking inter-rater calibration).
Reproducibility: Some hyperparameter choices are not fully documented.

Implications and Future Directions

Practical Decompilation: Small models enable performant, compute-efficient decompilation for modern high-level languages, crucial for reverse engineering, vulnerability analysis, and software archaeology.
Scalable Training: Data augmentation strategies must account for scale-dependent transfer/interference effects when combining multiple programming languages.
New Evaluation Protocols: Full semantic unit tests (pass@k) and matched binary optimizations are necessary for unequivocal model comparison.
Expansion to Other Architectures: Generalization to ARM and RISC-V and mapping effects of various compilation flags warrant further study.

Conclusion

This work indicates that domain-specialized small LLMs (4B) are competitive with foundation models over two orders of magnitude larger for Dart decompilation from optimized binaries. Critically, the effectiveness of cross-lingual augmentation appears to hinge on sufficient model capacity and aligned optimization strategies; otherwise, interference outweighs transfer. These findings have direct implications for building practical, efficient decompilers for contemporary software ecosystems and highlight the need for nuanced evaluation and tailoring of code LLMs for emerging languages and use cases.