LLM4Decompile: Decompiling Binary Code with Large Language Models (2403.05286v3)

Published 8 Mar 2024 in cs.PL and cs.CL

Abstract: Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in LLMs, we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile

Decompiling Binary Code with LLMs: Introducing LLM4Decompile

Introduction to Decompilation and LLMs

Decompilation, the process of translating binary or bytecode back into human-readable source code, poses significant challenges, particularly in terms of preserving details like variable names and structural elements such as loops. Meanwhile, the advancement in LLMs for programming tasks suggests their potential utility in decompilation. As a pioneering effort, we present LLM4Decompile, the first open-source LLM specifically designed for decompilation, pre-trained on a substantial dataset of C source code and corresponding assembly instructions. Additionally, we introduce Decompile-Eval, a novel benchmark focusing on evaluating decompiled code based on re-compilability and re-executability, crucial indicators of a successful decompilation that were previously overlooked.

Key Challenges in Decompilation

Traditional decompilation tools often struggle to generate code that matches the original source in readability and structure. This stems from the inherent difficulty of reversing compilation, which irreversibly discards information such as variable names, type annotations, and high-level control-flow constructs. Transformer-based models have shown some success on these problems, but their limited size and lack of public availability have constrained their effectiveness and broader adoption. Furthermore, the absence of a standard benchmark for evaluating decompilation has impeded coherent progress in this field.

Introducing LLM4Decompile and Decompile-Eval

To remedy these limitations, we release LLM4Decompile, a family of pre-trained LLMs ranging from 1.3B to 33B parameters tailored for decompilation. These models are trained on 4 billion tokens of C source code paired with the corresponding assembly compiled at various optimization levels. Alongside the models, we propose Decompile-Eval, the first benchmark to evaluate decompiled code by re-compilability and re-executability, establishing an evaluation framework for decompilation that prioritizes program semantics.
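The two Decompile-Eval metrics can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the function names `check_recompile_and_execute` and `aggregate_rates`, and the convention of appending a C `main()` with runtime assertions to the candidate, are assumptions made for the example.

```python
import os
import subprocess
import tempfile


def check_recompile_and_execute(c_source: str, test_main: str,
                                compiler: str = "gcc"):
    """Return (recompilable, re_executable) for one decompiled function.

    `test_main` is a C snippet containing a main() with assertions that
    exercise the decompiled function (a hypothetical harness format).
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(c_source + "\n" + test_main)
        # Re-compilability: does the decompiled code compile at all?
        compiled = subprocess.run([compiler, src, "-o", binary],
                                  capture_output=True).returncode == 0
        if not compiled:
            return False, False
        # Re-executability: do the harness assertions pass at runtime?
        ran = subprocess.run([binary], capture_output=True,
                             timeout=10).returncode == 0
        return True, ran


def aggregate_rates(results):
    """Benchmark-level rates from per-sample (recompiled, ran) booleans."""
    n = len(results)
    recompilability = sum(1 for c, _ in results if c) / n
    re_executability = sum(1 for _, r in results if r) / n
    return recompilability, re_executability
```

The key design point reflected here is that the two metrics are ordered: a sample that fails to compile cannot be re-executable, so re-executability is always bounded above by re-compilability.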

Evaluation and Results

Our models demonstrate a significant improvement over existing decompilation approaches, with our 6B LLM4Decompile achieving 87% re-compilability and 21% re-executability on Decompile-Eval. These figures indicate an understanding of both the syntax and the semantics of the code, and significantly surpass GPT-4 on both metrics.

Methodology

We compile C code into assembly with the GCC compiler at different optimization levels, and fine-tune the DeepSeek-Coder model on the resulting assembly-source pairs. Our evaluation on Decompile-Eval assesses both the syntactic integrity and the semantic accuracy of the decompiled code. Our experiments show that direct sequence-to-sequence prediction, translating assembly straight to source, enhances the model's decompilation capabilities significantly more than the alternative training strategies we compared.
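The pair-construction step above can be sketched as follows, assuming GCC is available on the PATH. The `strip_directives` cleaning rule is illustrative: the exact assembly preprocessing used in LLM4Decompile's pipeline may differ.

```python
import subprocess

# Optimization levels used to produce assembly variants of each source file.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]


def strip_directives(asm_text: str) -> str:
    """Drop assembler directives (.file, .globl, ...) from GCC output,
    keeping instructions and local labels such as ".L2:" which carry
    control flow. An illustrative cleaning rule, not the paper's exact one."""
    kept = []
    for line in asm_text.splitlines():
        s = line.strip()
        if not s:
            continue
        if s.startswith(".") and not s.endswith(":"):
            continue  # directive, not a label
        kept.append(line)
    return "\n".join(kept)


def make_training_pairs(c_path: str):
    """Yield (assembly, source) training pairs, one per optimization level."""
    with open(c_path) as f:
        source = f.read()
    for opt in OPT_LEVELS:
        asm_path = c_path.rsplit(".", 1)[0] + f".{opt.lstrip('-')}.s"
        # gcc -S emits textual assembly instead of an object file.
        subprocess.run(["gcc", "-S", opt, c_path, "-o", asm_path], check=True)
        with open(asm_path) as f:
            yield strip_directives(f.read()), source
```

Emitting one pair per optimization level exposes the model to the same semantics under very different instruction sequences, which is what makes decompiling optimized binaries tractable.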

Theoretical and Practical Implications

Our research establishes a foundation for the application of LLMs in decompilation, significantly advancing the state-of-the-art. The introduction of Decompile-Eval as a benchmark directs future research towards more accurately assessing the practical utility of decompiled code. On a broader level, this work illuminates the path for applying large-scale LLMs to complex reverse-engineering tasks, potentially transforming practices in software maintenance, security analysis, and intellectual property evaluation.

Future Directions

The current scope is limited to C language and x86 architecture, focusing on decompiling single functions without considering external dependencies and cross-references. Future work could extend LLM4Decompile's methodology to other programming languages and architectural platforms, and address the complexities of decompiling entire software applications. This would encompass developing models that can accurately interpret and reconstruct the high-level constructs of complex software systems.

Conclusion

LLM4Decompile represents the forefront of leveraging LLMs for the task of decompilation, addressing both the syntactic and semantic challenges inherent to this process. The novel benchmark Decompile-Eval sets a new standard for evaluating decompilation tools, focusing on the practical usability of decompiled code. This work not only enhances the capabilities in decompilation but also opens new avenues for future research in applying LLMs to reverse engineering and code analysis tasks.

References (28)
  1. Slade: A portable small language model decompiler for optimized assembler. CoRR, abs/2305.12520.
  2. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, pages 353–368. USENIX Association.
  3. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  4. Modeling black-box components with probabilistic synthesis. In GPCE ’20: Proceedings of the 19th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, Virtual Event, USA, November 16-17, 2020, pages 1–14. ACM.
  5. ANGHABENCH: A suite with one million compilable C benchmarks for code-size reduction. In IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021, Seoul, South Korea, February 27 - March 3, 2021, pages 378–390. IEEE.
  6. Ghidra. 2024. Ghidra software reverse engineering framework.
  7. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196.
  8. Hex-Rays. 2024. Ida pro: a cross-platform multi-processor disassembler and debugger.
  9. Iman Hosseini and Brendan Dolan-Gavitt. 2022. Beyond the C: retargetable decompilation using neural machine translation. CoRR, abs/2212.08950.
  10. Nova+: Generative language models for binaries. CoRR, abs/2311.13721.
  11. Using recurrent neural networks for decompilation. In 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, pages 346–356. IEEE Computer Society.
  12. DIRE: A neural approach to decompiled identifier naming. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019, pages 628–639. IEEE.
  13. Compiler validation via equivalence modulo inputs. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, pages 216–226. ACM.
  14. Thomas Lippincott. 2020. Starcoder: A general neural ensemble technique to support traditional scholarship, illustrated with a study of the post-atlantic slave trade. In 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, Ottawa, Canada, July 20-25, 2020, Conference Abstracts.
  15. Zhibo Liu and Shuai Wang. 2020. How far we have come: testing decompilation correctness of C decompilers. In ISSTA ’20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 475–487. ACM.
  16. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  17. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  18. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
  19. Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  20. Richard M Stallman et al. 2003. Using the gnu compiler collection. Free Software Foundation, 4(02).
  21. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 180–182. IEEE Computer Society.
  22. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  23. A new algorithm for identifying loops in decompilation. In Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings, volume 4634 of Lecture Notes in Computer Science, pages 170–183. Springer.
  24. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
  25. Refining decompiled C code with large language models. CoRR, abs/2310.06530.
  26. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, page 1–10, New York, NY, USA. Association for Computing Machinery.
  27. Lmpa: Improving decompilation by synergy of large language model and program analysis. CoRR, abs/2306.02546.
  28. An extensive study on pre-trained models for program understanding and generation. In ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18 - 22, 2022, pages 39–51. ACM.
Authors (4)
  1. Hanzhuo Tan
  2. Qi Luo
  3. Jing Li
  4. Yuqun Zhang