LLM4Decompile: Decompiling Binary Code with Large Language Models (2403.05286v3)

Published 8 Mar 2024 in cs.PL and cs.CL

Abstract: Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in LLMs, we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile

Decompiling Binary Code with LLMs: Introducing LLM4Decompile

Introduction to Decompilation and LLMs

Decompilation, the process of translating binary or bytecode back into human-readable source code, poses significant challenges, particularly in terms of preserving details like variable names and structural elements such as loops. Meanwhile, the advancement in LLMs for programming tasks suggests their potential utility in decompilation. As a pioneering effort, we present LLM4Decompile, the first open-source LLM specifically designed for decompilation, pre-trained on a substantial dataset of C source code and corresponding assembly instructions. Additionally, we introduce Decompile-Eval, a novel benchmark focusing on evaluating decompiled code based on re-compilability and re-executability, crucial indicators of a successful decompilation that were previously overlooked.

Key Challenges in Decompilation

Traditional decompilation tools often struggle to generate code that matches the original source in readability and structure. This stems from the inherent difficulty of reversing compilation, which irreversibly discards information such as variable names, type annotations, and high-level control-flow constructs. Transformer-based models have shown some success on these problems, but their limited size and lack of public availability have constrained their effectiveness and broader adoption. Furthermore, the absence of a standard benchmark for evaluating decompilation has impeded coherent progress in this field.

Introducing LLM4Decompile and Decompile-Eval

To remedy these limitations, we release LLM4Decompile, a family of pre-trained LLMs ranging from 1.3B to 33B parameters tailored for decompilation. These models are trained on 4 billion tokens of C source code paired with the corresponding assembly compiled at various optimization levels. Alongside the models, we propose Decompile-Eval, the first benchmark to evaluate decompiled code by re-compilability and re-executability, establishing an evaluation framework for decompilation that prioritizes program semantics.
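The two Decompile-Eval metrics can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the function names `check_recompile_and_execute` and `aggregate_rates`, and the convention of appending a C `main()` with runtime assertions to the candidate, are assumptions made for the example.

```python
import os
import subprocess
import tempfile


def check_recompile_and_execute(c_source: str, test_main: str,
                                compiler: str = "gcc"):
    """Return (recompilable, re_executable) for one decompiled function.

    `test_main` is a C snippet containing a main() with assertions that
    exercise the decompiled function (a hypothetical harness format).
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(c_source + "\n" + test_main)
        # Re-compilability: does the decompiled code compile at all?
        compiled = subprocess.run([compiler, src, "-o", binary],
                                  capture_output=True).returncode == 0
        if not compiled:
            return False, False
        # Re-executability: do the harness assertions pass at runtime?
        ran = subprocess.run([binary], capture_output=True,
                             timeout=10).returncode == 0
        return True, ran


def aggregate_rates(results):
    """Benchmark-level rates from per-sample (recompiled, ran) booleans."""
    n = len(results)
    recompilability = sum(1 for c, _ in results if c) / n
    re_executability = sum(1 for _, r in results if r) / n
    return recompilability, re_executability
```

The key design point reflected here is that the two metrics are ordered: a sample that fails to compile cannot be re-executable, so re-executability is always bounded above by re-compilability.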

Evaluation and Results

Our models demonstrate a significant improvement over existing decompilation approaches, with our 6B LLM4Decompile achieving 87% re-compilability and 21% re-executability on Decompile-Eval. These figures indicate an understanding of both the syntax and the semantics of the code, and significantly surpass GPT-4 on both metrics.

Methodology

We compile C code into assembly with the GCC compiler at different optimization levels, and fine-tune the DeepSeek-Coder model on the resulting assembly-source pairs. Our evaluation on Decompile-Eval assesses both the syntactic integrity and the semantic accuracy of the decompiled code. Our experiments show that direct sequence-to-sequence prediction, translating assembly straight to source, enhances the model's decompilation capabilities significantly more than the alternative training strategies we compared.
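The pair-construction step above can be sketched as follows, assuming GCC is available on the PATH. The `strip_directives` cleaning rule is illustrative: the exact assembly preprocessing used in LLM4Decompile's pipeline may differ.

```python
import subprocess

# Optimization levels used to produce assembly variants of each source file.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]


def strip_directives(asm_text: str) -> str:
    """Drop assembler directives (.file, .globl, ...) from GCC output,
    keeping instructions and local labels such as ".L2:" which carry
    control flow. An illustrative cleaning rule, not the paper's exact one."""
    kept = []
    for line in asm_text.splitlines():
        s = line.strip()
        if not s:
            continue
        if s.startswith(".") and not s.endswith(":"):
            continue  # directive, not a label
        kept.append(line)
    return "\n".join(kept)


def make_training_pairs(c_path: str):
    """Yield (assembly, source) training pairs, one per optimization level."""
    with open(c_path) as f:
        source = f.read()
    for opt in OPT_LEVELS:
        asm_path = c_path.rsplit(".", 1)[0] + f".{opt.lstrip('-')}.s"
        # gcc -S emits textual assembly instead of an object file.
        subprocess.run(["gcc", "-S", opt, c_path, "-o", asm_path], check=True)
        with open(asm_path) as f:
            yield strip_directives(f.read()), source
```

Emitting one pair per optimization level exposes the model to the same semantics under very different instruction sequences, which is what makes decompiling optimized binaries tractable.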

Theoretical and Practical Implications

Our research establishes a foundation for the application of LLMs in decompilation, significantly advancing the state-of-the-art. The introduction of Decompile-Eval as a benchmark directs future research towards more accurately assessing the practical utility of decompiled code. On a broader level, this work illuminates the path for applying large-scale LLMs to complex reverse-engineering tasks, potentially transforming practices in software maintenance, security analysis, and intellectual property evaluation.

Future Directions

The current scope is limited to C language and x86 architecture, focusing on decompiling single functions without considering external dependencies and cross-references. Future work could extend LLM4Decompile's methodology to other programming languages and architectural platforms, and address the complexities of decompiling entire software applications. This would encompass developing models that can accurately interpret and reconstruct the high-level constructs of complex software systems.

Conclusion

LLM4Decompile represents the forefront of leveraging LLMs for the task of decompilation, addressing both the syntactic and semantic challenges inherent to this process. The novel benchmark Decompile-Eval sets a new standard for evaluating decompilation tools, focusing on the practical usability of decompiled code. This work not only enhances the capabilities in decompilation but also opens new avenues for future research in applying LLMs to reverse engineering and code analysis tasks.

References (28)
  1. Slade: A portable small language model decompiler for optimized assembler. CoRR, abs/2305.12520.
  2. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, pages 353–368. USENIX Association.
  3. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  4. Modeling black-box components with probabilistic synthesis. In GPCE ’20: Proceedings of the 19th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, Virtual Event, USA, November 16-17, 2020, pages 1–14. ACM.
  5. ANGHABENCH: A suite with one million compilable C benchmarks for code-size reduction. In IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021, Seoul, South Korea, February 27 - March 3, 2021, pages 378–390. IEEE.
  6. Ghidra. 2024. Ghidra software reverse engineering framework.
  7. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196.
  8. Hex-Rays. 2024. Ida pro: a cross-platform multi-processor disassembler and debugger.
  9. Iman Hosseini and Brendan Dolan-Gavitt. 2022. Beyond the C: retargetable decompilation using neural machine translation. CoRR, abs/2212.08950.
  10. Nova+: Generative language models for binaries. CoRR, abs/2311.13721.
  11. Using recurrent neural networks for decompilation. In 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, pages 346–356. IEEE Computer Society.
  12. DIRE: A neural approach to decompiled identifier naming. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019, pages 628–639. IEEE.
  13. Compiler validation via equivalence modulo inputs. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, pages 216–226. ACM.
  14. Thomas Lippincott. 2020. Starcoder: A general neural ensemble technique to support traditional scholarship, illustrated with a study of the post-atlantic slave trade. In 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, Ottawa, Canada, July 20-25, 2020, Conference Abstracts.
  15. Zhibo Liu and Shuai Wang. 2020. How far we have come: testing decompilation correctness of C decompilers. In ISSTA ’20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 475–487. ACM.
  16. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  17. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  18. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
  19. Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  20. Richard M Stallman et al. 2003. Using the gnu compiler collection. Free Software Foundation, 4(02).
  21. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 180–182. IEEE Computer Society.
  22. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  23. A new algorithm for identifying loops in decompilation. In Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings, volume 4634 of Lecture Notes in Computer Science, pages 170–183. Springer.
  24. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
  25. Refining decompiled C code with large language models. CoRR, abs/2310.06530.
  26. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, page 1–10, New York, NY, USA. Association for Computing Machinery.
  27. Lmpa: Improving decompilation by synergy of large language model and program analysis. CoRR, abs/2306.02546.
  28. An extensive study on pre-trained models for program understanding and generation. In ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18 - 22, 2022, pages 39–51. ACM.
Authors (4)
  1. Hanzhuo Tan
  2. Qi Luo
  3. Jing Li
  4. Yuqun Zhang