An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding (2504.21803v1)

Published 30 Apr 2025 in cs.SE and cs.CR

Abstract: Binary code analysis plays a pivotal role in the field of software security and is widely used in tasks such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code, reverse engineers face significant challenges in understanding binary code due to the lack of intuitive semantic information. Although traditional reverse tools can convert binary code into C-like pseudo code, the lack of code comments and symbolic information such as function names still makes code understanding difficult. In recent years, two groups of techniques have shown promising prospects: (1) Deep learning-based techniques have demonstrated competitive results in tasks related to binary code understanding, furthermore, (2) LLMs have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This has left participants wondering about the capabilities of LLMs in binary code understanding. To this end, this work proposes a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios, which covers two key binary code understanding tasks, i.e., function name recovery and binary code summarization. To more comprehensively evaluate, we include binaries with multiple target architectures as well as different optimization options. We gain valuable insights into the capabilities and limitations through extensive empirical studies of popular LLMs using our benchmark. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of the LLMs in advancing the field of binary code understanding, and provide new directions for binary code analysis techniques.

Summary

  • The paper demonstrates that LLMs, especially CodeLlama-34b and ChatGPT, outperform traditional deep learning models in function name recovery and binary code summarization via zero-shot prompting.
  • It employs a benchmark dataset built from 12 real-world C projects: 2,000 functions compiled under 7 settings (14,000 samples in total) across diverse architectures and optimization levels, evaluated with metrics such as token-level F1 and Rouge-L.
  • The study finds that fine-tuning and carefully selected few-shot prompts significantly enhance performance, underscoring the need for domain-specific adaptations in binary analysis.

This paper presents an empirical study on the effectiveness of LLMs for understanding binary code, focusing on two key tasks: function name recovery and binary code summarization. Binary code analysis is crucial in software security, but understanding stripped binaries is challenging due to the lack of semantic information such as function names and comments. While traditional tools and deep learning models have been applied, their generalization capabilities are often limited. LLMs, with their strong natural language and code understanding abilities, offer a new avenue for tackling this problem.

To evaluate LLMs, the authors constructed a benchmark dataset derived from 12 real-world C projects, covering various domains (crypto, compress, network, etc.). The dataset includes aligned source code, decompiled pseudo code (using IDA Pro), and natural language summaries. Binaries were generated for multiple target architectures (x64, x86, ARM, MIPS) and compiler optimization levels (O0-O3 for x64), and then stripped of symbols to simulate real-world scenarios. Ground truth function names were extracted from source code, and high-quality natural language summaries were generated using ChatGPT, validated by domain experts. The final benchmark comprises 14,000 pieces of data (2,000 functions across 7 compilation settings). Care was taken to ensure the benchmark data did not overlap with the training data of the evaluated LLMs.
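The seven compilation settings described above (x64 at O0-O3, plus x86, ARM, and MIPS) can be sketched as a small build-planning helper. The cross-compiler names and the single default optimization level assumed for the non-x64 targets are illustrative guesses, not details confirmed by the paper:

```python
# Sketch of the 7-setting compilation matrix described above.
# Cross-compiler names and the -O2 default for non-x64 targets are
# illustrative assumptions, not details confirmed by the paper.

def build_settings():
    settings = []
    # x64 is compiled at every optimization level O0-O3
    for opt in ("O0", "O1", "O2", "O3"):
        settings.append(("x64", "gcc", opt))
    # the remaining architectures use a single (assumed) default level
    for arch, cc in (("x86", "gcc -m32"),
                     ("ARM", "arm-linux-gnueabi-gcc"),
                     ("MIPS", "mips-linux-gnu-gcc")):
        settings.append((arch, cc, "O2"))
    return settings

def commands(src, out_dir="bin"):
    # compile, then strip symbols to simulate real-world stripped binaries
    cmds = []
    for arch, cc, opt in build_settings():
        out = f"{out_dir}/{arch}_{opt}.elf"
        cmds.append(f"{cc} -{opt} -o {out} {src}")
        cmds.append(f"strip {out}")
    return cmds
```

With 2,000 functions per setting, the 7 settings account for the 14,000 benchmark samples reported.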

The evaluation included eight code-domain LLMs (e.g., CodeLlama, WizardCoder, DeepSeek-Coder) and eight general-domain LLMs (e.g., ChatGPT, Llama-2, Mistral, Vicuna). Four deep learning-based expert models (SymLM, NER, BinT5, HexT5) were included as baselines. The LLMs were evaluated using zero-shot prompting, adopting a role-play format to instruct them as reverse engineers. For function name recovery, token-level Precision, Recall, and F1-score were used as metrics. For binary code summarization, BLEU-4, METEOR, and Rouge-L were used. The evaluation was conducted on a machine equipped with 8 NVIDIA RTX A6000 GPUs.
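The role-play zero-shot prompting is described only at a high level; a hedged sketch of what such a prompt might look like follows, where the exact wording is an assumption and not quoted from the paper:

```python
# Hypothetical zero-shot, role-play prompt in the spirit described
# above; the exact wording used in the paper is not reproduced here.

def zero_shot_prompt(pseudo_code, task="function name recovery"):
    if task == "function name recovery":
        instruction = ("Suggest a meaningful name for the function "
                       "in the decompiled pseudo code below.")
    else:
        instruction = ("Summarize in one sentence what the function "
                       "in the decompiled pseudo code below does.")
    return ("You are a professional reverse engineer analyzing a "
            "stripped binary.\n"
            f"{instruction}\n\n"
            f"Pseudo code:\n{pseudo_code}\n")
```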

Evaluation Results:

  • Function Name Recovery (RQ1): CodeLlama-34b performed best among LLMs across all architectures, achieving F1-scores around 26-30. Code-domain LLMs generally performed slightly better than general-domain LLMs. Performance was notably better on the MIPS architecture compared to x64, x86, and ARM, possibly due to its simpler instruction set. The impact of compiler optimization levels (O0-O3) on performance was minimal. The DL-based methods, particularly NER, performed better than the worst-performing LLMs but significantly worse than the top LLMs, showing limited generalization compared to LLMs' zero-shot capabilities. Inference time for LLMs ranged from 1 to 10 seconds per data point depending on model size, while DL models were much faster (around 0.03-0.28 seconds).
  • Binary Code Summarization (RQ2): ChatGPT achieved the best performance among all models across all architectures and optimization levels, with Rouge-L scores around 21-23. General-domain LLMs generally performed significantly better than code-domain LLMs, attributed to their stronger natural language generation capabilities. Similar to function name recovery, performance was best on MIPS. Optimization levels had minimal impact. DL-based summarization models (BinT5, HexT5) performed substantially worse than LLMs. Inference time for summarization was generally longer than for name recovery, ranging from 3.4 to 49 seconds for LLMs.
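The token-level precision, recall, and F1 reported above can be sketched as follows; the exact tokenization rules (splitting on underscores and camelCase before comparing token sets) are a common convention in this line of work and an assumption here, not taken verbatim from the paper:

```python
import re

def name_tokens(name):
    # split snake_case and camelCase into lowercase tokens
    parts = re.split(r'[_\W]+', name)
    tokens = []
    for p in parts:
        tokens += re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', p)
    return [t.lower() for t in tokens if t]

def prf1(pred, truth):
    # token-level precision/recall/F1 between predicted and true names
    p_toks, t_toks = name_tokens(pred), name_tokens(truth)
    common = set(p_toks) & set(t_toks)
    prec = len(common) / len(p_toks) if p_toks else 0.0
    rec = len(common) / len(t_toks) if t_toks else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, predicting `read_file_buffer` against ground truth `readFile` shares the tokens `read` and `file`, giving precision 2/3, recall 1.0, and F1 0.8.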

Factors Affecting Performance (RQ3):

  • Few-shot Prompts: Introducing a few examples in the prompt generally improved performance for both tasks, especially for function name recovery. However, it also increased prompt length, potentially leading to truncation issues for models with smaller context windows.
  • Pseudo Code Length: For function name recovery, performance was best for moderate pseudo code lengths (400-2000 tokens), dropping for very short or very long inputs. For binary code summarization, performance slowly decreased as pseudo code length increased, likely due to increasing complexity and potential truncation.
  • Symbol Information Length: Longer symbol information (strings, identifiers) in the pseudo code significantly improved performance for both tasks, highlighting its crucial role in helping LLMs understand semantics.
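The trade-off noted above between few-shot examples and context-window truncation can be sketched as a prompt builder that stops adding examples once a crude, character-based length budget would be exceeded; the budget and formatting below are illustrative assumptions, not the paper's setup:

```python
# Illustrative few-shot prompt builder: examples are dropped when a
# crude character budget (standing in for the model's token context
# window) would be exceeded.

def few_shot_prompt(query_code, examples, char_budget=4000):
    header = ("You are a professional reverse engineer. "
              "Given decompiled pseudo code, recover the function name.\n\n")
    tail = f"Pseudo code:\n{query_code}\nFunction name:"
    shots = []
    for code, name in examples:
        shot = f"Pseudo code:\n{code}\nFunction name: {name}\n\n"
        candidate = header + "".join(shots) + shot + tail
        if len(candidate) > char_budget:
            break  # keep the prompt within the (assumed) context budget
        shots.append(shot)
    return header + "".join(shots) + tail
```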

Fine-Tuning (RQ4):

Fine-tuning selected 7b-15b LLMs on a binary code dataset (from GNU projects) led to considerable performance improvements for both function name recovery and binary code summarization. The general-domain Llama-2-13b-chat-hf showed the most significant gains after fine-tuning. This indicates that injecting binary domain knowledge is effective in enhancing LLM capabilities for these tasks.
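One common way to inject the binary-domain knowledge described above is supervised instruction tuning on (pseudo code, label) pairs; the record layout below is a typical instruction-tuning format and an assumption for illustration, not the paper's exact data format:

```python
# Sketch of turning (pseudo code, ground-truth label) pairs into
# instruction-tuning records; the field names and instruction text
# are illustrative assumptions, not the paper's exact format.

def to_sft_records(pairs, task="function name recovery"):
    instruction = ("Recover the name of this function from its "
                   "decompiled pseudo code."
                   if task == "function name recovery" else
                   "Summarize this decompiled pseudo code.")
    return [
        {"instruction": instruction,
         "input": pseudo_code,
         "output": label}
        for pseudo_code, label in pairs
    ]
```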

Case Study (RQ5):

Case studies analyzing stripped malware binaries (splinter, TrojanCockroach) demonstrated that ChatGPT can effectively assist reverse engineers by generating functional summaries and suggesting function names from decompiled pseudo code. Incorporating calling context information significantly improved the quality of the analysis and predictions.
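The calling-context enrichment mentioned above can be sketched as collecting a function's callers and callees from a call graph and prepending them to the prompt; the `{function: [callees, ...]}` call-graph representation is an assumption for illustration:

```python
# Hedged sketch: given a call graph as {function: [callees, ...]},
# gather the callers and callees of a target to enrich the prompt.

def calling_context(call_graph, target):
    callees = call_graph.get(target, [])
    callers = [f for f, outs in call_graph.items() if target in outs]
    return {"callers": callers, "callees": callees}

def context_prompt(pseudo_code, ctx):
    # prepend caller/callee names so the model sees how the function is used
    return (f"Callers: {', '.join(ctx['callers']) or 'none'}\n"
            f"Callees: {', '.join(ctx['callees']) or 'none'}\n\n"
            f"Pseudo code:\n{pseudo_code}\n")
```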

Discussions and Limitations:

The paper highlights the potential of LLMs in binary code understanding but also points out areas for future work: developing domain-specific LLMs for binary code, extending context windows to handle complex code, enhancing understanding without symbolic information (e.g., via static/dynamic analysis), integrating multi-modal information, improving transfer learning across platforms, and increasing robustness against obfuscation. Limitations include the potential inadequacy of current NLP metrics for evaluating binary code summaries, the lack of evaluation on obfuscated binaries, and the focus primarily on individual function understanding rather than inter-function relationships.

In conclusion, LLMs show promising capabilities for binary code understanding tasks like function name recovery and summarization, often outperforming traditional DL methods in zero-shot scenarios. Factors like few-shot prompting, pseudo code length, and symbol information influence performance, and fine-tuning on binary data can significantly enhance results. While challenges remain, particularly with complex or obfuscated binaries, LLMs are a valuable tool with potential to greatly improve the efficiency of binary analysis for reverse engineers.
