- The paper demonstrates that LLMs can obfuscate assembly code via dead code insertion, register substitution, and control flow change, achieving Delta Entropy values in the 10-20% range that the authors associate with effective obfuscation.
- The methodology uses the MetamorphASM dataset of 328,200 samples and employs zero-shot and few-shot (in-context) prompting to evaluate obfuscation performance.
- The findings reveal cybersecurity risks and underscore the need for advanced detection strategies against LLM-generated malware obfuscation.
A Systematic Analysis of LLMs in Assembly Code Obfuscation
The paper "Can LLMs Obfuscate Code? A Systematic Analysis of LLMs into Assembly Code Obfuscation" systematically explores the potential of LLMs in generating obfuscations for assembly code. The core motivation is to discern whether LLMs can serve as tools for malware authors to obfuscate code, posing significant challenges for cybersecurity defenses, particularly antivirus engines. This paper introduces the MetamorphASM benchmark, featuring the MetamorphASM Dataset (MAD) along with three primary code obfuscation techniques: insertion of dead code, register substitution, and control flow change. These methodologies are critical in obfuscating assembly-level code, traditionally a labor-intensive process requiring significant expertise in low-level programming.
Objectives and Dataset
The researchers developed MAD, comprising 328,200 obfuscated assembly code samples. The dataset provides a comprehensive platform for evaluating LLMs on their ability to generate obfuscated code, addressing a notable gap in resources tailored for assembly code transformation and obfuscation analysis. It is structured both to assess the resilience of existing code detection mechanisms and to evaluate LLMs' generative abilities at the assembly level. Three forms of obfuscation are considered (a minimal sketch illustrating each follows the list):
- Dead Code Insertion: Introducing irrelevant code segments that do not alter program functionality, complicating static analysis techniques.
- Register Substitution: Altering register usages to obscure the underlying code structure, maintaining semantic equivalence.
- Control Flow Change: Rearranging instruction blocks and chaining them with jumps so that execution order is preserved while the linear layout is scrambled, disrupting conventional top-to-bottom code reading.
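To make the three techniques concrete, here is a minimal Python sketch that applies each of them to a toy x86 snippet. The snippet, helper names, and transformation logic are illustrative assumptions for exposition; they are not the paper's implementation, nor data from MAD.

```python
import random

# Toy x86 snippet as a list of instruction strings (hypothetical example).
ORIGINAL = [
    "mov eax, 1",
    "add eax, ebx",
    "mov ecx, eax",
]

# Semantically inert instructions usable as dead code.
DEAD_CODE = ["nop", "xchg eax, eax", "lea ebx, [ebx]"]

def insert_dead_code(code, rate=0.5):
    """Dead code insertion: sprinkle inert instructions between real ones."""
    out = []
    for ins in code:
        out.append(ins)
        if random.random() < rate:
            out.append(random.choice(DEAD_CODE))
    return out

def substitute_registers(code, mapping=None):
    """Register substitution: rename registers consistently (naive textual
    replace; a real tool would parse operands properly)."""
    mapping = mapping or {"eax": "esi", "ebx": "edi"}
    out = []
    for ins in code:
        for old, new in mapping.items():
            ins = ins.replace(old, new)
        out.append(ins)
    return out

def change_control_flow(code):
    """Control flow change: wrap each instruction in a labeled block that
    jumps to the next label, then shuffle every block except the entry.
    The jump chain preserves execution order despite the scrambled layout."""
    blocks = [[f"L{i}:", f"    {ins}", f"    jmp L{i + 1}"]
              for i, ins in enumerate(code)]
    blocks.append([f"L{len(code)}:", "    ret"])
    entry, rest = blocks[0], blocks[1:]
    random.shuffle(rest)
    return [line for block in [entry] + rest for line in block]

# Example: stack all three transformations on the toy snippet.
print("\n".join(change_control_flow(substitute_registers(insert_dead_code(ORIGINAL)))))
```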
Evaluation of LLMs
The paper evaluates several prominent LLMs, including proprietary models like GPT-3.5 and GPT-4o-mini and open-source alternatives like CodeLlama, Starcoder, and CodeT5. The assessment uses zero-shot and few-shot prompting alongside in-context learning to gauge the models' ability to generate valid obfuscation patterns. Delta Entropy, an information-theoretic measure of how much the code's statistical structure changes, quantifies the degree of obfuscation, while Cosine Similarity measures the structural similarity between original and obfuscated code.
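As a rough illustration of how such metrics can be computed, the sketch below treats Delta Entropy as the relative change in Shannon entropy of the code's character distribution and computes Cosine Similarity over token-frequency vectors. This featurization is an assumption for exposition; the paper's exact formulation may differ.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy (bits per character) of the code's character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def delta_entropy(original, obfuscated):
    """Relative entropy change, expressed as a percentage of the original."""
    h_orig = shannon_entropy(original)
    return abs(shannon_entropy(obfuscated) - h_orig) / h_orig * 100.0

def cosine_similarity(original, obfuscated):
    """Cosine similarity between token-frequency vectors of the two snippets."""
    a, b = Counter(original.split()), Counter(obfuscated.split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```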
Results and Implications
Results indicate that LLMs like GPT-4o-mini and DeepSeekCoder-v2 effectively perform dead code insertion and control flow changes, demonstrating that LLM-driven assembly obfuscation is feasible. The paper establishes a Delta Entropy between 10% and 20% as the band for effective obfuscation, corroborated by high Cosine Similarity values indicating that the obfuscated code still closely resembles the original.
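Reusing the metric helpers sketched above, a simple acceptance check against the reported 10-20% band might look as follows; the `min_sim` cutoff is a hypothetical value, not a threshold taken from the paper.

```python
def is_effective_obfuscation(original, obfuscated,
                             low=10.0, high=20.0, min_sim=0.5):
    """Check the reported acceptance band: Delta Entropy within 10-20%,
    with Cosine Similarity high enough that the obfuscated code still
    resembles the original. min_sim is an illustrative assumption."""
    return (low <= delta_entropy(original, obfuscated) <= high
            and cosine_similarity(original, obfuscated) >= min_sim)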
The paper finds that while LLMs show promise in assembly code obfuscation, their output varies with the complexity of the code and the specific obfuscation technique. This variability highlights the difficulty of preserving functional behavior while altering code structure. The research also suggests improving how LLM capabilities are tested in this context, particularly regarding real-time adaptability and robustness against practical anti-obfuscation tactics.
Theoretical and Practical Implications
From a theoretical perspective, this research expands the understanding of LLMs' abilities beyond natural language processing into the domain of code obfuscation. It points to a paradigm in which LLMs function as automatic code obfuscators requiring minimal human intervention after training, a significant shift from traditional static obfuscation engines, which are often platform-dependent and costly to maintain.
Practically, this paper raises awareness of potential risks associated with LLMs in cybersecurity contexts, particularly concerning their misuse in developing malware that is dynamically obfuscated. It prompts future investigations into more sophisticated LLMs capable of generating even more intricate obfuscation patterns. Furthermore, it underscores the necessity for advanced detection mechanisms that incorporate machine learning-based defenses against such evolving threats.
In conclusion, the paper underscores the emerging capability of LLMs in code obfuscation, encouraging future research into advanced LLM frameworks alongside the development of enhanced detection strategies to mitigate the risks these models pose.