Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning (2311.13721v5)

Published 22 Nov 2023 in cs.SE and cs.AI

Abstract: Binary code analysis is the foundation of crucial tasks in the security domain; thus building effective binary analysis techniques is more important than ever. LLMs, although they have brought impressive improvements to source code tasks, do not directly generalize to assembly code due to the unique challenges of assembly: (1) the low information density of assembly and (2) the diverse optimizations in assembly code. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture the semantics more effectively, and designs contrastive learning objectives to train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 14.84–21.58% (absolute percentage point improvement) higher Pass@1 and Pass@10, and outperforms the latest binary code similarity detection techniques by up to 6.17% Recall@1, showing promising abilities on both assembly generation and understanding tasks.
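
The Pass@1 and Pass@10 figures above are presumably computed with the standard unbiased pass@k estimator from the Codex evaluation methodology; for reference, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of them correct),
    passes; equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))   # 0.3
print(pass_at_k(10, 3, 10))  # 1.0
```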

Summary of "Nova+: Generative LLMs for Binaries"

The paper "Nova+: Generative LLMs for Binaries" contributes to the emerging intersection of generative LLMs and binary code analysis by introducing two specialized LLMs, Nova and Nova+, pre-trained on binary corpora. Despite the prolific use of LLMs across software engineering, they are typically trained on high-level programming languages and tend to falter when applied to binary code due to the challenges posed by hexadecimal values, global dependencies, and diverse compiler optimization levels.

Key Contributions

  1. Nova and Nova+ LLMs: Nova is the first generative LLM pre-trained specifically on a binary code corpus rather than high-level language code, addressing the lack of LLMs specialized in binary code. Nova+ builds on Nova by incorporating additional pre-training tasks—optimization generation and optimization level prediction—that enhance its understanding and handling of binary optimizations.
  2. Pre-Training Methodology: The unique aspect of this work is its approach to the challenges inherent in binary code:
    • Normalization: Binary functions are normalized by replacing instruction addresses and function calls with placeholders, making them more self-contained and feasible for LLMs to learn (see the normalization sketch after this list).
    • Optimization Generation and Prediction Tasks: These pre-training tasks are crafted to enable the model to learn equivalences between binaries across different optimization levels, imitating compiler optimizations (a contrastive sketch follows the list).
  3. Downstream Task Evaluation: The effectiveness of Nova and Nova+ is demonstrated through performance evaluations on three downstream tasks relevant to binary code analysis:
    • Binary Code Similarity Detection (BCSD): Both models outperform state-of-the-art techniques such as jTrans, especially when faced with large candidate pools.
    • Binary Code Translation (BCT): Nova and Nova+ outperform GPT-3.5 at translating between x86-64 and ARM64.
    • Binary Code Recovery (BCR): The models are highly effective, often reconstructing source code more accurately than existing baselines such as Coda and Neutron.
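
The summary describes normalization only at a high level; the sketch below is a minimal illustration of the idea for x86-64 disassembly. The regexes and the <ADDR>/<FUNC> placeholder tokens are assumptions for illustration, not the paper's exact scheme.

```python
import re

# Assumed placeholder scheme: mask call targets first, then any
# remaining hexadecimal addresses.
CALL_TARGET = re.compile(r"(\bcall\s+)\S+")
HEX_ADDR = re.compile(r"\b0x[0-9a-fA-F]+\b")

def normalize(asm: str) -> str:
    """Make a disassembled function more self-contained by replacing
    concrete addresses and call targets with placeholders."""
    out = []
    for line in asm.splitlines():
        line = CALL_TARGET.sub(r"\1<FUNC>", line)  # call 0x401020 -> call <FUNC>
        line = HEX_ADDR.sub("<ADDR>", line)        # jmp 0x4010f3  -> jmp <ADDR>
        out.append(line)
    return "\n".join(out)

print(normalize("call 0x401020\nmov rax, 0x4040a0"))
```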
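
For the optimization-related objectives, a natural formulation treats the same source function compiled at different optimization levels as a positive pair. The in-batch InfoNCE loss below is a generic contrastive sketch in PyTorch, a stand-in rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Row i of `anchor` (e.g. a function's embedding at -O0) should
    match row i of `positive` (the same function at -O3); the other
    rows in the batch act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                   # (B, B) cosine sims
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: random 128-d embeddings for a batch of 8 functions
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```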

Numerical Results and Implications

Most notably, Nova+ shows significant improvements in MRR scores on binary code similarity detection, indicating robust handling of optimizations and other compiler-induced variations across functionally equivalent binaries. An LLM such as Nova+ that works effectively with binary code has substantial implications for automating tasks like vulnerability detection, software porting, and forensic analysis, which have traditionally relied on heuristic techniques or manually intensive processes.
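
For context, MRR and Recall@1 over a BCSD candidate pool reduce to simple statistics of the rank of each query's true match; a minimal sketch (pool construction is omitted and would follow the benchmark setup):

```python
def mrr_and_recall_at_1(ranks: list[int]) -> tuple[float, float]:
    """ranks[i] is the 1-based rank of query i's true match (e.g. the
    same function compiled at another optimization level) within its
    candidate pool."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall_at_1 = sum(r == 1 for r in ranks) / len(ranks)
    return mrr, recall_at_1

print(mrr_and_recall_at_1([1, 2, 1, 5]))  # (0.675, 0.5)
```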

Future Directions

The research identifies potential avenues for extending the models' capabilities to learn more complex global dependencies within binaries. Future advancements could integrate whole-executable learning or expand architecture coverage beyond x86-64 and ARM64 to other instruction set architectures. Additionally, refining the handling of real-world challenges, such as obfuscated or packed binaries, remains a key development target, enabling these LLMs to tackle more sophisticated security and reverse engineering tasks.

This groundbreaking work lays a foundation for further exploration in machine learning applications pertinent to low-level software artifacts and could substantially impact the domains of reverse engineering, malware analysis, and beyond.

Authors (7)
  1. Nan Jiang (210 papers)
  2. Chengxiao Wang (2 papers)
  3. Kevin Liu (33 papers)
  4. Xiangzhe Xu (14 papers)
  5. Lin Tan (25 papers)
  6. Xiangyu Zhang (328 papers)
  7. Petr Babkin (6 papers)
Citations (4)