Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs (1808.04706v2)

Published 8 Aug 2018 in cs.SE, cs.CL, cs.CR, and cs.PL

Abstract: Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from NLP, a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.

Summary

The paper presents a neural network-based cross-lingual basic-block embedding model inspired by Neural Machine Translation.
It tackles two challenges: measuring semantic similarity across different ISAs and detecting code containment using control flow graphs and LCS.
Evaluation shows an AUC of up to 98% for basic-block comparison and large speedups over symbolic execution-based approaches.

The paper "Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs" tackles cross-architecture binary code similarity comparison. It proposes a method drawn from NLP, specifically Neural Machine Translation (NMT): instructions are treated as words and basic blocks as sentences. Under this analogy, NLP techniques such as word embeddings and Long Short-Term Memory (LSTM) networks can be applied to binary code, much as translation models relate sentences written in different human languages.
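To make the instruction-as-word analogy concrete, the following is a minimal sketch of turning a basic block into a "sentence" of tokens. The normalization rules (replacing numeric constants with an `<IMM>` placeholder, joining mnemonic and operands with `~`) and the function names are assumptions chosen for illustration; this summary does not specify the paper's actual preprocessing.

```python
import re
from typing import List

# Hypothetical rule: any decimal or hex constant becomes one placeholder token,
# so the vocabulary is not flooded with distinct numeric literals.
_NUMERIC = re.compile(r"^-?(0x[0-9a-fA-F]+|\d+)$")


def normalize_operand(operand: str) -> str:
    """Lowercase an operand and replace numeric constants with <IMM>."""
    operand = operand.strip()
    if operand.startswith("#"):  # ARM-style immediate prefix, e.g. #16
        operand = operand[1:]
    if _NUMERIC.match(operand):
        return "<IMM>"
    return operand.lower()


def instruction_to_word(instruction: str) -> str:
    """Collapse one assembly instruction into a single token (a "word").

    Operands are split on commas, so bracketed memory operands that contain
    commas are not handled by this simplified sketch.
    """
    mnemonic, _, rest = instruction.strip().partition(" ")
    operands = [normalize_operand(o) for o in rest.split(",") if o.strip()]
    return "~".join([mnemonic.lower()] + operands)


def block_to_sentence(block: List[str]) -> List[str]:
    """Map a basic block (list of instructions) to a list of words."""
    return [instruction_to_word(instruction) for instruction in block]


x86_block = ["mov eax, 0x10", "add eax, ebx", "ret"]
arm_block = ["MOV R0, #16", "ADD R0, R0, R1"]
x86_sentence = block_to_sentence(x86_block)
arm_sentence = block_to_sentence(arm_block)
```

Token sequences like these could then be fed to a word-embedding model per ISA, in the same way sentences from two natural languages are embedded separately before being compared.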

Core Contributions

Cross-Lingual Basic-Block Embedding Model: The work introduces a neural network-based model to compare the similarity of basic blocks compiled for different Instruction Set Architectures (ISAs). The model uses a Siamese network architecture with LSTM to learn embeddings for basic blocks automatically, so that semantically similar blocks map to nearby vectors regardless of ISA. This contrasts with conventional methods that depend on manually selected features.
Problem Definition and Solution: The paper addresses two distinct problems:
- Determining semantic similarity between basic blocks from binaries compiled for different ISAs.
- Detecting code containment where a code component, which might span multiple functions or be part of a function, is identified in a different binary compiled for another ISA. For this task, the paper presents InnerEye-CC, an extension that employs path exploration over control flow graphs (CFGs) using Longest Common Subsequence (LCS) calculations to compare code components.
Implementation and Evaluation: The authors implemented a prototype system named InnerEye and conducted comprehensive evaluations. InnerEye outperformed existing approaches for cross-architecture basic block comparison in terms of accuracy, efficiency, and scalability. The case study results showed that the method can effectively identify code similarities across architectures.
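The containment check described above can be sketched as an LCS computation in which two basic blocks "match" when their embeddings are similar enough. This is an illustrative sketch, not InnerEye-CC itself: the embeddings are assumed to be precomputed by some model, the scoring function exp(-L1 distance) is an assumption commonly used with Siamese networks, and the names `block_similarity`, `fuzzy_lcs_length`, `containment_score`, and the 0.5 threshold are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def block_similarity(a: Vector, b: Vector) -> float:
    """Assumed scoring: exp(-Manhattan distance), 1.0 for identical vectors."""
    return math.exp(-sum(abs(x - y) for x, y in zip(a, b)))


def fuzzy_lcs_length(query: List[Vector], target: List[Vector],
                     threshold: float = 0.5) -> int:
    """LCS length where blocks match if their similarity reaches threshold."""
    rows, cols = len(query), len(target)
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if block_similarity(query[i - 1], target[j - 1]) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[rows][cols]


def containment_score(query_path: List[Vector],
                      target_paths: List[List[Vector]],
                      threshold: float = 0.5) -> float:
    """Best LCS over candidate CFG paths, normalized by the query length."""
    if not query_path:
        return 0.0
    best = max(
        (fuzzy_lcs_length(query_path, path, threshold) for path in target_paths),
        default=0,
    )
    return best / len(query_path)


# Toy 2-dimensional "embeddings"; real embeddings would come from a model.
query = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
matching_path = [(0.0, 0.1), (5.0, 5.0), (1.0, 1.0), (2.0, 2.05)]
unrelated_path = [(9.0, 9.0)]
score = containment_score(query, [unrelated_path, matching_path])
```

In this toy input every query block finds a close counterpart, in order, along `matching_path`, so the normalized score is 1.0 even though the target path contains an extra unrelated block. A real system would also need a strategy for enumerating paths in the target CFG, which this sketch takes as given.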

Numerical Outcomes and Claims

The research demonstrates high accuracy with an Area Under the Curve (AUC) score of up to 98% for block comparison. This is achieved through a systematic evaluation using diverse datasets compiled across different architectures and optimization levels.
The proposed methods significantly reduce computation time, offering a reported speedup of 3700x to 140000x compared to symbolic execution-based approaches traditionally used for similar tasks.
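For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up illustrative values, not data from the paper, and `roc_auc` is a hypothetical helper name.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(P * N), which is fine for a small
    illustration but not for large evaluation sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = similar block pair, 0 = dissimilar block pair.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = roc_auc(example_labels, example_scores)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75; an AUC of 98% means almost every similar pair outscores almost every dissimilar one.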

Theoretical and Practical Implications

The presented approach bridges the gap between binary analysis and NLP, offering a more efficient and precise method for binary similarity detection without relying on source code access. The neural embeddings provide a nuanced view of binary semantics beyond manually crafted features, which tend to miss instruction dependencies and semantic nuances.

Practically, this opens doors for enhanced applications in areas like cross-architecture vulnerability discovery and code plagiarism detection. The ability to analyze binary code efficiently and accurately across differing architectures promises advancements in cybersecurity, especially in detecting vulnerabilities and malware across IoT devices with diverse hardware and software configurations.

Speculation on Future Developments

The successful adaptation of NLP techniques to binary analysis suggests that machine learning may apply more broadly to areas of computer science traditionally constrained by computational cost and complexity. The scalability of this method could further enable real-time vulnerability detection in complex distributed systems, and hybrid models that combine learned embeddings with symbolic analysis may emerge, capitalizing on the strengths of both approaches.

In conclusion, this research represents a significant step in adapting machine learning techniques to solve complex problems in binary analysis, with far-reaching implications for securing software across diverse hardware ecosystems.
