- The paper introduces SAFE, a novel method using self-attentive neural networks to create binary function embeddings directly from disassembled code, bypassing traditional manual feature engineering.
- SAFE's architecture processes instruction sequences with a GRU RNN and an attention mechanism, and it outperforms existing methods such as Gemini in both accuracy and efficiency across a range of tasks.
- The self-attentive embeddings enable effective real-world applications like semantic function search and high-precision vulnerability detection across different compilers and platforms.
SAFE: Advancements in Binary Similarity Through Self-Attentive Function Embeddings
The paper "SAFE: Self-Attentive Function Embeddings for Binary Similarity" addresses the computational challenge of determining similarities between binary functions, a problem with significant implications in areas such as malware analysis and vulnerability detection. The research introduces a novel approach named SAFE, which employs self-attentive neural networks to generate embeddings for binary functions without relying on manual feature extraction or control flow graph computations, thereby enhancing the efficiency and applicability of binary similarity detection.
Key Insights and Methodology
SAFE presents an architecture that constructs embeddings directly from disassembled binary functions, sidestepping traditional methods that depend on manual feature selection and control flow graphs. This approach is advantageous in several respects:
- Efficiency and Generality: SAFE avoids the computational overhead of constructing control flow graphs, making it notably faster and able to handle stripped binaries across multiple instruction-set architectures.
- Self-Attentive Neural Network Architecture: The core innovation is a self-attentive network, specifically a GRU RNN combined with an attention mechanism, that processes a binary function's instruction sequence much like a sentence in natural language. This lets the model weigh instructions dynamically according to their relevance for similarity detection (a minimal sketch of such an architecture follows this list).
- Semantic Embeddings: The research finds that clusters formed by the learned embeddings correlate strongly with the algorithmic semantics of the binary functions, opening promising avenues for semantics-based function search.
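To make the architecture description concrete, below is a minimal sketch of a self-attentive encoder in the spirit of SAFE: instruction embeddings feed a bi-directional GRU, and a multi-hop attention layer pools the GRU states into a fixed-size function embedding. The hyperparameters, the plain trainable instruction lookup (standing in for SAFE's pre-trained instruction embeddings), and the final normalization are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a self-attentive function-embedding network in the spirit of
# SAFE (GRU RNN + attention over instruction sequences). Hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveEncoder(nn.Module):
    def __init__(self, vocab_size, instr_dim=100, hidden_dim=50,
                 attn_dim=250, attn_hops=10):
        super().__init__()
        # Instruction embeddings (a plain learnable lookup table here; SAFE
        # pre-trains instruction vectors separately).
        self.i2v = nn.Embedding(vocab_size, instr_dim, padding_idx=0)
        # Bi-directional GRU over the instruction sequence.
        self.gru = nn.GRU(instr_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        # Structured self-attention: two projections produce `attn_hops`
        # attention distributions over the sequence positions.
        self.w1 = nn.Linear(2 * hidden_dim, attn_dim, bias=False)
        self.w2 = nn.Linear(attn_dim, attn_hops, bias=False)
        # Dense layer that collapses the attention heads into one vector.
        self.out = nn.Linear(attn_hops * 2 * hidden_dim, 2 * hidden_dim)

    def forward(self, instr_ids):              # instr_ids: (batch, seq_len)
        x = self.i2v(instr_ids)                # (batch, seq_len, instr_dim)
        h, _ = self.gru(x)                     # (batch, seq_len, 2*hidden_dim)
        # Attention weights over the sequence: (batch, attn_hops, seq_len)
        a = F.softmax(self.w2(torch.tanh(self.w1(h))).transpose(1, 2), dim=-1)
        m = torch.bmm(a, h)                    # (batch, attn_hops, 2*hidden_dim)
        emb = self.out(m.flatten(1))           # (batch, 2*hidden_dim)
        return F.normalize(emb, dim=-1)        # unit-length function embedding

# Usage: embed two toy instruction-id sequences and compare them.
if __name__ == "__main__":
    enc = SelfAttentiveEncoder(vocab_size=5000)
    f1 = torch.randint(1, 5000, (1, 60))       # 60 instructions
    f2 = torch.randint(1, 5000, (1, 80))       # 80 instructions
    e1, e2 = enc(f1), enc(f2)
    print("cosine similarity:", torch.sum(e1 * e2).item())
```

Because the encoder emits a fixed-size, normalized vector regardless of function length, similarity between two functions reduces to a dot product between their embeddings.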
Experimental Validation
SAFE's ability to outperform existing methods was demonstrated on several tasks:
- Single and Cross-Platform Tests: SAFE consistently outperformed Gemini, a leading prior solution, in ROC AUC on both single-platform and cross-platform similarity tasks, achieving near-perfect discrimination between similar and dissimilar function pairs.
- Function Search and Vulnerability Detection: SAFE proved effective in real-world scenarios, maintaining high precision and recall when locating similar functions and identifying vulnerable functions in large datasets built with distinct compilers (an embedding-based search sketch follows this list).
- Semantic Classification: By training a classifier on the embeddings, SAFE achieved 95% accuracy on functions grouped into four algorithmic classes, confirming that the embeddings capture semantic information (a corresponding classification sketch is also given below).
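As an illustration of how fixed-size embeddings enable function search and vulnerability detection, the sketch below ranks a corpus of precomputed function embeddings by cosine similarity to a query embedding (for example, that of a known-vulnerable function). The function name, embedding dimension, and data are assumptions for demonstration, not taken from the paper.

```python
# Illustrative sketch of embedding-based function search: rank a corpus of
# precomputed function embeddings by cosine similarity to a query embedding.
import numpy as np

def top_k_similar(query_emb, corpus_embs, k=10):
    """Return (indices, scores) of the k most similar corpus functions.

    query_emb:   (d,) embedding of the query function
    corpus_embs: (n, d) matrix of precomputed function embeddings
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                      # (n,) cosine similarities
    idx = np.argsort(-scores)[:k]       # highest similarity first
    return idx, scores[idx]

# Usage with random stand-in embeddings (dimension chosen arbitrarily).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100_000, 100))
query = corpus[42] + 0.05 * rng.normal(size=100)  # near-duplicate of entry 42
indices, scores = top_k_similar(query, corpus, k=5)
print(indices, scores)
```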
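Similarly, the semantic-classification experiment can be reproduced in spirit by training an off-the-shelf classifier on labeled function embeddings. The sketch below uses logistic regression on random stand-in data with four placeholder class labels; the paper's actual classifier, class names, and embeddings are not reproduced here.

```python
# Minimal sketch of semantic classification on function embeddings: train a
# simple classifier to predict an algorithmic class from each embedding.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: 2000 embeddings of dimension 100, each labeled with one of
# four algorithmic classes (labels 0-3 are placeholders).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 100))
y = rng.integers(0, 4, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```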
Implications and Future Directions
The implications of this paper are manifold. Practically, SAFE enhances the efficiency and applicability of binary similarity detection, potentially leading to faster and more accurate identification of vulnerabilities and malicious software. Theoretically, it opens new avenues for applying neural networks to interpret executable code structures, bolstering the field of binary analysis with machine learning insights.
Future research could explore deeper integration of symbolic information (such as dynamic library symbols) to refine the semantic classification capabilities. Extending the approach to real-time or resource-constrained environments, such as mobile platforms, could further broaden its practical use in cybersecurity.
This work demonstrates notable advancements in embedding technologies for binary analysis, positioning SAFE as a valuable tool for researchers and practitioners in the field. The paper offers comprehensive evaluations of the model's performance in diverse tasks, underscoring its potential in advancing both the efficiency and utility of binary function similarity assessment.