Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

Published 25 Feb 2019 in cs.LG, cs.CR, cs.PL, and stat.ML | (1902.09122v4)

Abstract: We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations. We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites, encode the structure of these call sites using the control-flow graph (CFG), and finally generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures. Our evaluation shows that our models produce predictions that are difficult and time-consuming for humans, while improving on existing methods by 28% and by 100% over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.

Summary

The paper proposes a hybrid approach combining static analysis with neural networks to predict procedure names in stripped binaries.
It uses augmented control flow graphs to capture the structure of call sites together with information about their arguments.
Experimental results show up to a 35% improvement in F1 score over baselines, underscoring the effectiveness of graph-based models.

The paper "Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs" by Yaniv David, Uri Alon, and Eran Yahav addresses the challenge of reverse engineering (RE) stripped executables, which contain no debug information. Such binaries offer little syntactic information, and compiler optimizations produce diverse assembly code patterns. The paper proposes a method for predicting procedure names in these binaries that combines static analysis with neural models.

The authors introduce a representation of binary procedures designed for name prediction. Static analysis produces augmented representations of call sites, and the control flow graph (CFG) encodes how those call sites are connected. A neural model then generates the target name while attending to the encoded call sites.

A key strength of the method is its hybrid design, which pairs the results of static analysis with the learning capacity of neural models. Call sites are reconstructed by analyzing the control flow within each procedure, recovering not only the API calls but also information about their arguments, and the result is organized as a graph.
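The idea of a graph whose nodes are reconstructed call sites can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `CallSite` and `AugmentedCFG`, the argument labels such as `CONST` and `RET:socket`, and the path enumeration are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CallSite:
    """A reconstructed call site (hypothetical layout)."""
    callee: str   # name of the called API
    args: tuple   # illustrative labels describing each argument


@dataclass
class AugmentedCFG:
    """Graph whose nodes carry call sites and whose edges follow control flow."""
    nodes: dict = field(default_factory=dict)  # node id -> CallSite
    edges: dict = field(default_factory=dict)  # node id -> successor ids

    def add_node(self, node_id, site):
        self.nodes[node_id] = site
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def paths(self, start):
        """Yield acyclic call-site sequences reachable from `start`."""
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            successors = [s for s in self.edges[node] if s not in path]
            if not successors:
                # Dead end (or only back edges remain): emit the sequence.
                yield [self.nodes[n] for n in path]
            for succ in successors:
                stack.append((succ, path + [succ]))


# Toy procedure: open a socket, optionally connect, then close.
cfg = AugmentedCFG()
cfg.add_node(0, CallSite("socket", ("CONST", "CONST", "CONST")))
cfg.add_node(1, CallSite("connect", ("RET:socket", "STACK", "CONST")))
cfg.add_node(2, CallSite("close", ("RET:socket",)))
cfg.add_edge(0, 1)
cfg.add_edge(0, 2)
cfg.add_edge(1, 2)

for path in cfg.paths(0):
    print(" -> ".join(site.callee for site in path))
```

A sequence model could consume the enumerated call-site paths, while a graph model could consume the nodes and edges directly. How the actual system encodes either form is not specified in this summary.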

The experimental evaluation shows clear improvements over existing methods. This summary cites F1 gains of 28% over DIRE and 35% over Debin, while the abstract states the gains as 28% over existing methods and 100% over state-of-the-art neural textual models that use no static analysis. The representation is paired with three neural architectures (LSTM, Transformer, and graph neural networks (GNNs)), which indicates that it is not tied to a single model family.
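For readers unfamiliar with how an F1 score applies to predicted names, the following sketch computes it for a single prediction. It assumes that names are compared as case-insensitive subtokens split on underscores and camelCase boundaries; this summary does not state the granularity of the metric, and the helper names `subtokens` and `subtoken_f1` are hypothetical.

```python
import re
from collections import Counter


def subtokens(name):
    """Split a name on underscores and camelCase boundaries, lowercased."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def subtoken_f1(predicted, reference):
    """F1 between the subtoken multisets of two procedure names."""
    pred = Counter(subtokens(predicted))
    ref = Counter(subtokens(reference))
    overlap = sum((pred & ref).values())  # subtokens present in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Predicting "get_file" for a procedure actually named "getFileSize":
# precision = 2/2, recall = 2/3.
score = subtoken_f1("get_file", "getFileSize")
print(round(score, 3))
```

A partially correct name therefore earns partial credit rather than being counted as a miss. A corpus-level score would typically aggregate these counts over all procedures rather than average per-name values.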

The results indicate that combining the augmented CFG-based representation with neural networks outperforms baselines that use only raw assembly instructions or decompiled sequences and ignore structural and semantic information. Among the three architectures, the GNN-based model performs best, which points to the value of graph-structured input for capturing the possible code paths through a procedure, an important signal for accurate procedure name prediction.

The authors also evaluate variations such as API obfuscation, which tests how the approach holds up against the anti-RE techniques found in real-world binaries. They conclude with an ablation study that quantifies the contribution of each component and finds that augmenting call sites with abstract and concrete values is a major driver of the performance gains.
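One simple way to simulate API obfuscation on a call-site sequence is to replace each distinct API name with an opaque placeholder. The sketch below is an assumption made for illustration and is not necessarily the procedure used in the paper: the function name `obfuscate_api_names`, the `api_N` placeholder scheme, and the `(callee, args)` tuple layout are all hypothetical.

```python
def obfuscate_api_names(call_sites):
    """Replace each distinct callee name with a consistent opaque placeholder.

    `call_sites` is a list of (callee, args) tuples (hypothetical layout).
    Arguments are left untouched, so only the API names are hidden.
    """
    mapping = {}
    obfuscated = []
    for callee, args in call_sites:
        if callee not in mapping:
            mapping[callee] = f"api_{len(mapping)}"
        obfuscated.append((mapping[callee], args))
    return obfuscated


sites = [
    ("socket", ("CONST", "CONST")),
    ("connect", ("RET", "STACK")),
    ("socket", ("CONST", "CONST")),
]
for callee, args in obfuscate_api_names(sites):
    print(callee, args)
```

Under this scheme a model can no longer rely on the API names themselves and has to work from the remaining structure and argument information.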

Looking ahead, this research lays groundwork that may benefit software security research, particularly malware analysis, by providing better automated tools for understanding binaries. Its systematic integration of static analysis with neural models is a promising direction for reducing the costly manual effort that reverse engineering currently requires.

Overall, the paper contributes a framework that addresses key limitations of existing approaches and shows that neural models can be a useful aid in software analysis and security. It makes a compelling case for feeding the output of static analysis into neural architectures, balancing expert-driven knowledge with the adaptability of machine learning.
