XDA: Accurate, Robust Disassembly with Transfer Learning (2010.00770v3)

Published 2 Oct 2020 in cs.CR and cs.LG

Abstract: Accurate and robust disassembly of stripped binaries is challenging. The root of the difficulty is that high-level structures, such as instruction and function boundaries, are absent in stripped binaries and must be recovered based on incomplete information. Current disassembly approaches rely on heuristics or simple pattern matching to approximate the recovery, but these methods are often inaccurate and brittle, especially across different compiler optimizations. We present XDA, a transfer-learning-based disassembly framework that learns different contextual dependencies present in machine code and transfers this knowledge for accurate and robust disassembly. We design a self-supervised learning task motivated by masked language modeling to learn interactions among byte sequences in binaries. The outputs from this task are byte embeddings that encode sophisticated contextual dependencies between input binaries' byte tokens, which can then be finetuned for downstream disassembly tasks. We evaluate XDA's performance on two disassembly tasks, recovering function boundaries and assembly instructions, on a collection of 3,121 binaries taken from SPEC CPU2017, SPEC CPU2006, and the BAP corpus. The binaries are compiled by GCC, ICC, and MSVC on x86/x64 Windows and Linux platforms over 4 optimization levels. XDA achieves 99.0% and 99.7% F1 score at recovering function boundaries and instructions, respectively, surpassing the previous state-of-the-art on both tasks. It also maintains speed on par with the fastest ML-based approach and is up to 38x faster than hand-written disassemblers like IDA Pro. We release the code of XDA at https://github.com/CUMLSec/XDA.

Citations (53)

Summary

  • The paper introduces XDA, a transfer learning framework that enhances binary disassembly accuracy by mimicking masked language modeling techniques.
  • It employs a two-phase approach—pretraining with masked LM followed by task-specific finetuning—to accurately recover function boundaries and assembly instructions.
  • Experimental results show XDA achieving F1 scores of 99.0% and 99.7%, while operating up to 38 times faster than traditional disassemblers.

A Transfer Learning Approach for Robust and Accurate Disassembly: The XDA Framework

The paper "XDA: Accurate, Robust Disassembly with Transfer Learning" presents a framework for disassembling stripped binaries, a task central to binary analysis workflows such as reverse engineering and malware analysis. The difficulty arises primarily because high-level structures such as instruction and function boundaries are absent from stripped binaries and must be inferred from incomplete information.

The proposed framework, XDA (Xfer-learning DisAssembler), leverages transfer learning through a self-supervised learning task inspired by the masked language modeling (LM) objective used in natural language processing. This approach lets the framework learn contextual dependencies between byte sequences within binaries, yielding more accurate and robust recovery of function boundaries and assembly instructions.

Methodology

The research introduces a two-phase learning process:

  1. Pretraining with Masked LM: In the pretraining phase, XDA masks a subset of bytes in each binary and trains the model to predict them from the surrounding context. This is analogous to masked language models such as BERT, where words are masked and predicted; here the bytes play the role of words, and the surrounding bytes provide the context needed to infer each masked byte's value (a minimal sketch follows this list).
  2. Finetuning for Specific Disassembly Tasks: Once pretrained, the model is finetuned on specific tasks like identifying function boundaries and recovering instructions. This finetuning process allows the model to leverage foundational byte semantics learned during pretraining for precise disassembly outputs.
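To make the pretraining step concrete, here is a minimal sketch of masked-byte pretraining in PyTorch. It is not the authors' implementation (the released code builds on a RoBERTa-style encoder); the model sizes, mask-token id, and training loop below are illustrative assumptions.

```python
# Minimal sketch of XDA-style masked-byte pretraining (toy hyperparameters,
# hypothetical <mask> token id; not the paper's exact architecture).
import torch
import torch.nn as nn

VOCAB = 256 + 1          # one token per byte value, plus a reserved <mask> token
MASK_ID = 256            # hypothetical id for the <mask> token
SEQ_LEN = 512            # context window over raw bytes

class ByteMaskedLM(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)    # predicts the original byte value

    def forward(self, byte_ids):
        pos = torch.arange(byte_ids.size(1), device=byte_ids.device)
        h = self.embed(byte_ids) + self.pos(pos)[None, :, :]
        h = self.encoder(h)                      # contextual byte embeddings
        return self.head(h)                      # logits over byte values

def mask_bytes(byte_ids, mask_prob=0.15):
    """Randomly mask bytes; only masked positions contribute to the loss."""
    labels = byte_ids.clone()
    masked = torch.rand(byte_ids.shape) < mask_prob
    labels[~masked] = -100                       # ignored by cross-entropy
    return byte_ids.masked_fill(masked, MASK_ID), labels

model = ByteMaskedLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# Toy batch standing in for byte sequences sliced from stripped binaries.
batch = torch.randint(0, 256, (8, SEQ_LEN))
inputs, labels = mask_bytes(batch)
logits = model(inputs)                           # (batch, SEQ_LEN, VOCAB)
loss = loss_fn(logits.reshape(-1, VOCAB), labels.reshape(-1))
loss.backward()
optimizer.step()
```

Because only the masked positions contribute to the loss, the encoder must reconstruct hidden bytes from the unmasked context, which is the signal XDA uses to learn byte-level dependencies without any labels.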

The methodological strength of XDA lies in its Transformer encoder architecture, whose self-attention mechanism produces context-aware representations of every byte in the input window.
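Continuing the sketch above, the finetuning phase can be pictured as per-byte sequence labeling: the pretrained encoder is reused and a small classification head tags each byte, e.g. as a function start, function end, or neither for boundary recovery. The class names, label set, and hyperparameters below are illustrative rather than the paper's exact configuration; `ByteMaskedLM`, `model`, and `SEQ_LEN` come from the previous sketch.

```python
# Hedged sketch of the finetuning stage: reuse the pretrained encoder and
# train a per-byte classification head for function boundary recovery.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    NUM_LABELS = 3   # 0 = neither, 1 = function start, 2 = function end (illustrative)

    def __init__(self, pretrained: ByteMaskedLM, d_model=128):
        super().__init__()
        self.backbone = pretrained                    # reuse the pretrained byte encoder
        self.classifier = nn.Linear(d_model, self.NUM_LABELS)

    def forward(self, byte_ids):
        pos = torch.arange(byte_ids.size(1), device=byte_ids.device)
        h = self.backbone.embed(byte_ids) + self.backbone.pos(pos)[None, :, :]
        h = self.backbone.encoder(h)                  # embeddings learned in pretraining
        return self.classifier(h)                     # per-byte label logits

tagger = BoundaryTagger(model)                        # `model` from the pretraining sketch
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Toy supervised batch: byte sequences paired with per-byte boundary labels.
bytes_in = torch.randint(0, 256, (8, SEQ_LEN))
boundary_labels = torch.randint(0, 3, (8, SEQ_LEN))
loss = loss_fn(tagger(bytes_in).reshape(-1, BoundaryTagger.NUM_LABELS),
               boundary_labels.reshape(-1))
loss.backward()
optimizer.step()
```

The same pretrained encoder is finetuned separately for each downstream task (function boundaries and instruction recovery), which is what allows a single round of self-supervised pretraining to serve multiple disassembly tasks.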

Experimental Results and Performance

The authors provide an extensive evaluation of XDA's performance across 3,121 binaries from the SPEC CPU2017, SPEC CPU2006, and BAP datasets. The binaries are compiled with GCC, ICC, and MSVC for x86/x64 on Windows and Linux, across four optimization levels. In empirical evaluations:

  • Function Boundary Recovery: XDA achieved an average F1 score of 99.0%, exceeding the previous state-of-the-art methods by 17.2%.
  • Assembly Instruction Recovery: XDA reached an F1 score of 99.7%.

Moreover, XDA demonstrated remarkable efficiency, being up to 38 times faster than traditional handwritten disassemblers like IDA Pro.

Implications and Future Directions

The research highlights the potential of leveraging large, unlabeled datasets for pretraining to significantly enhance the efficacy of models in specialized tasks like binary disassembly. This approach not only surpasses traditional heuristic-driven methods in accuracy and robustness but also scales efficiently with larger datasets.

From a theoretical standpoint, this work underlines the effectiveness of transfer learning in domains outside of natural language processing, specifically within binary analysis, which relies heavily on understanding structural dependencies in data.

Practically, the XDA framework can transform binary analysis workflows, especially in environments with diverse compilers and optimization settings. Since the authors also demonstrate accurate handling of obfuscated binaries, the method has clear applications in cybersecurity.

The open-source release of XDA offers a platform for further research and development, potentially prompting advancements in related tasks such as control-flow integrity, software patch analysis, and even more general reverse engineering tasks. Future research could extend the applicability of this model to other architectures or explore its integration into compilers for real-time optimization and security reinforcement.
