
Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary (2306.02546v4)

Published 5 Jun 2023 in cs.SE

Abstract: Decompilation aims to recover the source-code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is recovering variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from the pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B. We fine-tune GenNm on decompiled functions and teach the models to leverage contextual information. When querying a function, GenNm includes names from its callers and callees, providing rich contextual information within the model's input token limit. We mitigate model biases by aligning the output distribution of the models with the symbol preferences of developers. Our results show that GenNm improves state-of-the-art name-recovery precision by 5.6-11.4 percentage points on two commonly used datasets, and improves the state of the art by 32% (from 17.3% to 22.8%) in the most challenging setup, where ground-truth variable names are not seen in the training dataset.
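To make the two ideas in the abstract concrete (context-augmented prompting and symbol-preference alignment), the sketch below shows a minimal, hypothetical Python version: it packs caller/callee names into the prompt of a Hugging Face causal LM, scores candidate variable names under the model, and re-ranks them with a developer name-frequency prior. The prompt format, the `build_prompt`/`score_candidate` helpers, and the `name_prior` table are illustrative assumptions for this sketch, not GenNm's actual implementation or alignment procedure.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the base models GenNm fine-tunes; any causal LM works for this sketch.
MODEL_ID = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def build_prompt(decompiled_fn, caller_names, callee_names):
    # Pack symbol names seen in callers/callees next to the decompiled body,
    # so cross-function context fits within the model's input token limit.
    # This prompt layout is a hypothetical illustration.
    header = (
        f"// names in callers: {', '.join(caller_names) or 'none'}\n"
        f"// names in callees: {', '.join(callee_names) or 'none'}\n"
    )
    return header + decompiled_fn + "\n// a better name for variable v1 is: "

@torch.no_grad()
def score_candidate(prompt, name):
    # Log-likelihood of the candidate name as a continuation of the prompt.
    # Assumes the prompt's tokenization is a prefix of prompt+name's.
    full = tokenizer(prompt + name, return_tensors="pt")
    prefix_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    ids = full["input_ids"][0]
    logits = model(**full).logits[0]
    logprobs = torch.log_softmax(logits[:-1], dim=-1)  # logits[i] predicts ids[i+1]
    return logprobs[prefix_len - 1:].gather(1, ids[prefix_len:, None]).sum().item()

# Hypothetical developer-preference prior, e.g. name frequencies mined from
# open-source code; stands in for the paper's alignment of the output
# distribution with developer symbol preferences.
name_prior = {"count": 0.3, "idx": 0.25, "square": 0.2, "tmp": 0.05}

prompt = build_prompt("int f(int v1) { return v1 * v1; }", ["sum_of_squares"], [])
ranked = sorted(
    name_prior,
    key=lambda n: score_candidate(prompt, n) + math.log(name_prior[n]),
    reverse=True,
)
print(ranked[0])  # candidate favored by both the model and the prior
```

Combining the model's log-likelihood with a log-prior is one simple way to bias generation away from generic names the model overproduces (such as `tmp`) toward names developers actually write; the paper's alignment method is more involved than this additive re-ranking.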

Authors (11)
  1. Xiangzhe Xu (14 papers)
  2. Zhuo Zhang (42 papers)
  3. Shiwei Feng (27 papers)
  4. Yapeng Ye (5 papers)
  5. Zian Su (10 papers)
  6. Nan Jiang (210 papers)
  7. Siyuan Cheng (41 papers)
  8. Lin Tan (25 papers)
  9. Xiangyu Zhang (328 papers)
  10. Ziyang Huang (23 papers)
  11. Danning Xie (6 papers)
Citations (3)