STRIDE: Simple Type Recognition In Decompiled Executables (2407.02733v1)
Abstract: Decompilers are widely used by security researchers and developers to reverse engineer executable code. While modern decompilers are adept at recovering instructions, control flow, and function boundaries, some useful information from the original source code, such as variable types and names, is lost during the compilation process. Our work aims to predict these variable types and names from the remaining information. We propose STRIDE, a lightweight technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data. We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming while being much simpler and faster. We perform a detailed comparison with two recent SOTA transformer-based models in order to understand the specific factors that make our technique effective. We implemented STRIDE in fewer than 1000 lines of Python and have open-sourced it under a permissive license at https://github.com/hgarrereyn/STRIDE.
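The matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `NGramTypePredictor`, the `<VAR>` placeholder, the context sizes, the regular expression for decompiler-generated names, and the voting rule are all assumptions chosen for illustration. The sketch records the tokens surrounding each labeled variable occurrence in training data, then predicts a type for a new variable by looking up its surrounding tokens, preferring the longest context that was seen during training.

```python
import re
from collections import Counter, defaultdict

VAR = "<VAR>"  # hypothetical placeholder for decompiler-generated names


def normalize(tokens):
    """Replace names such as v1 or a2 with a placeholder so contexts generalize."""
    return [VAR if re.fullmatch(r"[av]\d+", t) else t for t in tokens]


def context(tokens, idx, n):
    """Return up to n tokens on each side of position idx as a hashable key."""
    return tuple(tokens[max(0, idx - n):idx]), tuple(tokens[idx + 1:idx + 1 + n])


class NGramTypePredictor:
    def __init__(self, sizes=(4, 2, 1)):
        # Larger contexts are tried first (assumed back-off order).
        self.sizes = sorted(sizes, reverse=True)
        self.tables = {n: defaultdict(Counter) for n in self.sizes}

    def train(self, tokens, labels):
        """labels maps a token index to the ground-truth type at that position."""
        tokens = normalize(tokens)
        for idx, label in labels.items():
            for n in self.sizes:
                self.tables[n][context(tokens, idx, n)][label] += 1

    def predict(self, tokens, indices, default="int"):
        """Vote over all occurrences of one variable; back off to shorter contexts."""
        tokens = normalize(tokens)
        for n in self.sizes:
            votes = Counter()
            for idx in indices:
                votes.update(self.tables[n].get(context(tokens, idx, n), {}))
            if votes:
                return votes.most_common(1)[0][0]
        return default


# Usage: train on one labeled snippet, then query a snippet with different names.
model = NGramTypePredictor()
train_tokens = "for ( v1 = 0 ; v1 < a2 ; ++ v1 )".split()
model.train(train_tokens, {2: "int", 6: "int", 11: "int", 8: "size_t"})
query_tokens = "for ( v7 = 0 ; v7 < a1 ; ++ v7 )".split()
loop_var_type = model.predict(query_tokens, [2, 6, 11])
bound_type = model.predict(query_tokens, [8])
```

Plain lookup tables keep training and inference to dictionary operations, which fits the abstract's emphasis on simplicity and speed. The same structure could store names instead of types for the renaming task.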
- Harrison Green
- Edward J. Schwartz
- Claire Le Goues
- Bogdan Vasilescu