
STRIDE: Simple Type Recognition In Decompiled Executables (2407.02733v1)

Published 3 Jul 2024 in cs.CR

Abstract: Decompilers are widely used by security researchers and developers to reverse engineer executable code. While modern decompilers are adept at recovering instructions, control flow, and function boundaries, some useful information from the original source code, such as variable types and names, is lost during the compilation process. Our work aims to predict these variable types and names from the remaining information. We propose STRIDE, a lightweight technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data. We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming while being much simpler and faster. We perform a detailed comparison with two recent SOTA transformer-based models in order to understand the specific factors that make our technique effective. We implemented STRIDE in fewer than 1000 lines of Python and have open-sourced it under a permissive license at https://github.com/hgarrereyn/STRIDE.

Authors (4)
  1. Harrison Green
  2. Edward J. Schwartz
  3. Claire Le Goues
  4. Bogdan Vasilescu

Summary

A Consideration of STRIDE: A Simplified Yet Effective Approach to Variable Type and Name Prediction

The paper presents "STRIDE: Simple Type Recognition In Decompiled Executables," a technique for a persistent problem in reverse engineering: recovering variable types and names from decompiled executables. Developed by researchers at Carnegie Mellon University, STRIDE is a non-neural, statistical approach that matches and, on some benchmarks, surpasses more computationally intensive machine learning models.

Context and Problem Statement

Decompilation is a core part of reverse engineering, but variable names and types are discarded during compilation and cannot be read back from a binary that lacks debug metadata. Conventional decompilers reconstruct control flow and function boundaries well, yet they fall back on placeholder names and generic types for variables. The resulting code is harder to understand and modify, which complicates software maintenance, vulnerability analysis, and malware reverse engineering.

The STRIDE Methodology

STRIDE differentiates itself by employing an N-gram-based approach, an intuitive strategy that draws heavily from classical natural language processing techniques. The assumption is clear: the most informative clues for inferring a variable's type or name can be found in the contextual token sequences surrounding its occurrences in the decompiled code. The system constructs a database of these N-grams derived from training data, storing the most frequent variable names and types associated with each N-gram.
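The database construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`context_ngrams`, `build_database`), the `top_k` parameter, the `("L", ...)`/`("R", ...)` key layout, and the toy token sequences are all assumptions made for this example, and the sketch skips details such as normalizing other variable names in the context.

```python
from collections import Counter, defaultdict


def context_ngrams(tokens, index, n):
    """Return up to n tokens to the left and to the right of tokens[index]."""
    left = tuple(tokens[max(0, index - n):index])
    right = tuple(tokens[index + 1:index + 1 + n])
    return left, right


def build_database(examples, n, top_k=3):
    """Map each (side, n-gram) context to its top_k most frequent labels.

    `examples` is an iterable of (tokens, index, label) triples, where
    tokens[index] is one occurrence of a variable and label is its
    ground-truth type or name from the training data.
    """
    counts = defaultdict(Counter)
    for tokens, index, label in examples:
        left, right = context_ngrams(tokens, index, n)
        counts[("L", left)][label] += 1
        counts[("R", right)][label] += 1
    # Keep only the most frequent labels per context, as the text describes.
    return {key: counter.most_common(top_k) for key, counter in counts.items()}


# Toy training data (hypothetical decompiler token sequences).
examples = [
    ("v1 = strlen ( a1 ) ;".split(), 0, "size_t"),
    ("v2 = strlen ( a2 ) ;".split(), 0, "size_t"),
    ("v3 = strlen ( a1 ) ;".split(), 0, "int"),
]
db = build_database(examples, n=3)
print(db[("R", ("=", "strlen", "("))])
```

In this toy database, the right-hand context `= strlen (` is associated with `size_t` twice and `int` once, so `size_t` is stored as its most frequent label.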

This structured database allows STRIDE to match these token sequences with unseen data efficiently. During inference, STRIDE finds the largest matching N-grams surrounding a target variable and aggregates information from these matches, using it to propose likely names or types. The authors emphasize that larger, more precise N-gram matches indicate higher confidence in the prediction, allowing STRIDE to perform competitively against previous state-of-the-art techniques while operating faster and with less computational overhead.
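A rough sketch of this inference step is shown below, under stated assumptions. The database literal `DB`, the function names, and in particular the scoring rule (each match votes with weight equal to its N-gram length times the label's relative frequency) are choices made for this illustration; the summary does not specify the exact aggregation the paper uses.

```python
from collections import Counter

# Hypothetical pre-built database: (side, n-gram) -> {label: training count}.
DB = {
    ("R", ("=", "strlen", "(")): {"size_t": 2, "int": 1},
    ("R", ("=",)): {"int": 50, "size_t": 9},
    ("L", ("<",)): {"int": 3, "size_t": 1},
}


def longest_match(db, side, context, max_n):
    """Return (n, label_counts) for the longest n-gram next to the variable."""
    for n in range(min(max_n, len(context)), 0, -1):
        # Left contexts end at the variable; right contexts start after it.
        gram = tuple(context[-n:]) if side == "L" else tuple(context[:n])
        hit = db.get((side, gram))
        if hit:
            return n, hit
    return 0, {}


def predict(db, tokens, occurrences, max_n=3):
    """Aggregate votes over all occurrences; longer matches weigh more."""
    scores = Counter()
    for i in occurrences:
        for side, context in (("L", tokens[:i]), ("R", tokens[i + 1:])):
            n, counts = longest_match(db, side, context, max_n)
            total = sum(counts.values())
            for label, count in counts.items():
                # Assumed weighting: match length times relative frequency.
                scores[label] += n * count / total
    return scores.most_common(1)[0][0] if scores else None


tokens = "v1 = strlen ( s ) ; if ( i < v1 ) return ;".split()
occurrences = [i for i, tok in enumerate(tokens) if tok == "v1"]
print(predict(DB, tokens, occurrences))
```

Here the three-token match `= strlen (` at the first occurrence outweighs the one-token match `<` at the second, so the sketch proposes `size_t` even though the shorter context alone favors `int`.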

Performance and Evaluation

STRIDE was benchmarked against prominent machine learning models, including transformer-based architectures, on the DIRT, DIRE, and VarCorpus datasets for variable renaming and retyping. On the 'not-in-train' split of the DIRT dataset, it reached 66.4% accuracy for retyping, a 14.1% improvement over DIRTY (a transformer-based model), and 56.2% accuracy for renaming, outperforming previous methods by 4.9%.

STRIDE is also fast. Running on a CPU, it predicts more than five times faster than GPU-accelerated models such as DIRTY and VarBERT, which makes it practical in resource-constrained environments.

Implications and Future Directions

By replacing complex neural architectures with straightforward statistical matching, the paper challenges the prevailing trend toward ever larger and more sophisticated ML models. It shows that simpler, domain-aware strategies can achieve similar, if not superior, results on specific technical tasks.

Future developments stemming from this research could involve refining STRIDE's methodology to incorporate some aspects of neural network models, creating hybrid approaches that leverage the strengths of both paradigms. Additionally, expanding the technique to accommodate more languages and compiled binaries with different characteristics could broaden its applicability.

Lastly, because STRIDE needs no extensive pre-training on GPUs, it is highly accessible and well suited to real-world applications where prediction speed and computational simplicity are paramount.

In summary, STRIDE offers a streamlined, efficient alternative for variable recognition in the domain of reverse engineering of executables, emphasizing that innovation lies as much in revisiting and optimizing classical approaches as it does in pioneering new machine learning paradigms.
