Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models (2312.09601v1)
Abstract: Binary code summarization, while invaluable for understanding code semantics, is challenging due to its labor-intensive nature. This study delves into the potential of LLMs for binary code comprehension. To this end, we present BinSum, a comprehensive benchmark and dataset of over 557K binary functions, and introduce a novel method for prompt synthesis and optimization. To more accurately gauge LLM performance, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. This evaluation consumed 4 billion inference tokens and incurred a total cost of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.
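To illustrate why a semantic similarity metric can outperform exact matching for evaluating generated summaries, the sketch below contrasts the two on a paraphrased pair. It is a simplified stand-in, not the paper's actual metric: the function names are hypothetical, and bag-of-words cosine similarity is used here in place of the sentence-embedding similarity the paper likely computes.

```python
import math
from collections import Counter

def exact_match(ref: str, pred: str) -> bool:
    """Exact-match comparison: brittle, fails on any paraphrase."""
    return ref.strip().lower() == pred.strip().lower()

def cosine_similarity(ref: str, pred: str) -> float:
    """Bag-of-words cosine similarity, a crude proxy for
    embedding-based semantic similarity."""
    a, b = Counter(ref.lower().split()), Counter(pred.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

ref = "copies a string into a fixed-size buffer"
pred = "copies the input string into a buffer of fixed size"

# Exact match rejects the pair even though the two summaries
# describe the same behavior; the similarity score does not.
assert not exact_match(ref, pred)
assert cosine_similarity(ref, pred) > 0.5
```

In practice a pretrained sentence-embedding model (e.g. Sentence-BERT) would replace the bag-of-words vectors, but the evaluation logic is the same: score summaries by vector similarity rather than string equality.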
Authors: Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin