Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models (2312.09601v1)
Abstract: Binary code summarization, while invaluable for understanding code semantics, is challenging due to its labor-intensive nature. This study delves into the potential of LLMs for binary code comprehension. To this end, we present BinSum, a comprehensive benchmark and dataset of over 557K binary functions, and introduce a novel method for prompt synthesis and optimization. To more accurately gauge LLM performance, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. This evaluation consumed 4 billion inference tokens and incurred a total cost of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.
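To illustrate why a semantic similarity metric can outperform exact matching for evaluating generated summaries, the sketch below contrasts the two on a paraphrased pair. It is a simplified stand-in, not the paper's actual metric: the function names are hypothetical, and bag-of-words cosine similarity is used here in place of the sentence-embedding similarity the paper likely computes.

```python
import math
from collections import Counter

def exact_match(ref: str, pred: str) -> bool:
    """Exact-match comparison: brittle, fails on any paraphrase."""
    return ref.strip().lower() == pred.strip().lower()

def cosine_similarity(ref: str, pred: str) -> float:
    """Bag-of-words cosine similarity, a crude proxy for
    embedding-based semantic similarity."""
    a, b = Counter(ref.lower().split()), Counter(pred.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

ref = "copies a string into a fixed-size buffer"
pred = "copies the input string into a buffer of fixed size"

# Exact match rejects the pair even though the two summaries
# describe the same behavior; the similarity score does not.
assert not exact_match(ref, pred)
assert cosine_similarity(ref, pred) > 0.5
```

In practice a pretrained sentence-embedding model (e.g. Sentence-BERT) would replace the bag-of-words vectors, but the evaluation logic is the same: score summaries by vector similarity rather than string equality.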
Authors: Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin