DevEval: Evaluating Code Generation in Practical Software Projects (2401.06401v4)

Published 12 Jan 2024 in cs.SE, cs.AI, and cs.CL

Abstract: How to evaluate LLMs in code generation is an open question. Many benchmarks have been proposed, but they are inconsistent with practical software projects, e.g., unrealistic program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects remain unclear. In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects. DevEval is collected through a rigorous pipeline and contains 2,690 samples from 119 practical projects covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.
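The Pass@1 figure quoted above follows the standard unbiased pass@k estimator introduced by Chen et al. (2021) for code-generation benchmarks; the sketch below is that general formula, not DevEval-specific code, and the sample counts in the usage line are illustrative assumptions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: samples that pass all unit tests
    k: evaluation budget (k <= n)
    Returns the probability that at least one of k samples passes.
    """
    if n - c < k:
        # Fewer failing samples than the budget: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 passing -> pass@1 equals the raw pass rate.
score = pass_at_k(n=10, c=4, k=1)
print(f"Pass@1 = {100 * score:.0f}")  # Pass@1 = 40
```

A benchmark-level Pass@1 (such as the 42 reported for gpt-3.5-turbo) is then the mean of this per-task estimate over all tasks, scaled to a percentage.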

References (23)
  1. Program synthesis with large language models. CoRR, abs/2108.07732.
  2. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  3. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. CoRR, abs/2310.11248.
  4. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. CoRR, abs/2308.01861.
  5. GitHub. 2023. Github copilot. https://github.com/features/copilot.
  6. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
  7. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599.
  8. Skcoder: A sketch-based approach for automatic code generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 2124–2135. IEEE.
  9. Self-planning code generation with large language models. arXiv preprint arXiv:2303.06689.
  10. Starcoder: may the source be with you! CoRR, abs/2305.06161.
  11. Lost in the middle: How language models use long contexts. CoRR, abs/2307.03172.
  12. OpenAI. 2023a. gpt-3.5-turbo. https://platform.openai.com/docs/models/gpt-3-5.
  13. OpenAI. 2023b. GPT-4 technical report. CoRR, abs/2303.08774.
  14. Pyan. 2023. Pyan. https://github.com/davidfraser/pyan.
  15. PyPI. 2023. PyPI. https://pypi.org/.
  16. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
  17. In-context pretraining: Language modeling beyond document boundaries. CoRR, abs/2310.10638.
  18. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 31693–31715. PMLR.
  19. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 476–486. ACM.
  20. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. CoRR, abs/2302.00288.
  21. CERT: continual pre-training on sketches for library-oriented code generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 2369–2375. ijcai.org.
  22. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, Singapore. Association for Computational Linguistics.
  23. Self-edit: Fault-aware code editor for code generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 769–787. Association for Computational Linguistics.
Authors (17)
  1. Jia Li (380 papers)
  2. Ge Li (213 papers)
  3. Yunfei Zhao (13 papers)
  4. Yongmin Li (32 papers)
  5. Zhi Jin (160 papers)
  6. Hao Zhu (212 papers)
  7. Huanyu Liu (15 papers)
  8. Kaibo Liu (17 papers)
  9. Lecheng Wang (8 papers)
  10. Zheng Fang (103 papers)
  11. Lanshen Wang (2 papers)
  12. Jiazheng Ding (5 papers)
  13. Xuanming Zhang (20 papers)
  14. Yihong Dong (35 papers)
  15. Yuqi Zhu (25 papers)
  16. Bin Gu (86 papers)
  17. Mengfei Yang (6 papers)
Citations (7)