EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories (2404.00599v1)
Abstract: How to evaluate LLMs in code generation is an open question. Existing benchmarks are poorly aligned with real-world code repositories and are insufficient for evaluating the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems, with three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark that avoids data leakage; we build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories; for example, gpt-4 achieves the highest Pass@1, at only 20.73%. We also analyze failed cases and summarize the shortcomings of existing LLMs on EvoCodeBench. We release EvoCodeBench, all prompts, and the LLMs' completions for further community analysis.
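The abstract names two evaluation metrics, Pass@k and Recall@k, without spelling out how they are computed. Below is a minimal sketch: Pass@k follows the standard unbiased estimator of Chen et al. (2021), while the Recall@k function is only an assumed illustration (coverage of reference dependencies by the best of k completions); the paper's exact definition and any helper names here (e.g., `recall_at_k`, `reference_deps`) may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that at least
    one of k completions, drawn from n generations of which c pass the tests,
    is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def recall_at_k(generated_deps: list[set[str]], reference_deps: set[str], k: int) -> float:
    """Illustrative (assumed) Recall@k: over the first k completions of one sample,
    take the best coverage of the reference dependencies that the ground-truth
    code relies on."""
    if not reference_deps:
        return 1.0
    best = 0.0
    for deps in generated_deps[:k]:
        best = max(best, len(deps & reference_deps) / len(reference_deps))
    return best

# Toy usage: 10 generations for one sample, 2 of which pass the repository's tests.
print(pass_at_k(n=10, c=2, k=1))
print(recall_at_k([{"utils.load"}, {"utils.load", "db.query"}],
                  {"utils.load", "db.query"}, k=2))
```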
Authors: Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin