EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories (2404.00599v1)
Abstract: How to evaluate LLMs in code generation is an open question. Existing benchmarks are poorly aligned with real-world code repositories and are insufficient for evaluating the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems, with three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark that avoids data leakage; we build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories; for example, gpt-4 achieves the highest Pass@1, at only 20.73%. We also analyze failed cases and summarize the shortcomings of existing LLMs on EvoCodeBench. We release EvoCodeBench, all prompts, and the LLMs' completions for further community analysis.
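The abstract names two evaluation metrics, Pass@k and Recall@k, without spelling out how they are computed. Below is a minimal sketch: Pass@k follows the standard unbiased estimator of Chen et al. (2021), while the Recall@k function is only an assumed illustration (coverage of reference dependencies by the best of k completions); the paper's exact definition and any helper names here (e.g., `recall_at_k`, `reference_deps`) may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that at least
    one of k completions, drawn from n generations of which c pass the tests,
    is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def recall_at_k(generated_deps: list[set[str]], reference_deps: set[str], k: int) -> float:
    """Illustrative (assumed) Recall@k: over the first k completions of one sample,
    take the best coverage of the reference dependencies that the ground-truth
    code relies on."""
    if not reference_deps:
        return 1.0
    best = 0.0
    for deps in generated_deps[:k]:
        best = max(best, len(deps & reference_deps) / len(reference_deps))
    return best

# Toy usage: 10 generations for one sample, 2 of which pass the repository's tests.
print(pass_at_k(n=10, c=2, k=1))
print(recall_at_k([{"utils.load"}, {"utils.load", "db.query"}],
                  {"utils.load", "db.query"}, k=2))
```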
Authors: Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin