
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories (2404.00599v1)

Published 31 Mar 2024 in cs.CL, cs.AI, and cs.SE

Abstract: How to evaluate LLMs in code generation is an open question. Existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems. It has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark that avoids data leakage; we build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories; for example, gpt-4 achieves the highest Pass@1 in our experiments, at only 20.73%. We also analyze failed cases and summarize the shortcomings of existing LLMs on EvoCodeBench. We release EvoCodeBench, all prompts, and the LLMs' completions for further community analysis.
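The Pass@k metric named in the abstract is the standard unbiased estimator from Chen et al. (2021): sample n completions per task, count the c that pass all tests, and estimate the probability that at least one of k drawn completions is correct. The sketch below implements that estimator in Python, together with a hedged reading of Recall@k in which, for each task, the best recall of the reference dependencies over k generated programs is taken; the exact aggregation used by EvoCodeBench may differ, and the function names and the recall_at_k definition here are illustrative, not taken from the paper's released code.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased Pass@k estimator (Chen et al., 2021): probability that at
        # least one of k completions drawn from n samples, c of which pass
        # all tests, is correct.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def recall_at_k(generated_deps: list[set[str]],
                    reference_deps: set[str], k: int) -> float:
        # Hedged sketch of Recall@k: over the first k generated programs,
        # take the best recall of the reference dependencies. Each entry of
        # generated_deps is the dependency set parsed from one completion.
        if not reference_deps:
            return 1.0
        return max(
            (len(deps & reference_deps) / len(reference_deps)
             for deps in generated_deps[:k]),
            default=0.0,
        )

    # Example: 10 samples per task, 2 passing -> Pass@1 = 0.2
    print(pass_at_k(n=10, c=2, k=1))

Under this reading, gpt-4's reported 20.73% Pass@1 means that, on average across tasks, roughly one in five single samples passes the repository's tests.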

Authors (5)
  1. Jia Li (380 papers)
  2. Ge Li (213 papers)
  3. Xuanming Zhang (20 papers)
  4. Yihong Dong (35 papers)
  5. Zhi Jin (160 papers)
Citations (15)