EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations (2410.22821v1)
Abstract: How to evaluate LLMs in code generation remains an open question. Existing benchmarks suffer from two limitations: data leakage and a lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter prevents practitioners from selecting superior LLMs for specific programming domains. To address these limitations, we propose a new benchmark, EvoCodeBench, which offers the following advances: (1) Evolving data. EvoCodeBench is dynamically updated at regular intervals (e.g., every six months) to avoid data leakage. This paper releases the first version, EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on statistics from open-source communities, we design a programming-domain taxonomy consisting of 10 popular domains and annotate each sample in EvoCodeBench with a domain label. (3) Domain-specific evaluations. Besides Pass@k, we compute the Domain-Specific Improvement (DSI) and define each LLM's comfort and strange domains. These evaluations help practitioners select superior LLMs for specific domains and expose the shortcomings of existing LLMs. We evaluate 8 popular LLMs (e.g., gpt-4, DeepSeek Coder) on EvoCodeBench and summarize several insights. EvoCodeBench reveals the actual abilities of these LLMs in real-world repositories; for example, the highest Pass@1 of gpt-4 on EvoCodeBench-2403 is only 20.74%. We also evaluate LLMs across domains and identify their comfort and strange domains: gpt-4 performs best in most domains but falls behind others in the Internet domain, while StarCoder 2-15B unexpectedly performs well in the Database domain, even outperforming 33B LLMs. EvoCodeBench has been released.
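To make the two metrics concrete, here is a minimal Python sketch. The `pass_at_k` function is the standard unbiased Pass@k estimator (Chen et al., 2021), which EvoCodeBench reports; the `domain_specific_improvement` function is a hypothetical illustration only, assuming DSI is the relative change of a model's Pass@k in one domain versus its overall Pass@k. The abstract does not give the paper's exact DSI formula, so that definition is an assumption.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def domain_specific_improvement(domain_pass: float, overall_pass: float) -> float:
    """Hypothetical DSI (assumed form, not taken from the paper): relative
    change of a model's Pass@k in one domain versus its overall Pass@k.
    Positive values would indicate a 'comfort' domain, negative values a
    'strange' domain."""
    return (domain_pass - overall_pass) / overall_pass

# Example: 3 of 20 generations pass, so Pass@1 = 0.15.
print(round(pass_at_k(n=20, c=3, k=1), 4))                 # 0.15
print(round(domain_specific_improvement(0.25, 0.15), 2))   # 0.67 -> comfort domain
```

Under this assumed definition, a DSI of 0 means a domain is exactly as hard for the model as its average task, which makes comfort and strange domains easy to read off the sign.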