DevEval: Evaluating Code Generation in Practical Software Projects (2401.06401v4)

Published 12 Jan 2024 in cs.SE, cs.AI, and cs.CL

Abstract: How to evaluate LLMs in code generation is an open question. Many benchmarks have been proposed, but they are inconsistent with practical software projects, e.g., unrealistic program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects remain unclear. In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects. DevEval is collected through a rigorous pipeline and contains 2,690 samples from 119 practical projects covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.
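The Pass@1 figure quoted above follows the standard unbiased pass@k estimator introduced by Chen et al. (2021) for code-generation benchmarks; the sketch below is that general formula, not DevEval-specific code, and the sample counts in the usage line are illustrative assumptions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: samples that pass all unit tests
    k: evaluation budget (k <= n)
    Returns the probability that at least one of k samples passes.
    """
    if n - c < k:
        # Fewer failing samples than the budget: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 passing -> pass@1 equals the raw pass rate.
score = pass_at_k(n=10, c=4, k=1)
print(f"Pass@1 = {100 * score:.0f}")  # Pass@1 = 40
```

A benchmark-level Pass@1 (such as the 42 reported for gpt-3.5-turbo) is then the mean of this per-task estimate over all tasks, scaled to a percentage.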

References (23)
  1. Program synthesis with large language models. CoRR, abs/2108.07732.
  2. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  3. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. CoRR, abs/2310.11248.
  4. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. CoRR, abs/2308.01861.
  5. GitHub. 2023. Github copilot. https://github.com/features/copilot.
  6. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
  7. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599.
  8. Skcoder: A sketch-based approach for automatic code generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 2124–2135. IEEE.
  9. Self-planning code generation with large language models. arXiv preprint arXiv:2303.06689.
  10. Starcoder: may the source be with you! CoRR, abs/2305.06161.
  11. Lost in the middle: How language models use long contexts. CoRR, abs/2307.03172.
  12. OpenAI. 2023a. gpt-3.5-turbo. https://platform.openai.com/docs/models/gpt-3-5.
  13. OpenAI. 2023b. GPT-4 technical report. CoRR, abs/2303.08774.
  14. Pyan. 2023. Pyan. https://github.com/davidfraser/pyan.
  15. PyPI. 2023. PyPI. https://pypi.org/.
  16. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
  17. In-context pretraining: Language modeling beyond document boundaries. CoRR, abs/2310.10638.
  18. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 31693–31715. PMLR.
  19. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 476–486. ACM.
  20. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. CoRR, abs/2302.00288.
  21. CERT: continual pre-training on sketches for library-oriented code generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 2369–2375. ijcai.org.
  22. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, Singapore. Association for Computational Linguistics.
  23. Self-edit: Fault-aware code editor for code generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 769–787. Association for Computational Linguistics.
Authors (17)
  1. Jia Li (380 papers)
  2. Ge Li (213 papers)
  3. Yunfei Zhao (13 papers)
  4. Yongmin Li (32 papers)
  5. Zhi Jin (160 papers)
  6. Hao Zhu (212 papers)
  7. Huanyu Liu (15 papers)
  8. Kaibo Liu (17 papers)
  9. Lecheng Wang (8 papers)
  10. Zheng Fang (103 papers)
  11. Lanshen Wang (2 papers)
  12. Jiazheng Ding (5 papers)
  13. Xuanming Zhang (20 papers)
  14. Yihong Dong (35 papers)
  15. Yuqi Zhu (25 papers)
  16. Bin Gu (86 papers)
  17. Mengfei Yang (6 papers)
Citations (7)