
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study (2403.08604v3)

Published 13 Mar 2024 in cs.CL and cs.SE

Abstract: Recent advancements in LLMs have significantly enhanced their coding capabilities. However, existing benchmarks have predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, encompassing stages including software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
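
To make the lifecycle-based evaluation concrete, the sketch below shows how a harness might score a single task stage by stage, mirroring the design / environment setup / implementation / acceptance-testing / unit-testing split described in the abstract. The stage names, classes, and the pass criterion are illustrative assumptions for exposition, not the paper's actual DevEval code or metrics.

```python
from dataclasses import dataclass, field

# Hypothetical lifecycle stages mirroring the split described in the abstract;
# names are illustrative, not DevEval's actual API.
STAGES = [
    "software_design",
    "environment_setup",
    "implementation",
    "acceptance_testing",
    "unit_testing",
]

@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str = ""

@dataclass
class TaskReport:
    task_id: str
    results: list[StageResult] = field(default_factory=list)

    @property
    def solved(self) -> bool:
        # A task counts as solved only if every lifecycle stage passes.
        return all(r.passed for r in self.results)

def evaluate_task(task_id: str, model_output: dict[str, str]) -> TaskReport:
    """Score one task by checking the model's artifact for each stage.

    `model_output` maps stage name -> generated artifact (design doc, setup
    script, code, tests, ...). Real stage-specific metrics would go here;
    this sketch only checks that a non-empty artifact was produced.
    """
    report = TaskReport(task_id=task_id)
    for stage in STAGES:
        artifact = model_output.get(stage, "")
        report.results.append(StageResult(stage=stage, passed=bool(artifact.strip())))
    return report

if __name__ == "__main__":
    demo = evaluate_task("calc-001", {"implementation": "def add(a, b): return a + b"})
    for r in demo.results:
        print(f"{r.stage:20s} {'PASS' if r.passed else 'FAIL'}")
    print("solved:", demo.solved)
```

Under this all-stages-must-pass criterion, a model that only produces working code but skips design, setup, or tests would still fail the task, which is consistent with the paper's finding that end-to-end lifecycle coverage is harder than isolated code generation.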

Authors (16)
  1. Bowen Li (166 papers)
  2. Wenhan Wu (8 papers)
  3. Ziwei Tang (3 papers)
  4. Lin Shi (39 papers)
  5. John Yang (22 papers)
  6. Jinyang Li (67 papers)
  7. Shunyu Yao (72 papers)
  8. Chen Qian (226 papers)
  9. Binyuan Hui (57 papers)
  10. Qicheng Zhang (33 papers)
  11. Zhiyin Yu (3 papers)
  12. He Du (4 papers)
  13. Ping Yang (83 papers)
  14. Dahua Lin (336 papers)
  15. Chao Peng (66 papers)
  16. Kai Chen (512 papers)
Citations (9)