Test-Driven Development for Code Generation (2402.13521v2)
Abstract: Recent LLMs have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit by requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates whether and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs such as GPT-4 and Llama 3 with tests in addition to the problem statements enhances code generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to a higher success rate in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.
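To make the workflow concrete, the sketch below illustrates the kind of TDD-style prompting and verification loop the abstract describes: the problem statement and its predefined tests are combined into a single prompt, and the model's output is then executed against those same tests. The task, the `build_tdd_prompt` helper, and the `generate_code` placeholder are illustrative assumptions, not the paper's actual implementation.

```python
"""Minimal sketch of TDD-style prompting and verification for code generation.

Assumptions: `generate_code` is a hypothetical stand-in for a call to an LLM
(e.g., GPT-4 or Llama 3); `task` mimics an MBPP-style entry with assert tests.
"""

task = {
    "prompt": (
        "Write a function is_palindrome(s) that returns True if s reads the "
        "same forwards and backwards."
    ),
    "tests": [
        "assert is_palindrome('level') == True",
        "assert is_palindrome('hello') == False",
        "assert is_palindrome('') == True",
    ],
}


def build_tdd_prompt(task: dict) -> str:
    """Combine the problem statement with its tests, as in TDD-style prompting."""
    test_block = "\n".join(task["tests"])
    return (
        f"{task['prompt']}\n\n"
        f"The solution must pass the following tests:\n{test_block}\n\n"
        "Return only the Python function definition."
    )


def passes_tests(code: str, tests: list[str]) -> bool:
    """Execute the generated code and check it against the predefined tests."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the candidate function
        for test in tests:
            exec(test, namespace)  # each assert raises AssertionError on failure
        return True
    except Exception:
        return False


def generate_code(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; a real version would query a model API."""
    return "def is_palindrome(s):\n    return s == s[::-1]\n"


if __name__ == "__main__":
    candidate = generate_code(build_tdd_prompt(task))
    print("Tests passed:", passes_tests(candidate, task["tests"]))
```

In a full TDD loop, a candidate that fails `passes_tests` could be regenerated or repaired before being accepted, which is the verification step the abstract highlights.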
- Noble Saji Mathews
- Meiyappan Nagappan