PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation (2401.07576v2)
Abstract: Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, since the actual code must already exist before test cases can be generated. In addition, current deep learning-based test case generation approaches are trained with a single learning objective: to generate test cases that exactly match the ground-truth test cases. Such an objective may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that automatically generates syntactically correct, executable, complete, and effective test cases aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our deep RL approach enables PyTester, a small LM, to outperform much larger LLMs such as GPT-3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small LMs rather than large ones for better resource efficiency, by integrating SE domain knowledge into the design of the reinforcement learning architecture.
Authors: Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, Yuan-Fang Li