Chat-like Asserts Prediction with the Support of Large Language Model (2407.21429v1)
Abstract: Unit testing is an essential component of software testing, with assert statements playing a key role in determining whether the function under test behaves as expected. Although prior research has explored automated test case generation, generating meaningful assert statements remains an open challenge. Several studies have investigated assert statement generation for Java, but little work addresses this task in popular dynamically typed languages such as Python. In this paper, we introduce Chat-like execution-based Asserts Prediction (CLAP), a novel LLM-based approach for generating meaningful assert statements for Python projects. CLAP combines persona, Chain-of-Thought, and one-shot learning techniques in its prompt design, and conducts rounds of communication between the LLM and a Python interpreter to generate meaningful assert statements. We also present a dataset of Python assert statements mined from GitHub. Our evaluation shows that CLAP achieves 64.7% accuracy for single assert statement generation and 62% for overall assert statement generation, outperforming existing approaches. We further analyze mismatched assert statements, which may still express the same functionality, and discuss how CLAP could support automated Python unit test generation. The findings indicate that CLAP has the potential to benefit the SE community through more practical usage scenarios.
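The abstract describes a chat-like loop: a prompt built from a persona, Chain-of-Thought instructions, and a one-shot example is sent to the LLM, the returned assert is run by the Python interpreter, and any failure is fed back for another round. The following is a minimal sketch of that loop, not the paper's actual implementation; `build_prompt`, `mock_llm`, `generate_assert`, and the focal function are all hypothetical names, and the real system would call an LLM API instead of the stub used here.

```python
def focal_add(a, b):
    """Illustrative focal function under test."""
    return a + b

def build_prompt(focal_src, feedback=None):
    """Compose a persona + Chain-of-Thought + one-shot prompt,
    appending interpreter feedback on later rounds."""
    prompt = (
        "You are an expert Python test engineer.\n"         # persona
        "Think step by step about the expected output.\n"   # Chain-of-Thought
        "Example: for `def inc(x): return x + 1`, a good assert is "
        "`assert inc(1) == 2`.\n"                           # one-shot example
        f"Now write an assert statement for:\n{focal_src}\n"
    )
    if feedback:
        prompt += f"Previous attempt failed with: {feedback}\nFix it.\n"
    return prompt

def mock_llm(prompt):
    """Stand-in for the real LLM call: returns a wrong assert first,
    then a corrected one once interpreter feedback appears in the prompt."""
    if "failed with" in prompt:
        return "assert focal_add(2, 3) == 5"
    return "assert focal_add(2, 3) == 6"

def generate_assert(focal_src, llm, max_rounds=3):
    """Chat-like rounds: ask the LLM for an assert, execute it with the
    interpreter, and feed any failure back until it passes or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        candidate = llm(build_prompt(focal_src, feedback))
        try:
            exec(candidate, {"focal_add": focal_add})
            return candidate  # assert executed without error
        except AssertionError:
            feedback = "AssertionError"
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
    return None

result = generate_assert("def focal_add(a, b): return a + b", mock_llm)
print(result)
```

In this toy run the first candidate fails under execution, the failure is reported back in the next prompt, and the corrected assert is returned; the paper's pipeline applies the same execute-and-refine idea against real project code.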