Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis (2404.04966v1)
Abstract: Automatic test generation plays a critical role in software quality assurance. While recent advances in Search-Based Software Testing (SBST) and LLMs have shown promise in generating useful tests, these techniques still struggle to cover certain branches. Reaching these hard-to-cover branches usually requires constructing complex objects and resolving intricate inter-procedural dependencies in branch conditions, which poses significant challenges for existing test generation techniques. In this work, we propose TELPA, a novel technique aimed at addressing these challenges. Its key insight is to extract real usage scenarios of the target method under test to learn how to construct complex objects, and to extract the methods involved in the inter-procedural dependencies of hard-to-cover branches to learn the semantics of their branch constraints. To enhance efficiency and effectiveness, TELPA identifies a set of ineffective tests as counter-examples for the LLM and employs a feedback-based process to iteratively refine these counter-examples. TELPA then integrates the program analysis results and counter-examples into the prompt, guiding the LLM to gain a deeper understanding of the target method's semantics and to generate diverse tests that can reach the hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques, achieving average branch coverage improvements of 31.39% and 22.22%, respectively.
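The abstract describes a feedback-driven prompting pipeline: analyze the program, prompt the LLM with the analysis results and counter-examples, execute the generated tests, and feed ineffective ones back. Below is a minimal Python sketch of how such a loop could be orchestrated. It is not the authors' implementation; every helper (`extract_usage_scenarios`, `extract_dependent_methods`, `build_prompt`, `query_llm`, `run_and_split`) is a hypothetical placeholder standing in for TELPA's program-analysis, prompting, and coverage components.

```python
# Illustrative sketch of a TELPA-style feedback loop (not the paper's code).
# The placeholder helpers return trivial values so the sketch runs as-is.

def extract_usage_scenarios(target_method):
    """Placeholder: collect real call sites showing how to build the method's inputs."""
    return []

def extract_dependent_methods(branch):
    """Placeholder: collect methods involved in the branch's inter-procedural constraints."""
    return []

def build_prompt(target_method, branch, usage_scenarios, dependent_methods, counter_examples):
    """Placeholder: assemble analysis results and counter-examples into an LLM prompt."""
    return (f"Cover branch {branch!r} of {target_method!r}. "
            f"Usage: {usage_scenarios} Deps: {dependent_methods} "
            f"Avoid tests like: {counter_examples}")

def query_llm(prompt):
    """Placeholder: ask the LLM for candidate test functions (as source strings)."""
    return []

def run_and_split(candidate_tests, branch):
    """Placeholder: execute candidates; return (branch_covered, ineffective_tests)."""
    return False, candidate_tests

def generate_tests_for_branch(target_method, branch, max_rounds=5):
    """Feedback loop sketched from the abstract: analyze, prompt, execute, refine."""
    usage_scenarios = extract_usage_scenarios(target_method)
    dependent_methods = extract_dependent_methods(branch)
    counter_examples = []  # ineffective tests fed back to steer later rounds
    for _ in range(max_rounds):
        prompt = build_prompt(target_method, branch, usage_scenarios,
                              dependent_methods, counter_examples)
        candidates = query_llm(prompt)
        covered, ineffective = run_and_split(candidates, branch)
        if covered:
            return candidates
        counter_examples.extend(ineffective)
    return []
```

Under these assumptions, the counter-examples accumulate across rounds, so each new prompt explicitly tells the LLM which previously generated tests failed to reach the branch.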
Authors: Chen Yang, Junjie Chen, Bin Lin, Jianyi Zhou, Ziqi Wang