Can Large Language Models Write Good Property-Based Tests? (2307.04346v2)

Published 10 Jul 2023 in cs.SE

Abstract: Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for PBTs. As LLMs have recently shown promise in a variety of coding tasks, we investigate using modern LLMs to automatically synthesize PBTs using two prompting techniques. A key challenge is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to do so considering several properties of the generated tests: (1) validity, (2) soundness, and (3) property coverage, a novel metric that measures the ability of the PBT to detect property violations through generation of property mutants. In our evaluation on 40 Python library API methods across three models (GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on average. We additionally find that our metric for determining soundness of a PBT is aligned with human judgment of property assertions, achieving a precision of 100% and recall of 97%. Finally, we evaluate the property coverage of LLMs across all API methods and find that the best model (GPT-4) is able to automatically synthesize correct PBTs for 21% of properties extractable from API documentation.
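
For readers unfamiliar with the technique, the sketch below illustrates the kind of property-based test the paper asks LLMs to synthesize, written with Python's Hypothesis library. The target API (sorted) and the two properties checked are illustrative assumptions, not drawn from the paper's 40-method benchmark.

```python
# Minimal sketch of a property-based test in the paper's Python/Hypothesis
# setting. The API under test (sorted) and the properties asserted are
# hypothetical examples, not taken from the paper's benchmark.
from collections import Counter

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))  # random input generator: lists of integers
def test_sorted_properties(xs):
    out = sorted(xs)
    # Property 1 (as documentation might state): output is in non-decreasing order.
    assert all(a <= b for a, b in zip(out, out[1:]))
    # Property 2: output is a permutation of the input.
    assert Counter(out) == Counter(xs)
```

Property coverage, as described in the abstract, would then be measured by running such a test against property mutants, i.e. variants of the implementation deliberately altered to violate a documented property (for instance, a version of sorted that silently drops an element); the test covers the property if it fails on the mutant.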
