An LLM-based Readability Measurement for Unit Tests' Context-aware Inputs (2407.21369v2)
Abstract: Automated test generation techniques usually produce unit tests with higher code coverage than manual tests. However, the readability of automated tests is crucial for code comprehension and maintenance. Test readability involves many aspects; in this paper, we focus on test inputs. The central limitation of existing studies on input readability is that they consider the test code alone, without taking the tested source code into account, so they either ignore the different readability requirements of different source code or require manual effort to write readable inputs. We observe, however, that the source code specifies the contexts that test inputs must satisfy. Based on this observation, we introduce the Context Consistency Criterion (a.k.a. C3), a readability measurement tool that leverages LLMs to extract the readability contexts of primitive-type (including string-type) parameters from the source code and checks whether test inputs are consistent with those contexts. We also propose EvoSuiteC3, which leverages C3's extracted contexts to help EvoSuite generate readable test inputs. We evaluated C3's performance on $409$ Java classes and compared the readability of manual and automated tests under C3's measurement. The results are twofold. First, the precision, recall, and F1-score of C3's mined readability contexts are \precision{}, \recall{}, and \fone{}, respectively. Second, under C3's measurement, the string-type input readability scores of EvoSuiteC3, ChatUniTest (an LLM-based test generation tool), manual tests, and two traditional tools (EvoSuite and Randoop) are $90\%$, $83\%$, $68\%$, $8\%$, and $8\%$, respectively, showing the traditional tools' inability to generate readable string-type inputs.
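To make the notion of context consistency concrete, here is a minimal Java sketch of our own (the `UserProfile` class, the `setHomepageUrl` method, and both inputs are hypothetical illustrations, not artifacts from the paper): the parameter name, Javadoc, and validation logic of the method under test all imply that its string parameter should look like a URL, so a context-consistent test input resembles a URL, while a randomly generated string is type-correct yet context-inconsistent, which is the gap a C3-style check is meant to measure.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical class under test. The parameter name, Javadoc, and
// validation logic together signal the "readability context" of `url`:
// callers must pass something that looks like an absolute URL. C3-style
// analysis mines such contexts from the source and checks test inputs
// against them.
class UserProfile {
    private String homepageUrl;

    /** Sets the user's homepage; {@code url} must be an absolute URL. */
    void setHomepageUrl(String url) throws URISyntaxException {
        if (!new URI(url).isAbsolute()) {
            throw new IllegalArgumentException("not an absolute URL: " + url);
        }
        this.homepageUrl = url;
    }
}

public class ContextConsistencyDemo {
    public static void main(String[] args) throws Exception {
        UserProfile profile = new UserProfile();

        // Context-consistent input: matches the mined context "URL",
        // so a C3-style check would score it as readable.
        profile.setHomepageUrl("https://example.com/alice");

        // Context-inconsistent input: type-correct (it is a String) but
        // unreadable relative to the mined context, the kind of value a
        // traditional random or search-based generator tends to produce.
        try {
            profile.setHomepageUrl("x$k!3q");
        } catch (URISyntaxException | IllegalArgumentException e) {
            System.out.println("rejected context-inconsistent input: " + e.getMessage());
        }
    }
}
```

Read against the abstract's numbers, this is why the traditional tools score around $8\%$ on string-type inputs: values like "x$k!3q" satisfy the parameter's type but not its mined context, whereas EvoSuiteC3 and ChatUniTest tend to produce values like the URL above.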
- Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 572–576.
- G. Fraser and A. Arcuri, “Evosuite: Automatic test suite generation for object-oriented software,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, 2011, pp. 416–419.
- C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” in Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, 2007, pp. 815–816.
- Java Platform, “UUID introduction,” 2024. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html
- A. Panichella, F. M. Kifetew, and P. Tonella, “Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets,” IEEE Transactions on Software Engineering, vol. 44, pp. 122–158, 2018.
- S. Wang, N. Shrestha, A. K. Subburaman, J. Wang, M. Wei, and N. Nagappan, “Automatic unit test generation for machine learning libraries: How far are we?” in Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering, 2021, pp. 1548–1560.
- J. Zhang, Y. Lou, L. Zhang, D. Hao, L. Zhang, and H. Mei, “Isomorphic regression testing: executing uncovered branches without test augmentation,” in Proceedings of the 24th ACM SIGSOFT international symposium on foundations of software engineering, 2016, pp. 883–894.
- A. Salahirad, H. Almulla, and G. Gay, “Choosing the fitness function for the job: Automated generation of test suites that detect real faults,” Software Testing, Verification and Reliability, vol. 29, no. 4-5, p. e1701, 2019.
- P. McMinn, M. Stevenson, and M. Harman, “Reducing qualitative human oracle costs associated with automatically generated test data,” in Proceedings of the First International Workshop on Software Test Output Validation, 2010, pp. 1–4.
- S. Afshan, P. McMinn, and M. Stevenson, “Evolving readable string test inputs using a natural language model to reduce human oracle cost,” in Proceedings of the Sixth IEEE International Conference on Software Testing, Verification and Validation, 2013, pp. 352–361.
- E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015.
- E. Daka, J. M. Rojas, and G. Fraser, “Generating unit tests with descriptive names or: Would you name your children thing1 and thing2?” in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2017, pp. 57–67.
- S. Panichella, A. Panichella, M. Beller, A. Zaidman, and H. C. Gall, “The impact of test case summaries on bug fixing performance: An empirical investigation,” in Proceedings of the 38th international conference on software engineering, 2016, pp. 547–558.
- D. Roy, Z. Zhang, M. Ma, V. Arnaoudova, A. Panichella, S. Panichella, D. Gonzalez, and M. Mirakhorli, “Deeptc-enhancer: Improving the readability of automatically generated tests,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 287–298.
- B. Li, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, and N. A. Kraft, “Automatically documenting unit test cases,” in 2016 IEEE international conference on software testing, verification and validation, 2016, pp. 341–352.
- G. Gay, “Improving the readability of generated tests using gpt-4 and chatgpt code interpreter,” in International Symposium on Search Based Software Engineering. Springer, 2023, pp. 140–146.
- E. Daka, J. Campos, G. Fraser, J. Dorn, and W. Weimer, “Modeling readability to improve unit tests,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 107–118.
- A. Alsharif, G. M. Kapfhammer, and P. McMinn, “What factors make sql test cases understandable for testers? a human study of automated test data generation techniques,” in Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution, 2019, pp. 437–448.
- M. M. Almasi, H. Hemmati, G. Fraser, A. Arcuri, and J. Benefelds, “An industrial evaluation of unit test generation: Finding real faults in a financial application,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice Track, 2017, pp. 263–272.
- D. Winkler, P. Urbanke, and R. Ramler, “Investigating the readability of test code,” Empirical Software Engineering, vol. 29, no. 2, p. 53, 2024.
- Online, “A constructor method from airsonic: A free, web-based media streamer,” 2024. [Online]. Available: https://tinyurl.com/5d2fnc4w
- A. Deljouyi and A. Zaidman, “Generating understandable unit tests through end-to-end test scenario carving,” in 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation. IEEE, 2023, pp. 107–118.
- M. Leotta, H. Z. Yousaf, F. Ricca, and B. Garcia, “Ai-generated test scripts for web e2e testing with chatgpt and copilot: A preliminary study,” in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 339–344.
- C. Augusto, “Efficient test execution in end to end testing: Resource optimization in end to end testing through a smart resource characterization and orchestration,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings, 2020, pp. 152–154.
- G. Fraser and A. Arcuri, “Whole test suite generation,” IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 276–291, 2013.
- F. E. Allen and J. Cocke, “A program data flow analysis procedure,” Communications of the ACM, vol. 19, no. 3, p. 137, 1976.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” in Proceedings of the IEEE/ACM 29th International Conference on Software Engineering, 2007, pp. 75–84.
- C. Pacheco and M. D. Ernst, “Eclat: Automatic generation and classification of test inputs,” in Proceedings of the 19th European Conference on Object-Oriented Programming. Springer, 2005, pp. 504–527.
- J. H. Andrews, T. Menzies, and F. C. Li, “Genetic algorithms for randomized unit testing,” IEEE Transactions on software engineering, vol. 37, no. 1, pp. 80–94, 2011.
- K. Havelund and T. Pressburger, “Model checking java programs using java pathfinder,” International Journal on Software Tools for Technology Transfer, vol. 2, pp. 366–381, 2000.
- P. Braione, G. Denaro, and M. Pezzè, “Jbse: A symbolic executor for java programs with complex heap inputs,” in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016, pp. 1018–1022.
- P. Tonella, “Evolutionary testing of classes,” in Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis, 2004, pp. 119–128.
- A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulating branch coverage as a many-objective optimization problem,” in Proceedings of the 8th IEEE international conference on software testing, verification and validation, 2015, pp. 1–10.
- Y. Lin, Y. S. Ong, J. Sun, G. Fraser, and J. S. Dong, “Graph-based seed object synthesis for search-based unit testing,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1068–1080.
- R. P. Buse and W. R. Weimer, “Learning a metric for code readability,” IEEE Transactions on software engineering, vol. 36, no. 4, pp. 546–558, 2009.
- G. Grano, S. Scalabrino, H. C. Gall, and R. Oliveto, “An empirical investigation on the readability of manual and generated test cases,” in Proceedings of the 26th Conference on Program Comprehension, 2018, pp. 348–351.
- S. Scalabrino, M. Linares-Vásquez, D. Poshyvanyk, and R. Oliveto, “Improving code readability models with textual features,” in 2016 IEEE 24th International Conference on Program Comprehension, 2016, pp. 1–10.
- C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit,” in Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
- Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “Chatgpt vs sbst: A comparative assessment of unit test suite generation,” IEEE Transactions on Software Engineering, 2024.
- C. E. C. Dantas and M. A. Maia, “Readability and understandability scores for snippet assessment: an exploratory study,” arXiv preprint arXiv:2108.09181, 2021.
- C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in Proceedings of the 45th IEEE/ACM International Conference on Software Engineering, 2023.
- M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,” arXiv preprint arXiv:2009.05617, 2020.
- M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020, pp. 7871–7880.
- J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software testing with large language models: Survey, landscape, and vision,” IEEE Transactions on Software Engineering, 2024.
- Z. Yuan, M. Liu, S. Ding, K. Wang, Y. Chen, X. Peng, and Y. Lou, “Evaluating and improving chatgpt for unit test generation,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1703–1726, 2024.
- Online, “Javaparser: A java 1.0 - java 18 parser,” 2024. [Online]. Available: http://tinyurl.com/5n7h2pnb
- E. T. K. Sang and F. De Meulder, “Introduction to the conll-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
- A. Ritter, S. Clark, O. Etzioni et al., “Named entity recognition in tweets: an experimental study,” in Proceedings of the 2011 conference on empirical methods in natural language processing, 2011, pp. 1524–1534.
- S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang, “Gpt-ner: Named entity recognition via large language models,” arXiv preprint arXiv:2304.10428, 2023.
- T. Ahmed and P. Devanbu, “Few-shot training llms for project-specific code-summarization,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–5.
- ——, “Better patching using llm prompting, via self-consistency,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, 2023, pp. 1742–1746.
- X. Liu, D. McDuff, G. Kovacs, I. Galatzer-Levy, J. Sunshine, J. Zhan, M.-Z. Poh, S. Liao, P. Di Achille, and S. Patel, “Large language models are few-shot health learners,” arXiv preprint arXiv:2305.15525, 2023.
- J. M. Rojas, J. Campos, M. Vivanti, G. Fraser, and A. Arcuri, “Combining multiple coverage criteria in search-based unit test generation,” in Search-Based Software Engineering, M. Barros and Y. Labiche, Eds., 2015, pp. 93–108.
- G. Fraser and A. Arcuri, “Achieving scalable mutation-based generation of whole test suites,” Empirical Software Engineering, vol. 20, no. 3, pp. 783–812, 2015.
- W. E. Winkler, “String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage,” 1990.
- J. Campos, Y. Ge, N. Albunian, G. Fraser, M. Eler, and A. Arcuri, “An empirical evaluation of evolutionary algorithms for unit test suite generation,” Information and Software Technology, vol. 104, pp. 207–235, 2018.
- Online, “Airsonic: A free, web-based media streamer,” 2024. [Online]. Available: https://tinyurl.com/4ye4xsmn
- ——, “Broadleafcommerce: An e-commerce framework,” 2024. [Online]. Available: https://tinyurl.com/3j2345sb
- ——, “Googlemapservice: Client for google maps,” 2024. [Online]. Available: http://tinyurl.com/y99bzypc
- ——, “Openmrs: A patient-based medical record system,” 2024. [Online]. Available: https://tinyurl.com/mrxr69s2
- ——, “Openrefine: A data manipulator,” 2024. [Online]. Available: https://tinyurl.com/4xev45tz
- ——, “Petclinic: A spring sample application,” 2024. [Online]. Available: https://tinyurl.com/5d9amebv
- ——, “Shopizer: An e-commerce software,” 2024. [Online]. Available: https://tinyurl.com/33z3sdvs
- OpenAI, “Gpt series,” 2024. [Online]. Available: https://tinyurl.com/vcfzar2f
- D. Wang, S. Li, G. Xiao, Y. Liu, and Y. Sui, “An exploratory study of autopilot software bugs in unmanned aerial vehicles,” in Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2021, pp. 20–31.
- D. M. W. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,” arXiv preprint arXiv:2010.16061, 2020.
- J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit.” Psychological bulletin, vol. 70, no. 4, p. 213, 1968.
- J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
- M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012.