
An LLM-based Readability Measurement for Unit Tests' Context-aware Inputs (2407.21369v2)

Published 31 Jul 2024 in cs.SE

Abstract: Automated test techniques usually generate unit tests with higher code coverage than manual tests. However, the readability of automated tests is crucial for code comprehension and maintenance. Test readability involves many aspects; this paper focuses on test inputs. The central limitation of existing studies on input readability is that they consider the test code alone, without taking the source code under test into account, so they either ignore the differing readability requirements of different source code or require manual effort to write readable inputs. We observe, however, that the source code specifies the contexts that test inputs must satisfy. Based on this observation, we introduce the Context Consistency Criterion (a.k.a. C3), a readability measurement tool that leverages LLMs to extract readability contexts for primitive-type (including string-type) parameters from the source code and to check whether test inputs are consistent with those contexts. We also propose EvoSuiteC3, which leverages C3's extracted contexts to help EvoSuite generate readable test inputs. We evaluated C3's performance on 409 Java classes and compared the readability of manual and automated tests under the C3 measurement. The results are two-fold. First, the precision, recall, and F1-score of C3's mined readability contexts are \precision{}, \recall{}, and \fone{}, respectively. Second, under C3's measurement, the string-type input readability scores of EvoSuiteC3, ChatUniTest (an LLM-based test generation tool), manual tests, and two traditional tools (EvoSuite and Randoop) are 90%, 83%, 68%, 8%, and 8%, respectively, showing the traditional tools' inability to generate readable string-type inputs.
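To make the criterion concrete: C3 mines a "readability context" for each primitive- or string-type parameter (e.g., a `String address` parameter whose surrounding code implies it holds an email address) and then judges whether a test input satisfies that context. The following is a minimal, hypothetical sketch of such a consistency check. C3 itself uses an LLM for both the mining and the checking; here the mined contexts and their validators are hard-coded regexes purely for illustration, and all names are assumptions, not the paper's API.

```python
import re

# Hypothetical mapping from a mined readability context to a validator.
# In C3 the contexts come from an LLM reading the source code; the regex
# stand-ins below are a simplification for illustration only.
CONTEXT_VALIDATORS = {
    "email address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "date (yyyy-MM-dd)": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def is_context_consistent(context: str, test_input: str) -> bool:
    """Return True if the test input satisfies the mined context."""
    pattern = CONTEXT_VALIDATORS.get(context)
    return bool(pattern and pattern.match(test_input))

# A readable, context-consistent input vs. a typical randomly generated
# string for a parameter whose mined context is "email address".
print(is_context_consistent("email address", "alice@example.com"))  # True
print(is_context_consistent("email address", "x$9!Qz"))             # False
```

Under this view, the paper's headline numbers say that EvoSuiteC3 and ChatUniTest produce string inputs that pass such context checks far more often than the random-style strings emitted by EvoSuite and Randoop.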

References (70)
  1. Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 572–576.
  2. G. Fraser and A. Arcuri, “Evosuite: Automatic test suite generation for object-oriented software,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 416–419.
  3. C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” in Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, 2007, pp. 815–816.
  4. Java Platform, “Uuid introduction,” 2024. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html
  5. A. Panichella, F. M. Kifetew, and P. Tonella, “Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets,” IEEE Transactions on Software Engineering, vol. 44, pp. 122–158, 2018.
  6. S. Wang, N. Shrestha, A. K. Subburaman, J. Wang, M. Wei, and N. Nagappan, “Automatic unit test generation for machine learning libraries: How far are we?” in Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering, 2021, pp. 1548–1560.
  7. J. Zhang, Y. Lou, L. Zhang, D. Hao, L. Zhang, and H. Mei, “Isomorphic regression testing: executing uncovered branches without test augmentation,” in Proceedings of the 24th ACM SIGSOFT international symposium on foundations of software engineering, 2016, pp. 883–894.
  8. A. Salahirad, H. Almulla, and G. Gay, “Choosing the fitness function for the job: Automated generation of test suites that detect real faults,” Software Testing, Verification and Reliability, vol. 29, no. 4-5, p. e1701, 2019.
  9. P. McMinn, M. Stevenson, and M. Harman, “Reducing qualitative human oracle costs associated with automatically generated test data,” in Proceedings of the First International Workshop on Software Test Output Validation, 2010, pp. 1–4.
  10. S. Afshan, P. McMinn, and M. Stevenson, “Evolving readable string test inputs using a natural language model to reduce human oracle cost,” in Proceedings of the Sixth IEEE International Conference on Software Testing, Verification and Validation, 2013, pp. 352–361.
  11. E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2014.
  12. E. Daka, J. M. Rojas, and G. Fraser, “Generating unit tests with descriptive names or: Would you name your children thing1 and thing2?” in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2017, pp. 57–67.
  13. S. Panichella, A. Panichella, M. Beller, A. Zaidman, and H. C. Gall, “The impact of test case summaries on bug fixing performance: An empirical investigation,” in Proceedings of the 38th international conference on software engineering, 2016, pp. 547–558.
  14. D. Roy, Z. Zhang, M. Ma, V. Arnaoudova, A. Panichella, S. Panichella, D. Gonzalez, and M. Mirakhorli, “Deeptc-enhancer: Improving the readability of automatically generated tests,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 287–298.
  15. B. Li, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, and N. A. Kraft, “Automatically documenting unit test cases,” in 2016 IEEE international conference on software testing, verification and validation, 2016, pp. 341–352.
  16. G. Gay, “Improving the readability of generated tests using gpt-4 and chatgpt code interpreter,” in International Symposium on Search Based Software Engineering.   Springer, 2023, pp. 140–146.
  17. E. Daka, J. Campos, G. Fraser, J. Dorn, and W. Weimer, “Modeling readability to improve unit tests,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 107–118.
  18. A. Alsharif, G. M. Kapfhammer, and P. McMinn, “What factors make sql test cases understandable for testers? a human study of automated test data generation techniques,” in Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution, 2019, pp. 437–448.
  19. M. M. Almasi, H. Hemmati, G. Fraser, A. Arcuri, and J. Benefelds, “An industrial evaluation of unit test generation: Finding real faults in a financial application,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice Track, 2017, pp. 263–272.
  20. D. Winkler, P. Urbanke, and R. Ramler, “Investigating the readability of test code,” Empirical Software Engineering, vol. 29, no. 2, p. 53, 2024.
  21. Online, “A constructor method from airsonic: A free, web-based media streamer,” 2024. [Online]. Available: https://tinyurl.com/5d2fnc4w
  22. A. Deljouyi and A. Zaidman, “Generating understandable unit tests through end-to-end test scenario carving,” in 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation.   IEEE, 2023, pp. 107–118.
  23. M. Leotta, H. Z. Yousaf, F. Ricca, and B. Garcia, “Ai-generated test scripts for web e2e testing with chatgpt and copilot: A preliminary study,” in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 339–344.
  24. C. Augusto, “Efficient test execution in end to end testing: Resource optimization in end to end testing through a smart resource characterization and orchestration,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings, 2020, pp. 152–154.
  25. G. Fraser and A. Arcuri, “Whole test suite generation,” IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 276–291, 2013.
  26. F. E. Allen and J. Cocke, “A program data flow analysis procedure,” Communications of the ACM, vol. 19, no. 3, p. 137, 1976.
  27. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  28. C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” in Proceedings of the IEEE/ACM 29th International Conference on Software Engineering, 2007, pp. 75–84.
  29. C. Pacheco and M. D. Ernst, “Eclat: Automatic generation and classification of test inputs,” in Proceedings of the 19th European Conference on Object-Oriented Programming.   Springer, 2005, pp. 504–527.
  30. J. H. Andrews, T. Menzies, and F. C. Li, “Genetic algorithms for randomized unit testing,” IEEE Transactions on software engineering, vol. 37, no. 1, pp. 80–94, 2011.
  31. K. Havelund and T. Pressburger, “Model checking java programs using java pathfinder,” International Journal on Software Tools for Technology Transfer, vol. 2, pp. 366–381, 2000.
  32. P. Braione, G. Denaro, and M. Pezzè, “Jbse: A symbolic executor for java programs with complex heap inputs,” in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016, pp. 1018–1022.
  33. P. Tonella, “Evolutionary testing of classes,” in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, 2004, pp. 119–128.
  34. A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulating branch coverage as a many-objective optimization problem,” in Proceedings of the 8th IEEE international conference on software testing, verification and validation, 2015, pp. 1–10.
  35. Y. Lin, Y. S. Ong, J. Sun, G. Fraser, and J. S. Dong, “Graph-based seed object synthesis for search-based unit testing,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1068–1080.
  36. R. P. Buse and W. R. Weimer, “Learning a metric for code readability,” IEEE Transactions on software engineering, vol. 36, no. 4, pp. 546–558, 2009.
  37. G. Grano, S. Scalabrino, H. C. Gall, and R. Oliveto, “An empirical investigation on the readability of manual and generated test cases,” in Proceedings of the 26th Conference on Program Comprehension, 2018, pp. 348–351.
  38. S. Scalabrino, M. Linares-Vásquez, D. Poshyvanyk, and R. Oliveto, “Improving code readability models with textual features,” in 2016 IEEE 24th International Conference on Program Comprehension, 2016, pp. 1–10.
  39. C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit,” in Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
  40. Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “Chatgpt vs sbst: A comparative assessment of unit test suite generation,” IEEE Transactions on Software Engineering, 2024.
  41. C. E. C. Dantas and M. A. Maia, “Readability and understandability scores for snippet assessment: an exploratory study,” arXiv preprint arXiv:2108.09181, 2021.
  42. C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in Proceedings of the 45th IEEE/ACM International Conference on Software Engineering, 2023.
  43. M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,” arXiv preprint arXiv:2009.05617, 2020.
  44. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.   Online: ACL, 2020, pp. 7871–7880.
  45. J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software testing with large language models: Survey, landscape, and vision,” IEEE Transactions on Software Engineering, 2024.
  46. Z. Yuan, M. Liu, S. Ding, K. Wang, Y. Chen, X. Peng, and Y. Lou, “Evaluating and improving chatgpt for unit test generation,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1703–1726, 2024.
  47. Online, “Javaparser: A java 1.0 - java 18 parser,” 2024. [Online]. Available: http://tinyurl.com/5n7h2pnb
  48. E. T. K. Sang and F. De Meulder, “Introduction to the conll-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
  49. A. Ritter, S. Clark, O. Etzioni et al., “Named entity recognition in tweets: an experimental study,” in Proceedings of the 2011 conference on empirical methods in natural language processing, 2011, pp. 1524–1534.
  50. S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang, “Gpt-ner: Named entity recognition via large language models,” arXiv preprint arXiv:2304.10428, 2023.
  51. T. Ahmed and P. Devanbu, “Few-shot training llms for project-specific code-summarization,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–5.
  52. ——, “Better patching using llm prompting, via self-consistency,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, 2023, pp. 1742–1746.
  53. X. Liu, D. McDuff, G. Kovacs, I. Galatzer-Levy, J. Sunshine, J. Zhan, M.-Z. Poh, S. Liao, P. Di Achille, and S. Patel, “Large language models are few-shot health learners,” arXiv preprint arXiv:2305.15525, 2023.
  54. J. M. Rojas, J. Campos, M. Vivanti, G. Fraser, and A. Arcuri, “Combining multiple coverage criteria in search-based unit test generation,” in Search-Based Software Engineering, M. Barros and Y. Labiche, Eds., 2015, pp. 93–108.
  55. G. Fraser and A. Arcuri, “Achieving scalable mutation-based generation of whole test suites,” Empirical Software Engineering, vol. 20, no. 3, pp. 783–812, 2015.
  56. W. E. Winkler, “String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage.” 1990.
  57. J. Campos, Y. Ge, N. Albunian, G. Fraser, M. Eler, and A. Arcuri, “An empirical evaluation of evolutionary algorithms for unit test suite generation,” Information and Software Technology, vol. 104, pp. 207–235, 2018.
  58. Online, “Airsonic: A free, web-based media streamer,” 2024. [Online]. Available: https://tinyurl.com/4ye4xsmn
  59. ——, “Broadleafcommerce: An e-commerce framework,” 2024. [Online]. Available: https://tinyurl.com/3j2345sb
  60. ——, “Googlemapservice: Client for google maps,” 2024. [Online]. Available: http://tinyurl.com/y99bzypc
  61. ——, “Openmrs: A patient-based medical record system,” 2024. [Online]. Available: https://tinyurl.com/mrxr69s2
  62. ——, “Openrefine: A data manipulator,” 2024. [Online]. Available: https://tinyurl.com/4xev45tz
  63. ——, “Petclinic: A spring sample application,” 2024. [Online]. Available: https://tinyurl.com/5d9amebv
  64. ——, “Shopizer: An e-commerce software,” 2024. [Online]. Available: https://tinyurl.com/33z3sdvs
  65. OpenAI, “Gpt series,” 2024. [Online]. Available: https://tinyurl.com/vcfzar2f
  66. D. Wang, S. Li, G. Xiao, Y. Liu, and Y. Sui, “An exploratory study of autopilot software bugs in unmanned aerial vehicles,” in Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2021, pp. 20–31.
  67. D. M. W. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,” arXiv preprint arXiv:2010.16061, 2020.
  68. J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit.” Psychological bulletin, vol. 70, no. 4, p. 213, 1968.
  69. J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
  70. M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012.
