
TOGLL: Correct and Strong Test Oracle Generation with LLMs (2405.03786v2)

Published 6 May 2024 in cs.SE

Abstract: Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural-based methods for automated test oracle generation often produce large numbers of false positives and weak test oracles. While LLMs have demonstrated impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation. The question of whether LLMs can address the challenges of effective oracle generation is both compelling and in need of thorough investigation. In this research, we present the first comprehensive study of the capabilities of LLMs in generating correct, diverse, and strong test oracles that can effectively identify a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on the SF110 dataset. Using the most effective fine-tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 large-scale Java projects. Besides correctness, we also assess the diversity and strength of the generated oracles, comparing the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL produces 3.8 times more correct assertion oracles and 4.9 times more exception oracles than TOGA. Moreover, TOGLL generates significantly more diverse test oracles: it detects 1,023 unique bugs that EvoSuite cannot, ten times more than the previous SOTA neural-based method, TOGA, can detect.
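
To make the two oracle kinds in this study concrete, here is a minimal JUnit 4 sketch that uses a plain JDK ArrayDeque as a stand-in subject (the class under test and the test names are illustrative, not drawn from the paper's benchmarks or from TOGLL's output): an assertion oracle encodes an expected program state after a call, while an exception oracle encodes an expected failure for an invalid call.

```java
import java.util.ArrayDeque;
import java.util.NoSuchElementException;

import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Illustrative sketch only: ArrayDeque stands in for an arbitrary
// class under test; neither test was generated by TOGLL.
public class OracleKindsTest {

    // Assertion oracle: the test passes only if the asserted
    // postconditions hold after the call under test.
    @Test
    public void pushIncreasesSizeAndExposesTop() {
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(42);
        assertEquals(1, stack.size());          // assertion oracle
        assertEquals(42, (int) stack.peek());   // assertion oracle
    }

    // Exception oracle: the test passes only if the expected
    // exception is raised by the invalid call.
    @Test(expected = NoSuchElementException.class)
    public void popOnEmptyDequeThrows() {
        new ArrayDeque<Integer>().pop();        // exception oracle
    }
}
```

A false positive in this setting is an oracle that fails on correct code (e.g., asserting the wrong expected value); a strong oracle is one that fails when a bug is introduced, which the paper measures via unique bugs detected.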

References (50)
  1. Is mutation an appropriate tool for testing experiments? In Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), pages 402–411, 2005.
  2. Nessie: Automatically testing JavaScript APIs with asynchronous callbacks. In Proceedings of the 44th International Conference on Software Engineering, pages 1494–1505, 2022.
  3. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015.
  4. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
  5. Translating code comments to procedure specifications. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018, pages 242–253, New York, NY, USA, 2018. Association for Computing Machinery.
  6. MeMo: Automatically identifying metamorphic relations in Javadoc comments for test automation. Journal of Systems and Software, 181:111041, 2021.
  7. CodeParrot. https://huggingface.co/codeparrot/codeparrot-small-multi.
  8. PIT: A practical mutation testing tool for Java. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pages 449–452, 2016.
  9. Fixing Rust compilation errors using LLMs. arXiv preprint arXiv:2308.05177, 2023.
  10. TOGA: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, pages 2130–2141, New York, NY, USA, 2022. Association for Computing Machinery.
  11. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
  12. G. Fraser and A. Arcuri. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 416–419, 2011.
  13. G. Fraser and A. Arcuri. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Trans. Softw. Eng. Methodol., 24(2), Dec. 2014.
  14. Automatic generation of oracles for exceptional behaviors. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, pages 213–224, New York, NY, USA, 2016. Association for Computing Machinery.
  15. Measuring and mitigating gaps in structural testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1712–1723. IEEE, 2023.
  16. Neural-based test oracle generation: A large-scale evaluation and lessons learned. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pages 120–132, New York, NY, USA, 2023. Association for Computing Machinery.
  17. Large language models for software engineering: A systematic literature review, 2024.
  18. Test oracle assessment and improvement. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pages 247–258, 2016.
  19. InferFix: End-to-end program repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1646–1656, 2023.
  20. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 437–440, 2014.
  21. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 654–665, New York, NY, USA, 2014. Association for Computing Machinery.
  22. How effective are mutation testing tools? An empirical analysis of Java mutation testing tools with manual analysis and real faults. Empirical Software Engineering, 23(4):2426–2463, 2018.
  23. Assessing and improving the mutation testing practice of PIT. In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), pages 430–435, 2017.
  24. CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 919–931. IEEE, 2023.
  25. CCTEST: Testing and repairing code completion systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1238–1250. IEEE, 2023.
  26. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH Companion 2017, pages 55–56, New York, NY, USA, 2017. Association for Computing Machinery.
  27. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
  28. The Art of Software Testing. John Wiley & Sons, 2011.
  29. Using an LLM to help with code understanding. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), pages 881–881. IEEE Computer Society, 2024.
  30. Hacker News. Twitter outage report, 2016. https://news.ycombinator.com/item?id=8810157.
  31. CodeGen: An open large language model for code with multi-turn program synthesis, 2023.
  32. C. Pacheco and M. D. Ernst. Randoop: Feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion, pages 815–816, 2007.
  33. Inferring method specifications from natural language API descriptions. In 2012 34th International Conference on Software Engineering (ICSE), pages 815–825, 2012.
  34. Does mutation testing improve testing practices? In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 910–921. IEEE, 2021.
  35. Practical mutation testing at scale: A view from Google. IEEE Transactions on Software Engineering, 2021.
  36. Phi-1. https://huggingface.co/microsoft/phi-1.
  37. A. M. Porrello. Death and denial: The failure of the Therac-25, a medical linear accelerator. 2012.
  38. Apache Commons Proper – a repository of reusable Java components, 2022. https://commons.apache.org/components.html. Last accessed 2022-10-11.
  39. Experimental comparison of automated mutation testing tools for Java. In 2015 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), pages 1–6, 2015.
  40. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  41. D. Schuler and A. Zeller. Checked coverage: An indicator for oracle quality. Software Testing, Verification and Reliability, 23(7):531–551, 2013.
  42. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105, 2024.
  43. Exploring the effectiveness of large language models in generating unit tests. arXiv preprint arXiv:2305.00418, 2023.
  44. Synopsys Editorial Team. Coverity report on the ‘goto fail’ bug. Blog post, Synopsys, Mountain View, CA, Feb. 25, 2014; http://security.coverity.com/blog/2014/Feb/a-quick-post-on-apple-security-55471-aka-goto-fail.html.
  45. @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, pages 260–269, 2012.
  46. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617, 2020.
  47. Generating accurate assert statements for unit test cases using pretrained transformers. In Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test, pages 54–64, 2022.
  48. On learning meaningful assert statements for unit test cases. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 1398–1409, 2020.
  49. A systematic evaluation of large language models of code, 2022.
  50. Y. Zhang and A. Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 214–224, 2015.
Authors (2)
  1. Soneya Binta Hossain
  2. Matthew Dwyer