
LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine (2404.03664v2)

Published 16 Feb 2024 in cs.SE and cs.AI

Abstract: The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e., data capture, data curation, and producing data products and statistics for various stakeholders. GURI, a core component of CaReSS, is responsible for validating incoming data against medical rules. These medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since LLMs have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs in generating tests, as well as the ability of these tests to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules for which implementation inconsistencies were discovered (e.g., in the handling of rule versions). Finally, we provide insights for practitioners and researchers based on the results.
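
As a rough illustration of the differential testing idea mentioned in the abstract, the Python sketch below runs the same test inputs through two implementations of one rule and collects the inputs on which they disagree. Everything in it is an assumption made for illustration: the function names, the "PASS"/"FAIL" verdicts, the rule identifier "R1", and the toy age rule are hypothetical and are not taken from the paper, GURI, or LLMeDiff. The hand-written test list stands in for tests that an LLM would generate.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a test case is a dict of field values, and an engine
# maps (rule_id, test_case) to a verdict string such as "PASS" or "FAIL".
TestCase = Dict[str, object]
Engine = Callable[[str, TestCase], str]


def differential_test(
    rule_id: str,
    tests: List[TestCase],
    engine_a: Engine,
    engine_b: Engine,
) -> List[Tuple[TestCase, str, str]]:
    """Return every test on which the two engines give different verdicts."""
    mismatches = []
    for test in tests:
        verdict_a = engine_a(rule_id, test)
        verdict_b = engine_b(rule_id, test)
        if verdict_a != verdict_b:
            mismatches.append((test, verdict_a, verdict_b))
    return mismatches


# Toy stand-ins for two implementations of one made-up rule:
# "age must be between 0 and 120, inclusive".
def reference_engine(rule_id: str, case: TestCase) -> str:
    return "PASS" if 0 <= case["age"] <= 120 else "FAIL"


def engine_under_test(rule_id: str, case: TestCase) -> str:
    # Deliberately excludes the upper bound so the engines can disagree.
    return "PASS" if 0 <= case["age"] < 120 else "FAIL"


# Hand-written inputs standing in for LLM-generated tests.
tests = [{"age": 30}, {"age": 120}, {"age": -1}]
mismatches = differential_test("R1", tests, reference_engine, engine_under_test)
print(mismatches)
```

A disagreement does not say which implementation is wrong; it only flags a rule and an input that merit manual inspection.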
