Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs (2401.05940v1)

Published 11 Jan 2024 in cs.SE and cs.AI

Abstract: LLMs have shown remarkable capabilities in processing both natural and programming languages, which have enabled various applications in software engineering, such as requirements engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for the subtle inconsistencies that may arise between code and its semantics as described in natural language. In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its description, by introducing code mutations into existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs, and then use these pairs to test the ability of LLMs to correctly detect the inconsistencies. We call this LLM testing method Mutation-based Consistency Testing (MCT) and conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark HumanEval-X, which covers six programming languages (Python, C++, Java, Go, JavaScript, and Rust). Comparing the LLMs' performance across mutation types and languages, we find that they show significant variation in code understanding performance and have different strengths and weaknesses depending on the mutation type and language.
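
The mutation types named in the abstract (e.g., operator replacement) can be applied mechanically to benchmark code. Below is a minimal sketch of how such a mutant could be produced for a Python HumanEval-style problem using an AST transformer; the `add` function, the `OperatorReplacer` helper, and the replace-only-the-first-operator policy are illustrative assumptions, not the authors' actual tooling.

```python
import ast

class OperatorReplacer(ast.NodeTransformer):
    """Replace the first '+' binary operator with '-', altering the code's
    semantics while leaving the natural language docstring untouched."""
    def __init__(self):
        self.replaced = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.replaced and isinstance(node.op, ast.Add):
            node.op = ast.Sub()  # operator replacement mutation
            self.replaced = True
        return node

# A HumanEval-style problem: the docstring states the intended semantics.
original = '''
def add(x: int, y: int) -> int:
    """Return the sum of x and y."""
    return x + y
'''

mutant = ast.unparse(OperatorReplacer().visit(ast.parse(original)))
print(mutant)
# The body now reads `return x - y`, contradicting "Return the sum":
# an inconsistent code-description pair of the kind MCT feeds to the LLM.
```

The mutated function paired with its unchanged docstring is exactly the kind of subtle code-description mismatch that MCT asks the LLM to flag.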

Authors (2)
  1. Ziyu Li (34 papers)
  2. Donghwan Shin (21 papers)
Citations (8)