Breaking the Silence: the Threats of Using LLMs in Software Engineering (2312.08055v2)

Published 13 Dec 2023 in cs.SE and cs.LG

Abstract: LLMs have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful, as numerous intricate factors can influence the outcomes of experiments involving LLMs. This paper initiates an open discussion on potential threats to the validity of LLM-based research, including issues such as closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings. In response, this paper proposes a set of guidelines tailored for SE researchers and LLM providers to mitigate these concerns. The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation.
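The reproducibility concern raised in the abstract (non-deterministic outputs and closed-source models that can change silently) lends itself to a concrete illustration. The Python sketch below shows one way an SE researcher could log a test case generation experiment so it can be audited and repeated: pin the exact model identifier, fix the sampling temperature, repeat each query several times, and archive every prompt and output. This is a minimal sketch, not the paper's tooling; the `query_llm` function, its signature, and the file name are hypothetical placeholders rather than any provider's real API.

```python
import hashlib
import json

# Hypothetical wrapper around whichever LLM endpoint is under study.
# The name `query_llm` and its signature are placeholders, not any
# provider's real API; plug in the client actually used in the study.
def query_llm(prompt: str, model: str, temperature: float) -> str:
    raise NotImplementedError("replace with the actual model call")

def run_test_generation_experiment(focal_methods, model, temperature=0.0, repetitions=10):
    """Query the model several times per focal method and archive everything
    needed to audit or repeat the experiment later (model version, sampling
    settings, prompts, and all raw outputs)."""
    log = []
    for method_src in focal_methods:
        prompt = f"Write JUnit tests for the following Java method:\n{method_src}"
        outputs = [query_llm(prompt, model, temperature) for _ in range(repetitions)]
        log.append({
            "model": model,                        # exact model identifier and version
            "temperature": temperature,            # sampling configuration
            "repetitions": repetitions,
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "prompt": prompt,
            "outputs": outputs,                    # keep every run, not just the best one
            "distinct_outputs": len(set(outputs))  # coarse indicator of non-determinism
        })
    with open("experiment_log.json", "w") as f:
        json.dump(log, f, indent=2)
    return log
```

Note that fixing the temperature and logging every run does not make a closed-source model deterministic, nor does it prevent silent provider-side updates; recording the raw outputs and the exact model version alongside the results is what keeps the findings auditable.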

Authors (3)
  1. June Sallou (11 papers)
  2. Thomas Durieux (40 papers)
  3. Annibale Panichella (21 papers)