Breaking the Silence: the Threats of Using LLMs in Software Engineering (2312.08055v2)
Abstract: LLMs have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful, as numerous intricate factors can influence the outcomes of experiments involving LLMs. This paper initiates an open discussion on potential threats to the validity of LLM-based research, including issues such as closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings. In response, this paper proposes a set of guidelines tailored for SE researchers and LLM providers to mitigate these concerns. The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation.
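To make the reproducibility concern concrete, below is a minimal sketch of what recording the full experimental configuration could look like in an LLM-based test-generation study: pinning a dated model snapshot and decoding parameters, and archiving the exact prompt and completion for every query. The model identifier, the `query_llm` helper, and the file layout are illustrative placeholders, not the paper's actual artifact or any provider's API.

```python
import hashlib
import json
import time
from pathlib import Path

# Hypothetical experimental configuration: pin an exact model snapshot and the
# decoding parameters so the run can be reported and re-executed precisely.
CONFIG = {
    "model": "gpt-4-0613",   # assumed dated snapshot, not a floating alias
    "temperature": 0.0,      # deterministic decoding where the API allows it
    "max_tokens": 512,
    "seed": 42,              # only meaningful if the provider honours seeds
}


def query_llm(prompt: str, config: dict) -> str:
    """Placeholder for the actual provider call; plug in the real client here."""
    raise NotImplementedError


def generate_and_log_test(focal_method: str, out_dir: Path = Path("runs")) -> None:
    """Query the model for a unit test and archive prompt, completion, and config."""
    prompt = f"Write a JUnit test for the following Java method:\n{focal_method}"
    completion = query_llm(prompt, CONFIG)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "config": CONFIG,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "completion": completion,
    }
    out_dir.mkdir(exist_ok=True)
    # One JSON record per query keeps the raw data easy to archive and audit.
    out_file = out_dir / f"{record['prompt_sha256'][:12]}.json"
    out_file.write_text(json.dumps(record, indent=2))
```

Archiving such per-query records alongside the analysis scripts (e.g., in a long-term repository such as Zenodo) lets other researchers check for data leakage, re-run the evaluation against newer model snapshots, and verify reported results.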
Authors: June Sallou, Thomas Durieux, Annibale Panichella