Software Vulnerability and Functionality Assessment using LLMs (2403.08429v1)
Abstract: While code review is central to the software development process, it can be tedious and expensive to carry out. In this paper, we investigate whether and how LLMs can aid with code reviews. Our investigation focuses on two tasks that we argue are fundamental to good reviews: (i) flagging code with security vulnerabilities and (ii) performing software functionality validation, i.e., ensuring that code meets its intended functionality. To test performance on both tasks, we use zero-shot and chain-of-thought prompting to obtain final "approve or reject" recommendations. As data, we employ seminal code generation datasets (HumanEval and MBPP) along with expert-written code snippets with security vulnerabilities from the Common Weakness Enumeration (CWE). Our experiments consider a mixture of three proprietary models from OpenAI and smaller open-source LLMs. We find that the former outperforms the latter by a large margin. Motivated by promising results, we finally ask our models to provide detailed descriptions of security vulnerabilities. Results show that 36.7% of LLM-generated descriptions can be associated with true CWE vulnerabilities.
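To make the setup concrete, below is a minimal sketch of the kind of zero-shot "approve or reject" prompting the abstract describes, using the OpenAI chat completions API. The prompt wording, the `gpt-4` model choice, and the `review_snippet` helper are illustrative assumptions, not the paper's exact protocol; it requires the `openai` Python package (>= 1.0) and an `OPENAI_API_KEY` in the environment.

```python
# Sketch of a zero-shot approve/reject code review query (assumptions noted above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT_PROMPT = (
    "You are a code reviewer. Given the task description and the code below, "
    "check whether the code correctly implements the task and whether it "
    "contains any security vulnerability (refer to CWE categories if relevant). "
    "Answer with a single word, APPROVE or REJECT, on the last line."
)

def review_snippet(task_description: str, code: str, model: str = "gpt-4") -> str:
    """Ask the model for a final approve/reject recommendation on one snippet."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep outputs as deterministic as possible for evaluation
        messages=[
            {"role": "system", "content": ZERO_SHOT_PROMPT},
            {"role": "user", "content": f"Task:\n{task_description}\n\nCode:\n{code}"},
        ],
    )
    answer = response.choices[0].message.content.strip()
    # Reduce the free-text answer to a binary label, mirroring the approve/reject setup.
    return "REJECT" if "REJECT" in answer.upper() else "APPROVE"

if __name__ == "__main__":
    # Toy example: the code does not meet its intended functionality, so a
    # functionality-validating reviewer should reject it.
    verdict = review_snippet(
        "Return the sum of two integers.",
        "def add(a, b):\n    return a - b",
    )
    print(verdict)
```

A chain-of-thought variant would instead ask the model to reason step by step about correctness and potential CWE weaknesses before emitting the final APPROVE/REJECT line, with the verdict parsed from the last line of the response.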
- Falcon-40B: An open large language model with state-of-the-art performance. https://huggingface.co/tiiuae/falcon-7b
- Program synthesis with large language models. arXiv:2108.07732
- Evaluating large language models trained on code. arXiv:2107.03374
- Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM. www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
- Code reviews do not find bugs. How the current code review best practice slows us down. In 37th International Conference on Software Engineering, Vol. 2. IEEE, Florence, Italy, 27–28.
- Akshita Jha and Chandan K Reddy. 2023. CodeAttack: Code-based adversarial attacks for pre-trained programming language models. arXiv:2206.00052
- Classifying software changes: Clean or buggy? IEEE Transactions on Software Engineering 34, 2 (2008), 181–196.
- DeepReview: Automatic Code Review Using Deep Multi-instance Learning. In Advances in Knowledge Discovery and Data Mining, Qiang Yang, Zhi-Hua Zhou, Zhiguo Gong, Min-Ling Zhang, and Sheng-Jun Huang (Eds.). Springer International Publishing, Cham, 318–330.
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. https://openreview.net/forum?id=1qvx610Cu7
- LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning. In IEEE 34th International Symposium on Software Reliability Engineering. IEEE, Florence, Italy, 647–658.
- MITRE. 2023. Common Weakness Enumeration. https://cwe.mitre.org/
- Do code review practices impact design quality? A case study of the Qt, VTK, and ITK projects. In 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, Montreal, Canada, 171–180.
- OpenAI. 2023. Models. https://platform.openai.com/docs/models
- Code Revert Prediction with Graph Neural Networks: A Case Study at J.P. Morgan Chase. In 1st International Workshop on Software Defect Datasets. ACM, San Francisco, CA, USA, 1–5.
- Code llama: Open foundation models for code. arXiv:2308.12950
- Automated Identification of Toxic Code Reviews Using ToxiCR. arXiv:2202.13056
- Automatic code review by learning the revision of source code. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI, Honolulu, HI, USA, 4910–4917.
- Mohammed Latif Siddiq and Joanna C. S. Santos. 2023. SecurityEval. www.github.com/s2e-lab/SecurityEval
- Search-Based Optimisation of LLM Learning Shots for Story Point Estimation. In International Symposium on Search-Based Software Engineering. Springer, San Francisco, CA, USA, 123–129.
- Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288
- Using pre-trained models to boost code review automation. arXiv:2201.06850
- ReCode: Robustness Evaluation of Code Generation Models. arXiv:2212.10264
- Rasmus Ingemann Tuffveson Jensen
- Vali Tawosi
- Salwa Alamir