Software Vulnerability and Functionality Assessment using LLMs (2403.08429v1)

Published 13 Mar 2024 in cs.SE and cs.AI

Abstract: While code review is central to the software development process, it can be tedious and expensive to carry out. In this paper, we investigate whether and how LLMs can aid with code reviews. Our investigation focuses on two tasks that we argue are fundamental to good reviews: (i) flagging code with security vulnerabilities and (ii) performing software functionality validation, i.e., ensuring that code meets its intended functionality. To test performance on both tasks, we use zero-shot and chain-of-thought prompting to obtain final "approve or reject" recommendations. As data, we employ seminal code generation datasets (HumanEval and MBPP) along with expert-written code snippets with security vulnerabilities from the Common Weakness Enumeration (CWE). Our experiments consider a mixture of three proprietary models from OpenAI and smaller open-source LLMs. We find that the former outperform the latter by a large margin. Motivated by these promising results, we finally ask our models to provide detailed descriptions of security vulnerabilities. Results show that 36.7% of LLM-generated descriptions can be associated with true CWE vulnerabilities.
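As an illustration of the setup the abstract describes, below is a minimal Python sketch of zero-shot and chain-of-thought prompting for the binary "approve or reject" decision. It assumes the current OpenAI chat completions API; the model name, prompt wording, and the review_snippet helper are illustrative assumptions, not the authors' actual prompts.

# Illustrative sketch only: the exact prompts and model names are not given
# in the abstract; "gpt-4" and the wording below are placeholder assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_snippet(code: str, use_cot: bool = False) -> str:
    """Ask an LLM whether `code` should be approved or rejected."""
    instruction = (
        "You are a code reviewer. Decide whether the following code "
        "contains a security vulnerability. "
    )
    if use_cot:
        # Chain-of-thought variant: elicit step-by-step reasoning first.
        instruction += "Think step by step, then "
    instruction += "answer with a single word: APPROVE or REJECT."

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper evaluates three OpenAI models
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": code},
        ],
        temperature=0,  # near-deterministic output for a binary decision
    )
    return response.choices[0].message.content

# Example: a snippet exhibiting a classic SQL-injection pattern (CWE-89)
snippet = "query = \"SELECT * FROM users WHERE name = '\" + user_input + \"'\""
print(review_snippet(snippet, use_cot=True))

In the paper's setup, the resulting recommendation is compared against ground truth from the code generation datasets (HumanEval and MBPP) for functionality validation and against expert-written CWE snippets for vulnerability flagging.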

Authors (3)
  1. Rasmus Ingemann Tuffveson Jensen
  2. Vali Tawosi
  3. Salwa Alamir
Citations (7)