
Pros and Cons! Evaluating ChatGPT on Software Vulnerability (2404.03994v1)

Published 5 Apr 2024 in cs.SE

Abstract: This paper proposes a pipeline for quantitatively evaluating interactive LLMs such as ChatGPT using a publicly available dataset. We carry out an extensive technical evaluation of ChatGPT using Big-Vul, covering five common software vulnerability tasks. We evaluate the multitask and multilingual aspects of ChatGPT based on this dataset. We found that the existing state-of-the-art methods are generally superior to ChatGPT in software vulnerability detection. Although ChatGPT's accuracy improves when context information is provided, it still has limitations in accurately predicting severity ratings for certain CWE types. In addition, ChatGPT demonstrates some ability to locate vulnerabilities, but its performance varies among CWE types. ChatGPT exhibits limited vulnerability repair capabilities, whether or not context information is provided. Finally, ChatGPT shows uneven performance in generating CVE descriptions for various CWE types, with limited accuracy in detailed information. Overall, though ChatGPT performs well in some aspects, it still needs to improve its understanding of the subtle differences among code vulnerabilities and its ability to describe vulnerabilities in order to fully realize its potential. Our evaluation framework provides valuable insights for further enhancing ChatGPT's software vulnerability handling capabilities.

Authors (1)
  1. Xin Yin (31 papers)
Citations (2)