
Vulnerability Detection with Code Language Models: How Far Are We? (2403.18624v2)

Published 27 Mar 2024 in cs.SE and cs.CL

Abstract: In the context of the rising interest in code LLMs (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.

Novel Challenges in Vulnerability Detection with Code LLMs: Insights from the PrimeVul Dataset

Overview of the Study

The efficacy of code language models (Code LMs) in vulnerability detection has been a subject of growing research interest, yet the datasets and benchmarks traditionally used to study it have limitations that can substantially overestimate these models' capabilities. This paper introduces PrimeVul, a new dataset for training and evaluating Code LMs under more realistic and challenging vulnerability detection conditions. The paper carefully analyzes the shortcomings of existing benchmarks in data quality and evaluation methodology, and proposes rigorous solutions, including the new dataset and accompanying evaluation guidelines.

Limitations of Existing Datasets and Benchmarks

The paper identifies critical limitations in current vulnerability detection benchmarks:

  • Noisy Labels: The dichotomy between automated and manual labeling has resulted in a tradeoff between dataset size and label accuracy. Automated labeling often introduces significant noise, while manual labeling, although accurate, is not scalable.
  • Data Duplication: A considerable amount of duplication exists across the training and testing sets of existing benchmarks, inflating reported performance and making results misleading (a minimal duplicate-detection sketch follows this list).
  • Evaluation Metrics: Current benchmarks rely on accuracy and F1 scores, neither of which adequately reflects the practical utility of a model. Metrics are needed that account for false positive and false negative rates in a deployment context.
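
To make the duplication issue concrete, here is a minimal sketch of how cross-split duplicates can be flagged by hashing comment-stripped, whitespace-normalized function text. The function names and the exact normalization are illustrative assumptions, not the procedure used by PrimeVul.

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Strip C-style comments and remove whitespace so that trivially
    different copies of the same function hash identically."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return re.sub(r"\s+", "", code)

def fingerprint(code: str) -> str:
    """MD5 over the normalized text; any stable hash works here."""
    return hashlib.md5(normalize(code).encode("utf-8")).hexdigest()

def cross_split_duplicates(train_funcs, test_funcs):
    """Return test-set functions whose fingerprints also appear in training."""
    train_hashes = {fingerprint(f) for f in train_funcs}
    return [f for f in test_funcs if fingerprint(f) in train_hashes]

# Example: the second test function is a whitespace-only variant of a training one.
train = ["int add(int a, int b) { return a + b; }"]
test = ["int sub(int a, int b) { return a - b; }",
        "int add(int a,int b){return a+b;}"]
print(len(cross_split_duplicates(train, test)))  # -> 1
```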

Introduction of PrimeVul

To address these limitations, PrimeVul employs a series of novel approaches:

  • Rigorous Data Collection and Labeling: PrimeVul employs labeling techniques that markedly improve label accuracy, for example by leveraging expert vulnerability analyses and security-fix commits whose changes are confined to a single function. Combined with strict de-duplication, this reduces noise and makes the dataset a far more reliable benchmark.
  • Temporal Splitting and Novel Evaluation Metrics: PrimeVul splits data chronologically to mitigate data leakage and proposes the Vulnerability Detection Score (VD-S), which measures the false negative rate at a configurable false positive rate threshold, giving a more realistic picture of practical effectiveness.
  • Pairwise Evaluation: Beyond conventional per-function evaluation, PrimeVul also evaluates models on pairs of a vulnerable function and its patched, benign counterpart, probing whether a model truly distinguishes the vulnerability rather than surface features. Both ideas are illustrated in the sketch after this list.
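
The following is a minimal sketch of how these two evaluation ideas could be computed. The function names, the 0.5% false positive budget, and the pair-outcome labels are illustrative assumptions, not PrimeVul's exact implementation.

```python
from typing import List

def vd_score(scores: List[float], labels: List[int], fpr_budget: float = 0.005) -> float:
    """False negative rate at the best threshold whose false positive rate
    stays within fpr_budget (a VD-S-style metric). Higher score = more vulnerable."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best_fnr = 1.0
    # Candidate thresholds: every observed score, plus one that rejects everything.
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        fpr = sum(s >= t for s in negatives) / max(len(negatives), 1)
        fnr = sum(s < t for s in positives) / max(len(positives), 1)
        if fpr <= fpr_budget:
            best_fnr = min(best_fnr, fnr)
    return best_fnr

def pairwise_outcome(pred_on_vulnerable: int, pred_on_patched: int) -> str:
    """Classify a (vulnerable, patched) pair by the model's two predictions."""
    if pred_on_vulnerable == 1 and pred_on_patched == 0:
        return "pair-correct"   # distinguishes the bug from its fix
    if pred_on_vulnerable == 1 and pred_on_patched == 1:
        return "flags-both"     # over-predicts vulnerability
    if pred_on_vulnerable == 0 and pred_on_patched == 0:
        return "misses-both"    # under-predicts vulnerability
    return "reversed"           # worst case: fix flagged, bug missed

# Toy example: perfectly separated scores give a VD-S of 0.0 at any budget.
print(vd_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))   # -> 0.0
print(pairwise_outcome(1, 0))                          # -> pair-correct
```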

Evaluation of Code LMs on PrimeVul

Evaluating Code LMs on PrimeVul yields several key findings:

  • Benchmark Overestimation: Existing benchmarks significantly overestimated model performance. For example, a state-of-the-art model achieved an F1 score of 68.26% on BigVul but only 3.09% on PrimeVul.
  • Challenges in Realistic Evaluation: Code LMs struggle in realistic settings, as highlighted by the considerable disparity in performance between PrimeVul and previously used datasets.
  • Advanced Training Techniques: Class weights and contrastive learning were explored as advanced training techniques but yielded only marginal improvements (a minimal class-weighting sketch follows this list). Larger models, including GPT-3.5 and GPT-4, were also evaluated with limited success, emphasizing the need for novel approaches to model development for effective vulnerability detection.
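
For readers unfamiliar with the class-weighting idea, here is a minimal PyTorch-style sketch of a weighted cross-entropy objective for the heavily imbalanced vulnerable/benign label distribution. The class counts, weighting formula, and classifier interface are illustrative assumptions, not the paper's training setup.

```python
import torch
import torch.nn as nn

# Illustrative counts: vulnerable functions are a small minority of the data.
num_benign, num_vuln = 9700, 300
total = num_benign + num_vuln

# "Balanced" weights: up-weight the rare vulnerable class so the loss is not
# dominated by the benign majority.
class_weights = torch.tensor([
    total / (2 * num_benign),   # weight for label 0 (benign)
    total / (2 * num_vuln),     # weight for label 1 (vulnerable)
])

criterion = nn.CrossEntropyLoss(weight=class_weights)

# logits: (batch, 2) scores from any binary vulnerability-classification head.
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
loss.backward()  # gradients now emphasize mistakes on the vulnerable class
```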

Conclusion and Future Directions

The introduction of PrimeVul and the insights gained from its evaluation offer a stark depiction of the current capabilities of Code LMs in vulnerability detection. This work underscores the intricacy of deploying Code LMs in security roles and signals a call-to-action for innovative research efforts. Future directions might include enhancing model understanding of software security through pre-training modifications or hybrid methodologies combining Code LMs with traditional program analysis tools. Through continued exploration and adaptation, the field can strive toward models that better grasp and predict vulnerabilities in software code.

Authors (9)
  1. Yangruibo Ding (17 papers)
  2. Yanjun Fu (4 papers)
  3. Omniyyah Ibrahim (1 paper)
  4. Chawin Sitawarin (26 papers)
  5. Xinyun Chen (80 papers)
  6. Basel Alomair (14 papers)
  7. David Wagner (67 papers)
  8. Baishakhi Ray (88 papers)
  9. Yizheng Chen (23 papers)
Citations (25)