Toward Improved Deep Learning-based Vulnerability Detection (2403.03024v1)

Published 5 Mar 2024 in cs.SE

Abstract: Deep learning (DL) has been a common thread across several recent techniques for vulnerability detection. The rise of large, publicly available datasets of vulnerabilities has fueled the learning process underpinning these techniques. While these datasets help the DL-based vulnerability detectors, they also constrain these detectors' predictive abilities. Vulnerabilities in these datasets have to be represented in a certain way, e.g., code lines, functions, or program slices within which the vulnerabilities exist. We refer to this representation as a base unit. The detectors learn how base units can be vulnerable and then predict whether other base units are vulnerable. We have hypothesized that this focus on individual base units harms the ability of the detectors to properly detect those vulnerabilities that span multiple base units (or MBU vulnerabilities). For vulnerabilities such as these, a correct detection occurs when all comprising base units are detected as vulnerable. Verifying how existing techniques perform in detecting all parts of a vulnerability is important to establish their effectiveness for other downstream tasks. To evaluate our hypothesis, we conducted a study focusing on three prominent DL-based detectors: ReVeal, DeepWukong, and LineVul. Our study shows that all three detectors contain MBU vulnerabilities in their respective datasets. Further, we observed significant accuracy drops when detecting these types of vulnerabilities. We present our study and a framework that can be used to help DL-based detectors toward the proper inclusion of MBU vulnerabilities.
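The abstract's central criterion can be stated concretely: a vulnerability that spans multiple base units (an MBU vulnerability) is correctly detected only when every one of its comprising base units is flagged as vulnerable. A minimal sketch of that criterion, assuming per-unit boolean predictions (this helper is illustrative, not from the paper's artifact):

```python
def mbu_detected(base_unit_predictions):
    """Return True only if every base unit comprising a single MBU
    vulnerability is predicted vulnerable.

    base_unit_predictions: list of booleans, one per base unit
    (e.g., per code line, function, or program slice).
    """
    return all(base_unit_predictions)

# A detector that flags only two of three comprising units misses the
# MBU vulnerability as a whole, even if per-unit accuracy looks good.
print(mbu_detected([True, True, False]))  # partial detection -> False
print(mbu_detected([True, True, True]))   # full detection -> True
```

Under this stricter, whole-vulnerability criterion, per-unit accuracy can overstate a detector's effectiveness, which is the gap the study measures.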

References (42)
  1. [n.d.]. Anonymous Project Website. https://anonymousseres.github.io/icse/
  2. [n.d.]a. CVE-2014-3647. https://nvd.nist.gov/vuln/detail/CVE-2014-3647
  3. [n.d.]b. CVE-2017-0596. https://nvd.nist.gov/vuln/detail/CVE-2017-0596
  4. [n.d.]c. CVE-2021-33815. https://nvd.nist.gov/vuln/detail/CVE-2021-33815
  5. [n.d.]. FFmpeg’s Website. https://ffmpeg.org/
  6. [n.d.]. LLVM. https://llvm.org/
  7. [n.d.]. Lua’s Website. https://www.lua.org/
  8. [n.d.]a. Partial fix of CVE-2014-3647. https://github.com/torvalds/linux/commit/234f3ce485d54017f15cf5e0699cff4100121601
  9. [n.d.]b. Partial fix of CVE-2014-3647. https://github.com/torvalds/linux/commit/d1442d85cc30ea75f7d399474ca738e0bc96f715
  10. [n.d.]. QEMU’s Website. https://www.qemu.org/
  11. [n.d.]. Redis’ Website. https://redis.io/
  12. [n.d.]. SARD’s Website. https://samate.nist.gov/SARD/
  13. Modeling and characterizing software vulnerabilities. (2017).
  14. Deep learning based vulnerability detection: Are we there yet? IEEE Transactions on Software Engineering (2021).
  15. Deepwukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–33.
  16. Data Quality for Software Vulnerability Datasets. arXiv preprint arXiv:2301.05456 (2023).
  17. An investigation into inconsistency of software vulnerability severity across data sources. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 338–348.
  18. Data preparation for software vulnerability prediction: A systematic literature review. IEEE Transactions on Software Engineering (2022).
  19. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. 313–324.
  20. A C/C++ code vulnerability dataset with code changes and CVE summaries. In Proceedings of the 17th International Conference on Mining Software Repositories. 508–512.
  21. Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: a transformer-based line-level vulnerability prediction. In Proceedings of the 19th International Conference on Mining Software Repositories. 608–620.
  22. Understanding Cloud Computing Vulnerabilities. IEEE Security & Privacy 9, 2 (2011), 50–57. https://doi.org/10.1109/MSP.2010.115
  23. LineVD: statement-level vulnerability detection using graph neural networks. In Proceedings of the 19th International Conference on Mining Software Repositories. 596–607.
  24. The importance of accounting for real-world labelling when predicting software vulnerabilities. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 695–705.
  25. David Kawrykow and Martin P Robillard. 2011. Non-essential changes in version histories. In Proceedings of the 33rd International Conference on Software Engineering. 351–360.
  26. Automatic clustering of code changes. In Proceedings of the 13th International Conference on Mining Software Repositories. 61–72.
  27. A survey on data-driven software vulnerability assessment and prioritization. Comput. Surveys 55, 5 (2022), 1–39.
  28. The anatomy of a vulnerability database: A systematic mapping study. Journal of Systems and Software (2023), 111679.
  29. Sysevr: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing 19, 4 (2021), 2244–2258.
  30. Vuldeepecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681 (2018).
  31. Software vulnerability discovery techniques: A survey. In 2012 fourth international conference on multimedia information networking and security. IEEE, 152–156.
  32. Automatically extracting instances of code change patterns with ast analysis. In 2013 IEEE international conference on software maintenance. IEEE, 388–391.
  33. Detection of recurring software vulnerabilities. In Proceedings of the IEEE/ACM international conference on Automated software engineering. 447–456.
  34. Frank Piessens. 2002. A taxonomy of causes of software vulnerabilities in internet software. In Supplementary Proceedings of the 13th International Symposium on Software Reliability Engineering. Citeseer, 47–52.
  35. Understanding Software Vulnerabilities Related to Architectural Security Tactics: An Empirical Investigation of Chromium, PHP and Thunderbird. In 2017 IEEE International Conference on Software Architecture (ICSA). 69–78. https://doi.org/10.1109/ICSA.2017.39
  36. Identifying casualty changes in software patches. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 304–315.
  37. Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE transactions on software engineering 37, 6 (2010), 772–787.
  38. Ajay S Singh and Micah B Masuku. 2014. Sampling techniques & determination of sample size in applied statistics research: An overview. International Journal of economics, commerce and management 2, 11 (2014), 1–22.
  39. Data quality matters: A case study on data label correctness for security bug report prediction. IEEE Transactions on Software Engineering 48, 7 (2021), 2541–2556.
  40. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590–604.
  41. Domain knowledge-based security bug reports prediction. Knowledge-Based Systems 241 (2022), 108293.
  42. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems 32 (2019).
Authors (4)
  1. Adriana Sejfia (4 papers)
  2. Satyaki Das (2 papers)
  3. Saad Shafiq (6 papers)
  4. Nenad Medvidović (2 papers)
Citations (7)