
DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism (2309.15324v1)

Published 27 Sep 2023 in cs.CR

Abstract: One of the most pressing threats to computing systems is software vulnerabilities, which can compromise both hardware and software components. Existing methods for vulnerability detection remain suboptimal. Traditional techniques are both time-consuming and labor-intensive, while machine-learning-based approaches often underperform when applied to complex datasets, due to their inability to capture high-dimensional relationships. Previous deep-learning strategies also fall short in capturing sufficient feature information. Although self-attention mechanisms can process information over long distances, they fail to capture structural information. In this paper, we introduce DefectHunter, an innovative model for vulnerability identification that employs the Conformer mechanism. This mechanism fuses self-attention with convolutional networks to capture both local, position-wise features and global, content-based interactions. Furthermore, we optimize the self-attention mechanisms to mitigate the issue of excessive attention heads introducing extraneous noise by adjusting the denominator. We evaluated DefectHunter against ten baseline methods using six industrial and two highly complex datasets. On the QEMU dataset, DefectHunter exhibited a 20.62% improvement in accuracy over Pongo-70B, and for the CWE-754 dataset, its accuracy was 14.64% higher. To investigate how DefectHunter comprehends vulnerabilities, we conducted a case study, which revealed that our model effectively understands the mechanisms underlying vulnerabilities.
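
The abstract does not spell out how the self-attention denominator is adjusted. One published variant, Evan Miller's "Attention Is Off By One" proposal, adds 1 to the softmax denominator so that an attention head can assign close to zero total weight instead of being forced to distribute a full unit of attention. The sketch below assumes that form purely for illustration; the function names `softmax` and `softmax_one` are hypothetical and are not taken from DefectHunter's implementation.

```python
import math


def softmax(scores):
    """Standard softmax: exp(x_i) / sum_j exp(x_j). Weights always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_one(scores):
    """Softmax with an adjusted denominator: exp(x_i) / (1 + sum_j exp(x_j)).

    Assumed form (Miller's proposal), not confirmed as the paper's exact formula.
    The extra 1 acts like an implicit logit of 0, so the weights can sum to
    less than 1 when every score is strongly negative.
    """
    m = max(max(scores), 0.0)  # include the implicit 0 logit in the stability shift
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]


if __name__ == "__main__":
    uninformative = [-20.0, -20.0, -20.0]  # a head with nothing useful to attend to
    print(sum(softmax(uninformative)))      # standard softmax still spreads full weight
    print(sum(softmax_one(uninformative)))  # adjusted softmax can stay near zero
```

Under this assumption, a head whose scores are all strongly negative contributes almost nothing to the output, which is one way an adjusted denominator could limit noise from superfluous heads.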
