
VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection (2205.12424v1)

Published 25 May 2022 in cs.CR, cs.AI, and cs.LG

Abstract: This paper presents VulBERTa, a deep learning approach to detecting security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (VulDeePecker, Draper, ReVeal and muVulDeePecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training-data size and number of model parameters.
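The custom tokenisation pipeline the abstract mentions builds on byte-pair encoding (BPE): source code is first split into lexer-level tokens, and frequent adjacent symbol pairs are then iteratively merged to learn a subword vocabulary. The sketch below illustrates the BPE merge-learning step only; it is a toy, not the authors' pipeline (VulBERTa pre-tokenises C/C++ with a clang-based lexer and preserves reserved words and API names, whereas here the input token list and all function names are illustrative assumptions).

```python
# Toy byte-pair encoding (BPE), as used in subword tokenisation pipelines.
# Illustrative sketch only -- not VulBERTa's released tokeniser.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace each occurrence of the pair with one merged symbol.

    The lookaround assertions stop the pattern from matching inside a
    longer symbol (the classic naive-str.replace BPE bug).
    """
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

def learn_bpe(corpus_tokens, num_merges):
    """Learn BPE merge rules from a list of pre-tokenised code tokens."""
    # Represent each token as space-separated characters, e.g. "m e m c p y".
    vocab = Counter(" ".join(tok) for tok in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy "corpus" of C identifiers, standing in for lexer output.
tokens = ["memcpy", "memset", "strcpy", "strcat", "sizeof", "memcpy", "strcpy"]
merges = learn_bpe(tokens, 5)
print(merges[0])  # the most frequent adjacent pair, here ('c', 'p')
```

In the real pipeline the learned merges are applied to new code before it reaches the model, so recurring fragments such as `cpy` become single vocabulary entries; the pre-trained encoder is then fine-tuned with a classification head for the binary and multi-class detection tasks.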
