LIVABLE: Exploring Long-Tailed Classification of Software Vulnerability Types (2306.06935v1)
Abstract: Prior studies generally focus on software vulnerability detection and have demonstrated the effectiveness of Graph Neural Network (GNN)-based approaches for the task. Because software vulnerabilities come in various types with different degrees of severity, it is also beneficial for developers to determine the type of each piece of vulnerable code. In this paper, we observe that the distribution of vulnerability types is long-tailed in practice: a small portion of classes have massive numbers of samples (i.e., head classes), while the others contain only a few samples (i.e., tail classes). Directly adopting previous vulnerability detection approaches tends to result in poor detection performance, mainly for two reasons. First, it is difficult to learn the vulnerability representation effectively due to the over-smoothing issue of GNNs. Second, vulnerability types in the tail are hard to predict because they have extremely few associated samples. To alleviate these issues, we propose a Long-taIled software VulnerABiLity typE classification approach, called LIVABLE. LIVABLE mainly consists of two modules: (1) a vulnerability representation learning module, which improves the propagation steps in the GNN to distinguish node representations through a differentiated propagation method, and which also involves a sequence-to-sequence model to enhance the vulnerability representations; and (2) an adaptive re-weighting module, which adjusts the learning weights for different types according to the training epochs and the numbers of associated samples through a novel training loss.
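To make the idea of adaptive re-weighting concrete, the following is a minimal Python sketch of a per-class weight that depends on both the training epoch and the number of samples in each class. It is not the loss proposed in the paper, whose exact form the abstract does not give. The function names `class_weights` and `weighted_cross_entropy`, the linear schedule, and the choice to move from uniform weights toward inverse-frequency weights as training proceeds are all assumptions made for illustration.

```python
import math


def class_weights(class_counts, epoch, total_epochs):
    """Hypothetical schedule: blend uniform weights with inverse-frequency
    weights. At epoch 0 every class has weight 1; by the last epoch the
    weights are proportional to 1 / (number of samples in the class)."""
    if total_epochs <= 1:
        alpha = 1.0
    else:
        # Linear ramp from 0 to 1, clamped (an assumption, not the paper's schedule).
        alpha = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    inverse = [1.0 / count for count in class_counts]
    mean_inverse = sum(inverse) / len(inverse)
    # Normalise so the inverse-frequency weights average to 1.
    normalised = [w / mean_inverse for w in inverse]
    return [(1.0 - alpha) * 1.0 + alpha * w for w in normalised]


def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    peak = max(logits)
    # Numerically stable log-sum-exp over the logits.
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[label] * (log_sum_exp - logits[label])


if __name__ == "__main__":
    counts = [1000, 100, 10]  # one head class and two progressively rarer classes
    for epoch in (0, 5, 9):
        print(epoch, class_weights(counts, epoch, total_epochs=10))
```

Under this sketch, a misclassified tail-class sample contributes more to the loss late in training than a head-class sample with the same logits, which is one common way to counteract a long-tailed label distribution.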
Authors: Xin-Cheng Wen, Cuiyun Gao, Feng Luo, Haoyu Wang, Ge Li, Qing Liao