Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study (2401.11105v1)
Abstract: Collecting relevant and high-quality data is integral to the development of effective Software Vulnerability (SV) prediction models. Most of the current SV datasets rely on SV-fixing commits to extract vulnerable functions and lines. However, none of these datasets have considered latent SVs existing between the introduction and fix of the collected SVs. There is also little known about the usefulness of these latent SVs for SV prediction. To bridge these gaps, we conduct a large-scale study on the latent vulnerable functions in two commonly used SV datasets and their utilization for function-level and line-level SV predictions. Leveraging the state-of-the-art SZZ algorithm, we identify more than 100k latent vulnerable functions in the studied datasets. We find that these latent functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions, yet they have a noise level of around 6%. Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs. The improvements are up to 24.5% in the performance (F1-Score) of function-level SV predictions and up to 67% in the effectiveness of localizing vulnerable lines. Overall, our study presents the first promising step toward the use of latent SVs to improve the quality of SV datasets and enhance the performance of SV prediction tasks.
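As a rough illustration of what "latent SVs existing between the introduction and fix" could mean in practice, the sketch below selects the versions of one function that were created at or after a vulnerability-introducing commit and before the fixing commit. This is a minimal sketch, not the paper's pipeline: the `FunctionVersion` type, the `latent_versions` helper, the commit identifiers, and the assumption of a linear commit history with integer positions are all hypothetical. In a real setting the introducing commit would be reported by an SZZ-style tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionVersion:
    commit: str    # hypothetical id of the commit that produced this version
    position: int  # index of that commit in a linear history (assumption)


def latent_versions(history, introduced_at, fixed_at):
    """Return versions created at or after the introducing commit
    and strictly before the fixing commit."""
    return [v for v in history if introduced_at <= v.position < fixed_at]


# Toy history of a single function; all values are made up for illustration.
history = [
    FunctionVersion("a1", 0),  # predates the vulnerability
    FunctionVersion("b2", 3),  # assumed vulnerability-introducing commit
    FunctionVersion("c3", 5),
    FunctionVersion("d4", 8),  # last version before the fix
    FunctionVersion("e5", 9),  # fixing commit: this version is the fixed one
]

latent = latent_versions(history, introduced_at=3, fixed_at=9)
print([v.commit for v in latent])  # ['b2', 'c3', 'd4']
```

In this toy history, a dataset built only from the fixing commit would typically capture just the version right before the fix (`d4`), while the earlier versions in the window (`b2`, `c3`) are the kind of additional candidates the abstract refers to.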
Authors: Triet H. M. Le, Xiaoning Du, M. Ali Babar