ActiveClean: Generating Line-Level Vulnerability Data via Active Learning (2312.01588v1)
Abstract: Deep learning vulnerability detection tools are increasing in popularity and have been shown to be effective. These tools rely on large volumes of high-quality training data, which is very hard to obtain. Most currently available datasets provide only function-level labels, reporting whether a function is vulnerable or not. However, for vulnerability detection to be useful, we also need to know the lines that are relevant to the vulnerability. This paper works towards developing systematic tools and proposes ActiveClean to generate a large volume of line-level vulnerability data from commits. That is, in addition to function-level labels, it also reports which lines in the function are likely responsible for vulnerability detection. In the past, static analysis has been applied to clean commits to generate line-level data. Our approach, based on active learning, is easy to use and scalable, and provides a complement to static analysis. We designed semantic and syntactic properties from commit lines and used them to train the model. We evaluated our approach on both Java and C datasets, processing more than 4.3K commits and 119K commit lines. ActiveClean achieved an F1 score between 70 and 74. Further, we show that active learning is effective: just 400 training data points were enough to reach an F1 score of 70.23. Using ActiveClean, we generated line-level labels for the entire FFmpeg project in the Devign dataset, including 5K functions, and also detected incorrect function-level labels. We demonstrated that, using our cleaned data, LineVul, a SOTA line-level vulnerability detection tool, detected 70 more vulnerable lines and 18 more vulnerable functions, and improved Top-10 accuracy from 66% to 73%.
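The active-learning idea in the abstract can be illustrated with a minimal pool-based sketch. This is not the authors' implementation: the abstract does not state the query strategy, the classifier, or the exact features, so uncertainty sampling, the small logistic-regression model, the two synthetic features, and every name below (`train_logreg`, `active_learning`, `oracle`, `BUDGET`) are assumptions chosen only to show how a small labeling budget is spent on the most informative commit lines.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=200, lr=0.5):
    # Fit a tiny logistic-regression model with batch gradient descent.
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    # Probability that a line is vulnerability-relevant under the model.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def active_learning(pool, oracle, seed_idx, budget):
    # Pool-based loop: retrain, then ask the oracle (a stand-in for a human
    # annotator) to label the line the model is least certain about.
    labeled = {i: oracle(i) for i in seed_idx}
    while len(labeled) < budget:
        idx = list(labeled)
        model = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        query = min(unlabeled, key=lambda i: abs(predict_proba(model, pool[i]) - 0.5))
        labeled[query] = oracle(query)
    idx = list(labeled)
    model = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Synthetic stand-in for commit lines: two hypothetical numeric features per line.
rng = random.Random(0)
pool = [[rng.random(), rng.random()] for _ in range(200)]
truth = [1 if x[0] > x[1] else 0 for x in pool]  # hidden ground truth


def oracle(i):
    # Hypothetical annotator: returns the true label of pool item i.
    return truth[i]


# Start from one labeled example of each class.
seed_idx = [truth.index(1), truth.index(0)]
BUDGET = 20  # total number of labels we are willing to pay for
model, labeled = active_learning(pool, oracle, seed_idx, BUDGET)

accuracy = sum((predict_proba(model, x) >= 0.5) == bool(t)
               for x, t in zip(pool, truth)) / len(pool)
print(f"labels used: {len(labeled)}, accuracy on pool: {accuracy:.2f}")
```

The design point is that each query goes to the line whose predicted probability is closest to 0.5, so the labeling budget is concentrated near the decision boundary rather than spread over lines the model already classifies confidently.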
- Ashwin Kallingal Joshy
- Mirza Sanjida Alam
- Shaila Sharmin
- Qi Li
- Wei Le