Causative Insights into Open Source Software Security using Large Language Code Embeddings and Semantic Vulnerability Graph (2401.07035v1)
Abstract: Open Source Software (OSS) security and resilience are worldwide concerns, and weaknesses in them hamper economic and technological innovation. OSS vulnerabilities can cause unauthorized access, data breaches, network disruptions, and privacy violations, negating the benefits of the software. Recent deep-learning techniques have shown great promise in identifying and localizing vulnerabilities in source code, but because they lack proper methodological analysis, it is unclear how effective they are from a usability perspective. These methods typically offload a developer's task of classifying and localizing vulnerable code, yet no adequate study has measured their actual effectiveness for the end user. To address the lack of developer training in prior methods, we propose a system that links vulnerabilities to their root cause, thereby intuitively educating developers to code more securely. We also provide a comprehensive usability study that tests the effectiveness of our system in helping developers fix vulnerable source code and its capability to assist them in writing more secure code. Our study shows a 24% improvement in code repair capabilities compared to previous methods. We also show that, when trained by our system, approximately 9% of the developers, on average, naturally tend to write more secure code with fewer vulnerabilities.
Authors:
- Nafis Tanveer Islam
- Gonzalo De La Torre Parra
- Dylan Manual
- Murtuza Jadliwala
- Peyman Najafirad