DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection (2410.18479v1)
Abstract: Software vulnerabilities represent one of the most pressing threats to computing systems. Identifying vulnerabilities in source code is crucial for protecting user privacy and reducing economic losses. Traditional static analysis tools rely on security experts to manually build detection rules, a process that costs substantial time and manpower and adapts poorly to new vulnerabilities. The emergence of pre-trained code LLMs has provided a new solution for automated vulnerability detection. However, pre-trained code models are typically trained at large scale on token-level objectives, which hampers their ability to capture the structural and dependency relationships among code segments. Certain types of vulnerabilities are tied to dependency relationships within the code, so identifying and analyzing such samples is a significant challenge for pre-trained models. In this paper, we propose DFEPT, a data flow embedding technique that enhances the performance of pre-trained models on vulnerability detection tasks by providing them with effective vulnerability data flow information. Specifically, we parse data flow graphs (DFGs) from function-level source code and use the data type of each variable as the node feature of the DFG. By applying graph learning techniques, we embed the data flow graph and incorporate relative positional information into the graph embedding using sinusoidal positional encoding, to ensure the completeness of vulnerability data flow information. Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
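The positional-encoding step mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard sinusoidal formulation from the Transformer literature. The function names, the choice to add (rather than concatenate) the encoding, and the use of a node's index as its position are assumptions made for illustration. The abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Return a num_positions x dim table of sinusoidal encodings.

    Even columns hold sin(pos / 10000^(i/dim)) and the following odd
    column holds the matching cosine, where i is the even column index.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


def add_positions(node_embeddings):
    """Add the encoding for index k to the k-th node embedding.

    Assumption: nodes are ordered by their position in the source code,
    and the encoding is combined by element-wise addition.
    """
    dim = len(node_embeddings[0])
    table = sinusoidal_positional_encoding(len(node_embeddings), dim)
    return [
        [x + p for x, p in zip(vec, row)]
        for vec, row in zip(node_embeddings, table)
    ]


# Hypothetical 4-dimensional embeddings for three DFG nodes.
embeddings = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
encoded = add_positions(embeddings)
```

Because the encoding depends only on position and dimension, two nodes with identical type features still receive distinct vectors once their positions differ.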