Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities (2402.17230v1)
Abstract: Security vulnerabilities are increasingly prevalent in modern software, and their consequences for society are wide-reaching. Various approaches to defending against these vulnerabilities have been proposed; among them, those leveraging deep learning (DL) avoid major barriers faced by other techniques and have therefore attracted more attention in recent years. However, DL-based approaches face critical challenges, including the lack of sizable, quality-labeled, task-specific datasets and an inability to generalize well to unseen, real-world scenarios. Lately, large language models (LLMs) have demonstrated impressive potential in various domains by overcoming those challenges, especially through chain-of-thought (CoT) prompting. In this paper, we explore how to leverage LLMs and CoT to address three key software vulnerability analysis tasks: identifying vulnerabilities of a given type, discovering vulnerabilities of any type, and patching detected vulnerabilities. We instantiate the general CoT methodology in the context of these tasks through VSP, our unified, vulnerability-semantics-guided prompting approach, and conduct extensive experiments assessing VSP against five baselines for the three tasks, using three LLMs and two datasets. Results show substantial superiority of our CoT-inspired prompting over the baselines (553.3%, 36.5%, and 30.8% higher F1 accuracy for vulnerability identification, discovery, and patching, respectively, on CVE datasets). Through in-depth case studies analyzing VSP failures, we also reveal current gaps in LLM/CoT for challenging vulnerability cases, and we propose and validate corresponding improvements.
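To make the "percent higher F1" figures easier to read, here is a minimal sketch of how a relative F1 gain can be computed. It assumes the standard F1 definition (harmonic mean of precision and recall) and that the reported percentages are relative improvements over a baseline; the abstract does not spell out either point. The function names and all counts below are hypothetical illustrations, not values from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
baseline_f1 = f1_score(tp=10, fp=40, fn=90)   # weak baseline
prompted_f1 = f1_score(tp=60, fp=40, fn=40)   # stronger prompting strategy
gain = relative_gain_percent(prompted_f1, baseline_f1)

print(f"baseline F1 = {baseline_f1:.3f}")
print(f"prompted F1 = {prompted_f1:.3f}")
print(f"relative gain = {gain:.1f}%")
```

Under this reading, a gain above 100% means the new F1 is more than double the baseline, which is why a very low baseline F1 can produce a figure in the hundreds of percent.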
Authors: Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, Haipeng Cai