
DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models

Published 19 Feb 2024 in cs.CR, cs.LG, cs.PL, and cs.SE | arXiv:2402.13291v2

Abstract: The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging LLMs, which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving the code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to focus the LLM's attention on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context, and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuned models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules) requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in 10% to 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4 as well as window-based models like TFix.

References (57)
  1. Learning to Represent Programs with Graphs. In ICLR 2018.
  2. Self-Supervised Bug Detection and Repair. In NeurIPS 2021, virtual. 27865–27876.
  3. Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. 3, OOPSLA (2019), 159:1–159:27.
  4. TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer. In ICML 2021, virtual (Proceedings of Machine Learning Research, Vol. 139). PMLR, 780–791.
  5. A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World. Commun. ACM 53, 2 (feb 2010), 66–75.
  6. Evaluating Large Language Models Trained on Code.
  7. Sequencer: Sequence-to-sequence learning for end-to-end program repair. IEEE Transactions on Software Engineering (2019).
  8. QLoRA: Efficient Finetuning of Quantized LLMs. CoRR abs/2305.14314 (2023). https://doi.org/10.48550/ARXIV.2305.14314 arXiv:2305.14314
  9. Semantic Code Repair using Neuro-Symbolic Transformation Networks. Workshop track invitation, ICML 2018. https://openreview.net/forum?id=r1hsJCe0Z
  10. Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs. In ICLR. https://openreview.net/forum?id=SJeqs6EFvB
  11. ESLint. 2022. ESLint rules. https://eslint.org/docs/latest/rules/
  12. Sorald: Automatic Patch Suggestions for SonarQube Static Analysis Violations. CoRR abs/2103.12033 (2021). arXiv:2103.12033 https://arxiv.org/abs/2103.12033
  13. Vision Transformer-Inspired Automated Vulnerability Repair. ACM Trans. Softw. Eng. Methodol. (nov 2023). https://doi.org/10.1145/3632746 Just Accepted.
  14. Automatic Software Repair: A Survey. IEEE Trans. Software Eng. 45, 1 (2019), 34–67.
  15. Alex Graves. 2012. Sequence Transduction with Recurrent Neural Networks. In ICML 2012, Workshop on Representation Learning.
  16. DeepFix: Fixing Common C Language Errors by Deep Learning. AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017). https://ojs.aaai.org/index.php/AAAI/article/view/10742
  17. On Distribution Shift in Learning-based Bug Detectors. In ICML 2022 (Proceedings of Machine Learning Research, Vol. 162). PMLR, 8559–8580.
  18. Global Relational Models of Source Code. In ICLR 2020. https://openreview.net/forum?id=B1lnbRNtwr
  19. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
  20. Towards Practical Program Repair with On-Demand Candidate Generation. In ICSE 2018. ACM, 12–23.
  21. Mistral 7B. CoRR abs/2310.06825 (2023). https://doi.org/10.48550/ARXIV.2310.06825 arXiv:2310.06825
  22. Mixtral of Experts. CoRR abs/2401.04088 (2024). https://doi.org/10.48550/ARXIV.2401.04088 arXiv:2401.04088
  23. BugBuilder: An Automated Approach to Building Bug Repository. IEEE Transactions on Software Engineering (2022), 1–1. https://doi.org/10.1109/TSE.2022.3177713
  24. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. CoRR abs/2208.11640 (2022). https://doi.org/10.48550/arXiv.2208.11640 arXiv:2208.11640
  25. StarCoder: may the source be with you! CoRR abs/2305.06161 (2023). https://doi.org/10.48550/ARXIV.2305.06161 arXiv:2305.06161
  26. Automatic inference of code transforms for patch generation. In Foundations of Software Engineering, ESEC/FSE 2017. ACM, 727–739.
  27. ENCORE: Ensemble Learning using Convolution Neural Machine Translation for Automatic Program Repair. CoRR abs/1906.08691 (2019). http://arxiv.org/abs/1906.08691
  28. SapFix: automated end-to-end repair at scale. In ICSE 2019.
  29. Angelix: scalable multiline program patch synthesis via symbolic analysis. In ICSE 2016.
  30. Microsoft. 2023a. Introducing AI-powered application security testing with GitHub Advanced Security. https://github.blog/2023-11-08-ai-powered-appsec/
  31. Microsoft. 2023b. Learn how to work with the ChatGPT and GPT-4 models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/chatgpt?pivots=programming-language-chat-completions
  32. Ghassan Misherghi and Zhendong Su. 2006. HDD: hierarchical Delta Debugging. In ICSE 2006. ACM, 142–151.
  33. Martin Monperrus. 2018. The Living Review on Automated Program Repair. Technical Report HAL-01956501. https://hal.archives-ouvertes.fr/hal-01956501v2/file/repair-living-review.pdf
  34. SemFix: program repair via semantic analysis. In ICSE 2013.
  35. OpenAI. 2022. GPT-3.5 Model Registry. https://platform.openai.com/docs/model-index-for-researchers/models-referred-to-as-gpt-3-5
  36. OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). https://doi.org/10.48550/ARXIV.2303.08774 arXiv:2303.08774
  37. OWASP Foundation. 2010. Path Traversal vulnerability description. https://owasp.org/www-community/attacks/Path_Traversal
  38. Synchromesh: Reliable Code Generation from Pre-trained Language Models. In ICLR 2022.
  39. Michael Pradel and Koushik Sen. 2018. DeepBugs: a learning approach to name-based bug detection. Proc. ACM Program. Lang. 2, OOPSLA (2018), 147:1–147:25.
  40. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 21 (2020), 140:1–140:67.
  41. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. ArXiv. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
  42. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In SIGKDD 2020 (Virtual Event, CA, USA) (KDD ’20). ACM, New York, NY, USA, 3505–3506.
  43. Code completion with statistical language models. In PLDI 2014. ACM, 419–428.
  44. Test-case reduction for C compiler bugs. In PLDI 2012. ACM, 335–346.
  45. Lessons from Building Static Analysis Tools at Google. Commun. ACM 61, 4 (mar 2018), 58–66.
  46. SemGrep. 2023a. Autofix. https://semgrep.dev/docs/writing-rules/autofix/
  47. SemGrep. 2023b. We put GPT-4 in Semgrep to point out false positives. https://semgrep.dev/blog/2023/gpt4-and-semgrep-detailed/
  48. Is the cure worse than the disease? overfitting in automated program repair. In FSE 2015. ACM, 532–543.
  49. Snyk. 2021. SAST tools speed comparison: Snyk Code vs SonarQube and LGTM. https://snyk.io/blog/sast-tools-speed-comparison-snyk-code-sonarqube-lgtm/
  50. Snyk. 2023a. Fix code vulnerabilities automatically. https://docs.snyk.io/scan-using-snyk/snyk-code/exploring-and-working-with-snyk-code-results-in-the-web-ui/fix-code-issues-automatically-with-deepcode-ai-fix-suggestions/
  51. Snyk. 2023b. Snyk Code - Developer-focused, real-time SAST. https://snyk.io/product/snyk-code/
  52. Neural Program Repair by Jointly Learning to Localize and Repair. In ICLR 2019.
  53. Veracode. 2023. Veracode Fix. https://www.veracode.com/fix
  54. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP 2021. ACL, 8696–8708.
  55. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Trans. Software Eng. 43 (2017), 34–55.
  56. A comprehensive study of automatic program repair on the QuixBugs benchmark. J. Syst. Softw. 171 (2021).
  57. Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure-Inducing Input. IEEE Trans. Software Eng. 28, 2 (2002), 183–200. https://doi.org/10.1109/32.988498

Summary

  • The paper's main contribution is applying a novel context reduction technique to isolate essential code segments for fixing security vulnerabilities using LLMs.
  • It integrates static analysis with a modified HDD approach to accurately extract the bug context and streamline the automated repair process.
  • Experimental results demonstrate that the CodeReduce technique significantly improves both functional (Pass@k) and syntactic (ExactMatch@k) correctness compared to traditional baselines.

DeepCode AI Fix: Fixing Security Vulnerabilities with LLMs

The paper "DeepCode AI Fix: Fixing Security Vulnerabilities with LLMs" (2402.13291) details an innovative approach to using LLMs for automatic bug fixing, particularly focusing on security vulnerabilities. The traditional challenges faced in automated program repair (APR) are addressed through a novel context reduction technique that leverages static analysis, thus enhancing the problem-solving capacity of LLMs in complex programming tasks. This essay provides a comprehensive overview of the methodology, experimental evaluation, and key contributions of the paper.

Problem Statement and Solution Approach

The paper identifies two primary challenges in using LLMs for automated bug fixing: the need to learn long-distance code relationships and the scarcity of high-quality datasets for complex semantic bugs. To tackle these, the authors propose the integration of code reduction techniques to minimize the context required by the LLM, thereby improving the learning and inference process.
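To make "minimizing the context" concrete, consider a path traversal defect, a vulnerability class the paper references via OWASP. The snippet below is a hypothetical illustration, not an example from the paper: in a real file, the flagged flow may be buried among hundreds of unrelated lines, while the reduced input keeps roughly only what is shown here, the user-controlled source, its flow, and the sensitive sink.

```python
# Hypothetical path traversal defect, used to illustrate context reduction.
# In the full file these lines may be far apart; a reduced snippet keeps
# only the user-controlled source, its flow, and the flagged sink.
import os

UPLOAD_DIR = "/srv/uploads"

def read_upload(filename: str) -> bytes:
    # Defect: `filename` is user-controlled, so "../../etc/passwd"
    # escapes UPLOAD_DIR. This source-to-sink flow is the context the
    # model needs in order to propose a fix.
    path = os.path.join(UPLOAD_DIR, filename)
    with open(path, "rb") as f:  # sink reported by the analyzer
        return f.read()
```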

Code Reduction Technique:

The core idea is to apply program analysis to isolate the essential code segments necessary for fixing a particular bug. By limiting the LLM's attention to these segments, the model can be trained and fine-tuned more efficiently, requiring significantly less data.

Figure 1: Pipeline for automatic bug fixing with (a) only an LLM or (b) with the complete process proposed by DeepCode AI Fix, combining CodeReduce with an LLM.
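The flow in Figure 1(b) can be outlined as: analyze, reduce, query the model on the short snippet, merge the fix back, and re-check. The Python sketch below is an assumed outline of that loop; the injected helpers (`analyze`, `code_reduce`, `llm_generate`, `merge_back`) are illustrative stand-ins, not the paper's actual interfaces.

```python
# Assumed outline of the Figure 1(b) pipeline: detect, reduce, query,
# merge back, and verify. All injected helpers are hypothetical stand-ins.
from typing import Callable, List

def fix_defect(
    source: str,
    analyze: Callable[[str], List[object]],     # static analyzer, e.g. Snyk Code
    code_reduce: Callable[[str, object], str],  # the CodeReduce step
    llm_generate: Callable[[str], str],         # LLM call on the reduced code
    merge_back: Callable[[str, str, str], str], # the MergeBack step
) -> str:
    reports = analyze(source)
    if not reports:
        return source                           # nothing to fix
    reduced = code_reduce(source, reports[0])   # short snippet with the defect
    fixed_reduced = llm_generate(
        f"Fix the reported defect in this code:\n{reduced}"
    )
    candidate = merge_back(source, reduced, fixed_reduced)
    # Keep the candidate only if re-analysis no longer reports a defect.
    return candidate if not analyze(candidate) else source
```

Injecting the helpers keeps the sketch runnable and makes explicit that the analyzer, reducer, model, and merge step are separate components that can be swapped independently.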

Implementation Details

The implementation of DeepCode AI Fix involves several critical components:

  1. Static Code Analysis: The system uses Snyk Code for static analysis capable of detecting a wide range of security vulnerabilities. The analysis identifies the program paths and data flows that are crucial for understanding the bug's context.
  2. CodeReduce Algorithm: A modified version of Hierarchical Delta Debugging (HDD) applied to the program's abstract syntax tree (AST). By iteratively removing non-essential nodes from the AST, it produces a minimal version of the code that still contains the reported defect, preserving the bug's characteristics within a much smaller context (a sketch of this reduction loop follows the list).
  3. MergeBack: After the LLM generates a candidate fix on the reduced code, the changes are merged back into the original codebase, ensuring the fix applies in the full program context without disrupting existing functionality.
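A minimal sketch of such a reduction loop is shown below, assuming a generic tree of code chunks and an `oracle` that re-runs the analyzer and returns True while the defect is still reported. The paper's actual algorithm works on concrete language ASTs and also preserves the dataflow context the fix needs; this greedy sketch omits that refinement.

```python
# Greedy HDD-style reduction sketch: walk the tree level by level and
# drop any child whose removal keeps the defect reproducible.
# `Node`, `render`, and the oracle are illustrative stand-ins.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)

def render(node: Node) -> str:
    """Serialize the (partially pruned) tree back into source text."""
    return node.text + "".join(render(c) for c in node.children)

def hdd_reduce(root: Node, oracle: Callable[[str], bool]) -> Node:
    """Remove every subtree that is not needed to keep the defect reported."""
    level = [root]
    while level:
        next_level: List[Node] = []
        for node in level:
            i = 0
            while i < len(node.children):
                child = node.children.pop(i)
                if oracle(render(root)):
                    continue                    # defect survives: drop child
                node.children.insert(i, child)  # defect vanished: keep child
                i += 1
            next_level.extend(node.children)    # descend into kept children
        level = next_level
    return root
```

Each oracle call re-runs the analyzer, so reduction cost is dominated by analysis time; pruning whole subtrees near the root first, as HDD does, keeps the number of calls manageable.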

Experimental Evaluation

The evaluation compares DeepCode AI Fix against baseline models that use different context-extraction configurations. Models trained with CodeReduce performed best across all metrics, especially on complex security issues that demand deep contextual understanding.

Performance Metrics:

  • Pass@k: Measures the fraction of defects for which at least one of the k generated fixes removes the reported issue, i.e., the analyzer no longer flags it, indicating functional correctness.
  • ExactMatch@k: Measures syntactic correctness by checking whether any of the k generated fixes exactly matches the human-written fix. A scoring sketch for both metrics follows this list.
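Both metrics can be scored over k sampled fixes per defect, counting a defect as solved if any of its k samples succeeds, the usual @k convention. In the sketch below, `defect_removed` (re-running the analyzer on the patched file) and `normalize` (formatting-insensitive comparison) are assumed helpers, not the paper's evaluation harness.

```python
# Illustrative @k scoring over k sampled fixes per defect.
# `defect_removed` and `normalize` are assumed helpers.
from typing import Callable, List

def pass_at_k(samples_per_defect: List[List[str]],
              defect_removed: Callable[[str], bool]) -> float:
    # A defect counts if any of its k sampled fixes removes the report.
    hits = sum(any(defect_removed(s) for s in samples)
               for samples in samples_per_defect)
    return hits / len(samples_per_defect)

def exact_match_at_k(samples_per_defect: List[List[str]],
                     human_fixes: List[str],
                     normalize: Callable[[str], str]) -> float:
    # A defect counts if any sampled fix equals the human fix
    # after normalization.
    hits = sum(any(normalize(s) == normalize(h) for s in samples)
               for samples, h in zip(samples_per_defect, human_fixes))
    return hits / len(samples_per_defect)
```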

The approaches using CodeReduce outperformed traditional baselines, including those fed the full file context and window-based models like TFix that see only a fixed range of lines around the defect. This indicates a substantial improvement in both the efficiency and accuracy of fixing security vulnerabilities.

Implications and Future Work

The integration of static analysis with LLMs marks a significant advancement in automated program repair, particularly in handling complex vulnerabilities that require understanding intricate code relationships. This approach not only enhances the fix accuracy but also reduces the computational load on LLMs.

Future developments could explore extending these techniques to other programming languages and potentially integrating runtime analysis for even more robust bug detection and fixing mechanisms. Additionally, improving the MergeBack algorithm holds promise for further increasing the applicability and reliability of the proposed system.

Conclusion

"DeepCode AI Fix: Fixing Security Vulnerabilities with LLMs" provides a robust framework for augmenting the capabilities of LLMs in the field of automated program repair. By concentrating the model's attention on the most relevant code segments, it achieves remarkable improvements in both efficiency and effectiveness, promising a more secure and streamlined process for software vulnerability management.
