Automated Repair of AI Code with Large Language Models and Formal Verification (2405.08848v1)
Abstract: The next generation of AI systems requires strong safety guarantees. This report looks at the software implementation of neural networks and the associated memory safety properties, including NULL pointer dereference, out-of-bounds access, double-free, and memory leaks. Our goal is to detect these vulnerabilities and automatically repair them with the help of large language models (LLMs). To this end, we first expand NeuroCodeBench, an existing dataset of neural network code, to about 81k programs via an automated process of program mutation. Then, we verify the memory safety of the mutated neural network implementations with ESBMC, a state-of-the-art software verifier. Whenever ESBMC spots a vulnerability, we invoke an LLM to repair the source code. For the latter task, we compare the performance of several state-of-the-art prompt engineering techniques and an iterative approach that repeatedly calls the LLM.