E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification (2312.08477v1)
Abstract: Static analysis, the process of examining code without executing it, is crucial for identifying software issues. Yet it is hampered by its complexity and by the need for customization to different analysis targets. Traditional static analysis tools require extensive human effort and are often limited to specific target programs and programming languages. Recent advances in LLMs, such as GPT-4 and Llama, offer new capabilities for software engineering tasks; however, their application to static analysis, especially to understanding complex code structures, remains under-explored. This paper introduces E&V, a novel approach that leverages LLMs to perform static analysis. Specifically, E&V employs LLMs to simulate the execution of pseudo-code, effectively carrying out the static analysis encoded in that pseudo-code with minimal human effort and thereby improving the accuracy of the results. E&V includes a verification process for pseudo-code execution that requires no external oracle, allowing it to mitigate LLM hallucinations and further enhance the accuracy of the static analysis results. We have implemented E&V in a prototype tool for triaging crashes through backward taint analysis. Paired with GPT-4-32k, this prototype has been applied to triage 170 recently fixed Linux kernel bugs across seven bug categories. Our experiments show that the prototype correctly identifies the blamed function in 81.2% of the cases, and that our novel verification process significantly improves accuracy, raising it from 28.2% to 81.2%.
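The core idea the abstract describes (prompting an LLM to execute a pseudo-code analysis and then verifying the claimed execution without an external oracle) can be sketched in a few lines. Everything below is an illustrative assumption: the pseudo-code, the prompt wording, the `build_execution_prompt` and `verify_trace` helpers, and the consistency rule are hypothetical stand-ins, not the paper's actual prompts or verification procedure.

```python
# Hypothetical sketch of the E&V workflow: (1) ask an LLM to simulate a
# pseudo-code backward taint analysis step by step, (2) cross-check that the
# model's final answer is consistent with its own reported execution trace.

TAINT_PSEUDOCODE = """\
procedure BackwardTaint(crash_var, func):
    taint <- {crash_var}
    for stmt in reverse(statements(func)):
        if stmt defines some v in taint:
            taint <- (taint - {v}) union uses(stmt)
    return the functions that last wrote any variable in taint
"""

def build_execution_prompt(source: str, crash_var: str) -> str:
    """Build a prompt asking the model to *execute* the pseudo-code,
    printing the taint set after every statement (the execution trace)."""
    return (
        "Execute the following pseudo-code on the given source code, "
        "printing the taint set after every statement.\n\n"
        f"{TAINT_PSEUDOCODE}\n"
        f"crash_var = {crash_var}\n\nSource:\n{source}"
    )

def verify_trace(trace: list[set], blamed_vars: set) -> bool:
    """Cheap internal-consistency check requiring no external oracle:
    the variables the model blames must appear in its final taint set."""
    return bool(trace) and blamed_vars <= trace[-1]

# Usage: construct the prompt, then check a (mocked) model response.
prompt = build_execution_prompt("p = q; *p = 0;", "p")
consistent = verify_trace([{"p"}, {"q"}], {"q"})
```

In this sketch an inconsistent response (an empty trace, or a blamed variable absent from the final taint set) would be rejected and re-queried, which mirrors how the verification step could filter hallucinated executions before accepting a blamed function.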
Authors: Yu Hao, Weiteng Chen, Ziqiao Zhou, Weidong Cui