LLMDFA: Analyzing Dataflow in Code with Large Language Models (2402.10754v2)

Published 16 Feb 2024 in cs.PL, cs.LG, and cs.SE

Abstract: Dataflow analysis is a fundamental code analysis technique that identifies dependencies between program values. Traditional approaches typically necessitate successful compilation and expert customization, hindering their applicability and usability for analyzing uncompilable programs with evolving analysis needs in real-world scenarios. This paper presents LLMDFA, an LLM-powered compilation-free and customizable dataflow analysis framework. To address hallucinations for reliable results, we decompose the problem into several subtasks and introduce a series of novel strategies. Specifically, we leverage LLMs to synthesize code that outsources delicate reasoning to external expert tools, such as using a parsing library to extract program values of interest and invoking an automated theorem prover to validate path feasibility. Additionally, we adopt few-shot chain-of-thought prompting to summarize dataflow facts in individual functions, aligning the LLMs with the program semantics of small code snippets to mitigate hallucinations. We evaluate LLMDFA on synthetic programs to detect three representative types of bugs and on real-world Android applications for customized bug detection. On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35. We have open-sourced LLMDFA at https://github.com/chengpeng-wang/LLMDFA.

Summary

  • The paper presents LLMDFA, a framework that leverages large language models to perform dataflow analysis on code without requiring compilation.
  • The methodology decomposes the analysis into source/sink extraction, chain-of-thought-guided dataflow summarization, and SMT-based path feasibility validation.
  • Evaluations on three representative bug types, including divide-by-zero and cross-site scripting (XSS), show that LLMDFA outperforms existing static analysis tools in precision and recall.

The paper "When Dataflow Analysis Meets LLMs" introduces LLMDFA, a novel framework that uses LLMs to perform dataflow analysis on code. Dataflow analysis is a technique used to understand how data values propagate through a program, and it's useful for tasks like code optimization and bug detection.

The authors address limitations in traditional dataflow analysis techniques, which often require complete, compilable programs and significant manual customization for specific applications. LLMDFA, in contrast, can analyze code snippets without needing a full compilation environment and can automatically adapt to different downstream tasks.

The core idea behind LLMDFA is to leverage the ability of LLMs to understand and interpret code. The framework breaks down the dataflow analysis problem into three phases:

  1. Source/Sink Extraction: Identifying the starting and ending points of the data flows relevant to the analysis task (e.g., variables that may be zero for divide-by-zero detection, or potential sources and destinations of tainted data in a cross-site scripting scenario). LLMDFA synthesizes scripts that invoke a parsing library to extract sources and sinks precisely; a sketch of such an extractor follows this list.
  2. Dataflow Summarization: Determining how data flows within individual functions and producing summaries of the data dependencies. LLMDFA uses few-shot Chain-of-Thought (CoT) prompting to summarize candidate dataflow facts function by function; an example prompt template appears below.
  3. Path Feasibility Validation: Checking whether the identified data flows are actually realizable given the program's logic and control flow. LLMDFA synthesizes scripts that encode path conditions as logical constraints and discharge them with an SMT solver; see the feasibility-check sketch below.
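
As an illustration of the first phase, the sketch below shows the kind of extractor script LLMDFA could synthesize for divide-by-zero sinks. It assumes recent tree-sitter Python bindings with the Java grammar (the `tree_sitter` and `tree_sitter_java` packages); the snippet, helper name, and output format are illustrative assumptions rather than the paper's actual generated code.

```python
# Hypothetical extractor for divide-by-zero sinks: parse a Java snippet with
# tree-sitter and report the divisor operand of every division expression.
import tree_sitter_java
from tree_sitter import Language, Parser

JAVA = Language(tree_sitter_java.language())
parser = Parser(JAVA)

code = b"""
class Demo {
    int f(int x, int y) {
        int d = y - 1;
        return x / d;
    }
}
"""

def collect_divisors(node, out):
    # A division is a binary_expression whose operator field is "/".
    if node.type == "binary_expression":
        op = node.child_by_field_name("operator")
        if op is not None and op.type == "/":
            out.append(node.child_by_field_name("right"))
    for child in node.children:
        collect_divisors(child, out)

sinks = []
collect_divisors(parser.parse(code).root_node, sinks)
for sink in sinks:
    row, col = sink.start_point
    print(f"sink: divisor `{sink.text.decode()}` at line {row + 1}, column {col + 1}")
```

Delegating extraction to a parser in this way keeps the LLM from having to name line numbers and identifiers itself, which is exactly where hallucinations tend to creep in; the model only has to write the script once per analysis task.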

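For the second phase, the few-shot CoT prompt pairs a small worked example with the target function so the model reasons step by step about intra-procedural flows. The template below is a hypothetical reconstruction of that style, not the paper's actual prompt.

```python
# Hypothetical few-shot CoT prompt for intra-procedural dataflow summarization.
EXEMPLAR = """Function:
    int g(int a) {      // line 1
        int b = a + 1;  // line 2
        return b;       // line 3
    }
Question: Does the value of `a` at line 1 flow to the return value at line 3?
Reasoning: `b` at line 2 is computed from `a`, and line 3 returns `b`,
so `a` (line 1) flows to the return value (line 3). Answer: yes.
"""

def summarization_prompt(function_src: str, source_desc: str, sink_desc: str) -> str:
    # One worked exemplar, then the same style of question about the target pair.
    return (
        "Summarize intra-procedural dataflow facts.\n\n"
        + EXEMPLAR
        + "\nFunction:\n" + function_src
        + f"\nQuestion: Does {source_desc} flow to {sink_desc}? Think step by step.\n"
    )
```

For the third phase, the synthesized validation script turns a candidate path into logical constraints and asks an SMT solver whether they are satisfiable. The sketch below uses the Z3 Python bindings (`z3-solver`); the guard and divisor definition are an invented scenario for illustration.

```python
# Hypothetical path-feasibility check with Z3: a reported flow reaches a
# division `x / d` along a branch guarded by `y > 1`, where `d = y - 1`.
# The report is a real divide-by-zero only if `d == 0` is reachable there.
from z3 import Int, Solver, sat

y, d = Int("y"), Int("d")
solver = Solver()
solver.add(y > 1)       # branch condition on the reported path
solver.add(d == y - 1)  # dataflow fact collected along the path
solver.add(d == 0)      # bug condition at the sink

if solver.check() == sat:
    print("feasible: keep the divide-by-zero report")
else:
    print("infeasible: discard the report")
```

Here the constraints are unsatisfiable (y > 1 forces d >= 1), so this particular report would be filtered out as a false positive.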
The authors evaluate LLMDFA on synthetic programs from the Juliet Test Suite covering three representative bug types, including divide-by-zero (DBZ) and cross-site scripting (XSS), as well as on real-world Android applications for customized bug detection. On average, LLMDFA achieves 87.10% precision and 80.77% recall, outperforming existing techniques with F1 score improvements of up to 0.35. The paper also includes ablation studies that demonstrate the contribution of each component of LLMDFA.

In essence, the paper proposes a new way to perform dataflow analysis that is more flexible and easier to customize by harnessing the power of LLMs and external tools.
