LLMDFA: LLM-Driven Dataflow Analysis

Updated 14 June 2026

LLMDFA is a methodology that leverages LLMs as autonomous code interpreters, replacing traditional compilation and manual module engineering.
It decomposes dataflow analysis into source/sink extraction, CoT-based summarization, and SMT-driven path feasibility validation.
LLMDFA employs deterministic tools like tree-sitter and Z3 to ensure path-sensitive, interprocedural dependency accuracy while mitigating LLM hallucinations.

LLMDFA (LLM-Driven Dataflow Analysis) denotes a set of methodologies, algorithms, and frameworks that orchestrate LLMs for dataflow analysis tasks on code, with particular emphasis on removing the traditional reliance on compilation, handcrafted modules, and static intermediate representations. LLMDFA treats LLMs as autonomous code interpreters and pairs them with deterministic expert tools (e.g., syntax parsers, SMT solvers) to derive path-sensitive, interprocedural dependency information directly from raw source code or textual API documentation (Wang et al., 2024, Masoudian et al., 30 Mar 2026).

1. Motivation and Research Questions

LLMDFA approaches address fundamental obstacles in traditional dataflow analysis frameworks. Classical tools such as LLVM-based SVF, FlowDroid, and industrial Datalog engines (e.g., CodeFuseQuery) presuppose successfully compiling the codebase and often require substantial manual engineering—such as custom modules for extraction, source/sink definitions, and precision adjustments. This requirement limits applicability to uncompilable code and necessitates expert-level modification for evolving bug patterns or domain-specific queries.

LLMDFA frameworks pursue the following research questions:

Can LLMs accurately extract sources and sinks for downstream dataflow tasks without manual module implementation?
How can LLM-induced hallucinations be mitigated, especially during accumulation and propagation of intra-procedural facts to interprocedural summaries?
To what extent can path-sensitive soundness be achieved by integrating LLMs with constraint solvers, compensating for their limited symbolic reasoning ability?
How do LLMDFA systems compare with classical analyzers and end-to-end LLM prompting baselines in terms of precision, recall, and F1, particularly on real-world data and diverse bug-detection scenarios?

These questions frame the design and evaluation of new pipelines such as that of Wang et al. (2024).

2. Methodological Decomposition and Workflow

LLMDFA systems are structured as multi-phase pipelines that decompose the dataflow analysis problem into a series of subroutines orchestrated by LLM coordination, tool synthesis, and deterministic validation. A canonical architecture (as in (Wang et al., 2024)) involves three principal phases:

Phase	Functionality	LLM Interaction
Source/Sink Extraction	Code pattern localization	Autoregressive script synthesis leveraging parsing libraries (e.g., tree-sitter) with iterative refinement cycles for correctness
Dataflow Summarization	Intra-procedural dependency inference	Few-shot chain-of-thought (CoT) prompting to produce step-by-step dataflow summaries between program variables
Path Feasibility Validation	Interprocedural path soundness checking	Script synthesis to encode path conditions as SMT (e.g., Z3) constraints, verifying feasibility and filtering propagations

This factorization is explicitly designed to circumvent LLM hallucinations and to ensure that subjective reasoning (e.g., variable extraction or flow interpretation) is offloaded to scripts that can be validated by deterministic expert tools.

Phase I typically consumes a natural-language specification of code patterns (e.g., "divisor arguments of / and % operators") and a small set of annotated sample programs. The LLM is tasked with synthesizing a Python extractor utilizing, for example, tree-sitter for AST traversal. Exception feedback loops prompt the LLM to iteratively repair the script until it matches labeled data.
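The synthesize, validate, and repair loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of Wang et al. (2024): the LLM is replaced by a stub that returns pre-written candidates, and the names `validate`, `synthesize`, `lines_with`, and `make_stub_llm` are hypothetical. The text-matching extractors stand in for real tree-sitter scripts.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical types: an extractor maps source text to a set of 1-based line numbers.
Extractor = Callable[[str], Set[int]]
Example = Tuple[str, Set[int]]


def validate(extractor: Extractor, examples: List[Example]) -> Optional[str]:
    """Return feedback describing the first failure, or None if every example passes."""
    for source, expected in examples:
        try:
            got = extractor(source)
        except Exception as exc:  # exceptions are fed back to the model as repair hints
            return f"exception: {exc!r}"
        if got != expected:
            return f"mismatch: expected {sorted(expected)}, got {sorted(got)}"
    return None


def synthesize(propose: Callable[[Optional[str]], Extractor],
               examples: List[Example], max_rounds: int = 5) -> Optional[Extractor]:
    """Ask for candidates until one matches the labeled data or the budget runs out."""
    feedback = None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        feedback = validate(candidate, examples)
        if feedback is None:
            return candidate
    return None


def lines_with(ops: Tuple[str, ...]) -> Extractor:
    """Toy extractor: lines that contain any of the given operator tokens."""
    def extractor(source: str) -> Set[int]:
        return {i for i, line in enumerate(source.splitlines(), start=1)
                if any(op in line for op in ops)}
    return extractor


# One labeled sample program: lines 1 and 3 contain a divisor.
EXAMPLES: List[Example] = [("x = a / b\ny = a + b\nz = a % c", {1, 3})]


def make_stub_llm():
    """Stand-in for the LLM: a first attempt that forgets '%', then a repaired one."""
    candidates = iter([lines_with(("/",)), lines_with(("/", "%"))])
    feedback_log: List[Optional[str]] = []

    def propose(feedback: Optional[str]) -> Extractor:
        feedback_log.append(feedback)
        return next(candidates)

    return propose, feedback_log
```

In a real setting `propose` would send the specification, the sample programs, and the latest feedback to the model and load the returned script; the loop structure stays the same.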

Phase II employs CoT prompting: the LLM is shown multiple annotated function snippets to encourage explicit stepwise reasoning about dataflow. Given a candidate variable pair (e.g., $x@\ell_9 \hookrightarrow z@\ell_{13}$), the LLM explains whether and how a flow exists.
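A few-shot CoT prompt of this kind can be assembled mechanically. The sketch below is illustrative only: the prompt wording, the `FewShotExample` fields, and the `build_cot_prompt` helper are assumptions, not the prompt used by Wang et al. (2024).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FewShotExample:
    """One annotated snippet (hypothetical schema)."""
    code: str
    source: str      # e.g. "x@l9"
    sink: str        # e.g. "z@l13"
    reasoning: str   # the step-by-step explanation shown to the model
    answer: str      # "Yes" or "No"


def build_cot_prompt(examples: List[FewShotExample],
                     code: str, source: str, sink: str) -> str:
    """Concatenate instructions, worked examples, and the open question."""
    parts = [
        "Decide whether the value of the source variable can reach the sink "
        "variable within the function. Reason step by step, then reply Yes or No."
    ]
    for ex in examples:
        parts.append(
            f"Code:\n{ex.code}\n"
            f"Question: Does {ex.source} flow to {ex.sink}?\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Answer: {ex.answer}"
        )
    # The final block is left open so the model continues with its own reasoning.
    parts.append(
        f"Code:\n{code}\n"
        f"Question: Does {source} flow to {sink}?\n"
        f"Reasoning:"
    )
    return "\n\n".join(parts)
```

Ending the prompt at "Reasoning:" nudges the model to produce the explanation before the verdict, which is the point of the CoT scaffolding.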

Phase III synthesizes SMT encoding scripts (frequently in Python via Z3) to encode branch guards and path conditions along potential source–sink chains, ensuring only paths that correspond to feasible executions are retained in the final summary.
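The filtering step can be illustrated without a solver. The sketch below swaps Z3 for a bounded brute-force search over small integers, so it is a stand-in only: unlike an SMT solver, it can miss witnesses outside the search range and therefore proves nothing about infeasibility. The names `Guard`, `find_witness`, and `keep_feasible`, and the two example paths, are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence

# A guard is a predicate over a variable assignment (hypothetical encoding).
Guard = Callable[[Dict[str, int]], bool]


def find_witness(guards: Sequence[Guard], variables: Sequence[str],
                 domain: range = range(-8, 9)) -> Optional[Dict[str, int]]:
    """Return an assignment satisfying every guard, or None if none exists in `domain`."""
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(guard(env) for guard in guards):
            return env
    return None


# Branch guards collected along two candidate source-sink paths (illustrative).
contradictory_path: List[Guard] = [lambda e: e["x"] > 0, lambda e: e["x"] < 0]
consistent_path: List[Guard] = [lambda e: e["x"] > 0, lambda e: e["x"] < 5]


def keep_feasible(paths: Dict[str, List[Guard]], variables: Sequence[str]) -> List[str]:
    """Retain only the paths whose conjunction of guards has a witness."""
    return [name for name, guards in paths.items()
            if find_witness(guards, variables) is not None]
```

A path guarded by both `x > 0` and `x < 0` is dropped, while one guarded by `x > 0` and `x < 5` is kept. In the pipeline described above, the same conjunction would be handed to Z3 as a satisfiability query instead of being enumerated.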

3. Techniques for Hallucination Mitigation and Reliability

A defining challenge in LLM-driven code analysis is hallucination: incorrect or unfounded model outputs. LLMDFA mitigates it through three decomposition strategies:

Extractor synthesis produces deterministic scripts, validated on concrete positive and negative examples.
CoT prompts break dependency reasoning into granular, traceable operations at the function level.
SMT-based feasibility validation delegates symbolic and constraint-intensive reasoning to Z3, an established external solver, so the LLM is not relied on for logical deduction in such tasks.

A representative CoT prompt instructs the model with concrete code and explicit reasoning chains (Wang et al., 2024).

The LLM is never asked to "guess" the answer directly; it is always scaffolded to either synthesize a script, perform an explicit reasoning sequence, or encode a formal constraint that is subsequently validated externally.

4. Integration with Expert Deterministic Tools

LLMDFA architectures consistently integrate two key expert systems:

Parsing libraries: e.g., tree-sitter, for program structure extraction and variable identification. Generated extractors operate deterministically once validated on ground-truth annotated examples.
SMT solvers: e.g., Z3 with Python API bindings, for constraint modeling and path feasibility checks. The synthesized scripts encode program guards as Z3 constraints; satisfiability is queried to determine realizability of candidate flows.

Wang et al. (2024) give an example extractor for a Java divisor sink and an example Z3 path feasibility script.
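To give a feel for what a divisor-sink extractor does, here is a minimal sketch that swaps the setting: it analyzes Python source with the standard-library `ast` module instead of Java source with tree-sitter. The function name `divisor_sinks` is hypothetical, and the code is not the extractor from the cited work.

```python
import ast
from typing import List, Tuple


def divisor_sinks(source: str) -> List[Tuple[int, str]]:
    """Return (line, expression text) for every divisor of `/` or `%`.

    Covers binary expressions (a / b) and augmented assignments (a /= b).
    Caveat: `%` used for string formatting is matched too, since the
    syntax tree alone does not carry operand types.
    """
    sinks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.BinOp, ast.AugAssign)) and \
                isinstance(node.op, (ast.Div, ast.Mod)):
            # The divisor is the right operand; AugAssign stores it in `value`.
            divisor = node.right if isinstance(node, ast.BinOp) else node.value
            sinks.append((divisor.lineno, ast.unparse(divisor)))
    return sorted(sinks)


SAMPLE = "a = x / y\nb = x + y\nc = x % (y - 1)\na /= z\n"
```

Calling `divisor_sinks(SAMPLE)` reports the divisors on lines 1, 3, and 4 and skips the addition on line 2. Because the traversal is plain tree matching, the result is the same on every run, which is the property the generated extractors rely on.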

5. Formalization of Dataflow Analysis

Key formal concepts provided in (Wang et al., 2024):

Let $G = (S, E_\ell)$ be the control-flow graph (CFG), where $S$ is the set of statements and $E_\ell(s, s')$ gives the Boolean guard under which $s'$ follows $s$.
A dataflow fact: $a@\ell_m \hookrightarrow b@\ell_n$ iff the value of variable $a$ at line $m$ may affect $b$ at line $n$.
For a path $p = s_1 s_2 \cdots s_k$, the path condition is $\bigwedge_{i=1}^{k-1} E_\ell(s_i, s_{i+1})$.

Extractor synthesis is described as iterative LLM sampling that repeats until the synthesized script passes all validation.

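Metrics for evaluation: $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, and $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where $TP$, $FP$, and $FN$ count true positives, false positives, and false negatives (Wang et al., 2024).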
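As a worked check of the standard definitions of precision, recall, and F1, the sketch below recomputes F1 from precision/recall pairs reported in the evaluation table of Section 6. The function names are hypothetical helpers, and the counts in the last line are made up for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported flows that are real; 0.0 when nothing is reported."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real flows that are reported; 0.0 when there are none."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Precision/recall pairs taken from the DBZ columns of the evaluation table.
llmdfa_overall_dbz = round(f1(0.7375, 0.9216), 2)   # table reports 0.82
codefusequery_dbz = round(f1(0.2941, 0.8108), 2)    # table reports 0.43

# Illustrative counts (not from the source): 8 true positives, 2 FP, 2 FN.
toy_f1 = f1(precision(8, 2), recall(8, 2))
```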

6. Evaluation, Customization, and Extensibility

LLMDFA demonstrates competitive or superior performance relative to classical analyzers and other LLM-based baselines. For example, on Juliet benchmark suites (CWE-369 DBZ, CWE-80 XSS):

System / Phase	DBZ Precision	DBZ Recall	DBZ F1	XSS Precision	XSS Recall	XSS F1
LLMDFA (Extract)	100%	100%	1.00	100%	100%	1.00
LLMDFA (Summarize)	90.95%	97.57%	0.94	86.52%	96.25%	0.91
LLMDFA (Validate)	81.58%	99.20%	0.90	100%	100%	1.00
LLMDFA (Overall)	73.75%	92.16%	0.82	100%	92.31%	0.96
CodeFuseQuery	29.41%	81.08%	0.43	92.26%	79.67%	0.86
End-to-end GPT-4	~40%	~85%	—	~65%	~90%	—

Ablations show the criticality of decomposition: direct LLM-based extraction or omitting CoT/SMT steps results in major drops in F1 (e.g., direct LLM extraction: F1 ≈ 0.27, no SMT validation: F1 ≈ 0.65).

Customization is straightforward: a user provides natural-language specifications for new bug patterns (e.g., "variables dereferenced without a prior null-check" for null-pointer dereference). LLMDFA then synthesizes new extractor and summarization pipelines without requiring code changes.

7. LLMDFA in Documentation-Based API Analysis

A structurally distinct application of LLMDFA logic is DAInfer+ (Masoudian et al., 30 Mar 2026), which extracts semantic dataflow and alias information from free-form API documentation rather than code, using neural embedding models (e.g., SBERT) and few-shot LLMs within an optimization-driven pipeline. The objective is to assign a memory-operation abstraction (e.g., Alloc, Load, Store, CopyShallow, CopyDeep) to each documented API method. The assignment is chosen by optimizing an objective that combines the LLM's log-probability score, a set of hard-consistency constraints, and a regularizer.

Zero-shot embedding–based cluster assignment (via cosine similarity) covers ∼70% of methods with high robustness. Few-shot LLM prompting handles residual ambiguities. Downstream MaxSMT search uses Z3 for hard constraint satisfaction. Empirical results: precision/recall of 85.1%/82.7% for dataflow, 83.3%/80.4% for aliasing on large Java libraries, median pipeline latency ~90 s/library.
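The split between embedding-based assignment and LLM fallback can be sketched with plain cosine similarity. Everything concrete here is an assumption for illustration: the two-dimensional toy vectors stand in for sentence embeddings of method documentation, the prototype vectors stand in for cluster centers, and the 0.8 threshold and the function names are hypothetical rather than taken from DAInfer+.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign(method_vecs: Dict[str, Sequence[float]],
           prototype_vecs: Dict[str, Sequence[float]],
           threshold: float = 0.8) -> Tuple[Dict[str, str], List[str]]:
    """Label each method with its nearest prototype, or defer it if the match is weak."""
    assigned: Dict[str, str] = {}
    residual: List[str] = []   # methods left for few-shot LLM prompting
    for name, vec in method_vecs.items():
        label, score = max(
            ((lab, cosine(vec, proto)) for lab, proto in prototype_vecs.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            assigned[name] = label
        else:
            residual.append(name)
    return assigned, residual


# Toy data: in practice these would be embeddings of documentation text.
PROTOTYPES = {"Load": [1.0, 0.0], "Store": [0.0, 1.0]}
METHODS = {"get": [0.9, 0.1], "put": [0.1, 0.95], "merge": [0.7, 0.7]}
```

With these toy vectors, `get` and `put` clear the threshold and are labeled directly, while `merge` sits between the two prototypes and is deferred to the residual list.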

A plausible implication is that embedding-driven groupings offer stable, cost-effective backbone inference, while LLM scoring and optimization provide high-fidelity, flexible refinement operating under symbolic constraint guarantees (Masoudian et al., 30 Mar 2026).

8. Limitations and Future Directions

LLMDFA efficacy is currently limited by several key factors:

Few-shot CoT examples make prompts long, which raises inference cost and latency; the approach is therefore best suited to modular or per-function analysis.
Dataflow summarization may degrade on very large functions or under complex aliasing (e.g., pointers).
SMT-based path encoding can misinterpret library calls and global variable semantics, affecting end-to-end soundness.

Active research directions include developing parallel prompting and prompt-caching strategies for efficiency, integrating learned alias/invariant models, and incorporating pattern-based SMT synthesis for robust approximation of library/global semantics (Wang et al., 2024, Masoudian et al., 30 Mar 2026). Extension into richer typestate protocols and cross-language API inference is suggested in recent work.

In summary, LLMDFA encapsulates a new paradigm of LLM-coordinated, compilation-free, and customizable dataflow analysis, leveraging explicit pipeline stratification and deterministic expert tool integration for both code and natural language settings. Its empirical superiority over both hand-engineered and naïve LLM prompting baselines underscores its promise as a generalizable protocol for large-scale, extensible code and API artifact analysis.

References

LLMDFA: Analyzing Dataflow in Code with Large Language Models (2024)

DAInfer+: Neurosymbolic Inference of API Specifications from Documentation via Embedding Models (2026)
