LLMCFG-TGen: LLM-Driven CFG Test Generation
- The paper introduces a method that leverages LLMs to construct precise control flow graphs from natural language use cases, enabling systematic test generation.
- It employs zero-shot prompting and DFS-based path enumeration to convert CFG paths into schema-compliant, abstract test cases.
- Benchmark results and practitioner feedback highlight high accuracy and minimal redundancy compared to direct LLM test generation approaches.
Test Generation based on LLM-generated Control Flow Graphs (LLMCFG-TGen) is an approach for automatic test case creation from natural language (NL) use-case descriptions. The method leverages LLMs to generate precise Control Flow Graphs (CFGs) that explicitly capture all behavioral branches, ensuring structured, comprehensive, and non-redundant requirements-based test generation (RBTG). By coupling LLM-driven semantic reasoning with systematic modeling, LLMCFG-TGen addresses core deficiencies of direct LLM-based test generation, such as incomplete coverage and failure to encode conditional logic, and establishes a robust path from NL requirements to systematic testing artifacts (Yang et al., 6 Dec 2025).
1. Formal CFG Representation for Use Cases
LLMCFG-TGen models each use case as a directed CFG, where $N$ is a finite set of nodes representing use-case steps from the main, alternative, and exception flows; $E$ is the set of directed edges encoding possible transitions; $n_0$ is the unique entry node; and $N_{exit}$ is the exit set. Guard conditions, i.e., labels for edge predicates such as "true"/"false" for conditional branches, are attached as textual annotations on edges. Formally,

$$G = (N, E, n_0, N_{exit}), \qquad E \subseteq N \times N, \quad n_0 \in N, \quad N_{exit} \subseteq N.$$
CFG nodes and edges are directly derived from the NL steps in the use case, mapping sequential actions, conditional splits, exception handling, and loops (with back-edges) into the graph structure. For path enumeration, cyclic paths are pruned to avoid combinatorial explosion.
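For illustration, a minimal in-memory representation of such a CFG could look like the following Python sketch; the class layout and field names are assumptions for exposition, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CFG:
    """Directed control flow graph G = (N, E, n0, N_exit) for one use case (illustrative layout)."""
    nodes: dict[str, str]                  # node id -> NL step text (N)
    edges: list[tuple[str, str, str]]      # (source, target, guard label) triples (E); guard may be ""
    entry: str                             # unique entry node id (n0)
    exits: set[str] = field(default_factory=set)  # exit node ids (N_exit)

    def successors(self, node_id: str) -> list[tuple[str, str]]:
        """Return (target, guard) pairs reachable in one step from node_id."""
        return [(dst, guard) for src, dst, guard in self.edges if src == node_id]
```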
2. LLM-Based CFG Construction Algorithm
CFG construction is performed using LLM zero-shot prompting, architected as a multi-block input specifying role, extraction instructions, algorithm specification in NL pseudocode, output schema, and the raw use case. The prompt instructs the LLM to extract steps as nodes, generate edges for both linear and branching control flow, and avoid isolated nodes. The required output is a JSON object with "nodes" and "edges."
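A hedged sketch of such a multi-block prompt, with illustrative wording rather than the paper's exact template, might be assembled as follows:

```python
CFG_PROMPT_TEMPLATE = """\
# Role
You are a requirements analyst who converts use cases into control flow graphs.

# Extraction instructions
Extract every step of the main, alternative, and exception flows as a node.
Create edges for sequential transitions and for each branch of a condition,
labeling conditional edges with their guard ("true"/"false"). Do not leave
isolated nodes.

# Algorithm (NL pseudocode)
1. Merge all flows into one ordered list of steps.
2. Assign each step a node id.
3. Starting from the entry node, add edges for sequential or conditional logic.

# Output schema
Return a single JSON object with keys "nodes" and "edges".

# Use case
{use_case}
"""

def make_cfg_prompt(use_case_text: str) -> str:
    """Fill the raw use-case description into the zero-shot prompt."""
    return CFG_PROMPT_TEMPLATE.format(use_case=use_case_text)
```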
A typical CFG generation algorithm entails: merging all use-case flows into a sequence, assigning each step to a graph node, starting from the entry node $n_0$, and iterating over the steps to establish edges for sequential or conditional logic, with guard conditions attached as appropriate. Postprocessing includes orphan-node, isolated-node, and reachability checks on the output; if validation fails, the generation is retried at a fixed temperature.
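The postprocessing step could be sketched as below; the JSON field names ("id", "source", "target"), the helper `call_llm`, and the retry budget are assumptions, with `make_cfg_prompt` taken from the prompt sketch above:

```python
import json

def validate_cfg(cfg: dict) -> bool:
    """Structural checks on the LLM output: single entry, no isolated nodes, full reachability."""
    node_ids = {n["id"] for n in cfg["nodes"]}
    incoming = {e["target"] for e in cfg["edges"]}
    outgoing = {e["source"] for e in cfg["edges"]}
    entries = node_ids - incoming                       # nodes with no incoming edge
    if len(entries) != 1:                               # must have a unique entry node
        return False
    if node_ids - incoming - outgoing - entries:        # orphan / isolated nodes
        return False
    # every node must be reachable from the entry node (simple worklist traversal)
    entry = next(iter(entries))
    reached, frontier = {entry}, [entry]
    while frontier:
        cur = frontier.pop()
        for e in cfg["edges"]:
            if e["source"] == cur and e["target"] not in reached:
                reached.add(e["target"])
                frontier.append(e["target"])
    return node_ids <= reached

def build_cfg(use_case: str, max_retries: int = 2) -> dict:
    """Query the LLM at a fixed temperature and retry when structural validation fails."""
    for _ in range(1 + max_retries):
        raw = call_llm(make_cfg_prompt(use_case), temperature=0.0)  # hypothetical LLM wrapper
        try:
            cfg = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if validate_cfg(cfg):
            return cfg
    raise ValueError("CFG generation failed validation after retries")
```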
3. Path Enumeration from LLMCFG
Execution paths from $n_0$ to the exit nodes in $N_{exit}$ are enumerated via depth-first search (DFS) with cycle pruning. Each path represents a unique test-relevant trace through the use-case logic. Whenever a node would be revisited (to avoid infinite cycles), the current path is saved and the search backtracks. The resultant paths are lists of node ids (with condition steps annotated). The computational complexity is $O(|P| \cdot |N|)$, where the number of paths $|P|$ is exponential in the number of branching points in an acyclic CFG ($2^b$ paths for $b$ binary branches).
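A minimal sketch of this enumeration, reusing the `CFG` structure sketched earlier (the function name is assumed), might read:

```python
def enumerate_paths(cfg: CFG) -> list[list[str]]:
    """Enumerate execution paths from the entry node toward exit nodes via DFS, pruning cycles."""
    paths: list[list[str]] = []

    def dfs(node: str, current: list[str]) -> None:
        if node in current:                  # revisit: save the path so far and backtrack
            paths.append(current.copy())
            return
        current.append(node)
        if node in cfg.exits or not cfg.successors(node):
            paths.append(current.copy())     # reached an exit (or dead end): save the trace
        else:
            for target, _guard in cfg.successors(node):
                dfs(target, current)
        current.pop()                        # backtrack

    dfs(cfg.entry, [])
    return paths
```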
4. Test Case Synthesis from Execution Paths
For each enumerated path, a second structured LLM prompt translates the path and original use case into an abstract test case. The output covers:
- Title: Scenario summary derived from early steps.
- Preconditions: Extracted from the use-case preamble or implicit actor/system setup.
- Steps: Sequential description, interleaving path actions and explicit verification of guard conditions ("Verify condition X is {true|false}.").
- Expected Result: Derived from the outcome encoded at the exit node.
A schema-compliant JSON is required for each generated test case. If parsing or validation fails, one retry is permitted.
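As an illustration, an abstract test case conforming to such a schema might look like the sketch below, borrowing the "View Alert" scenario from the worked example in Section 6; the key names and field contents are illustrative assumptions:

```python
REQUIRED_FIELDS = {"title", "preconditions", "steps", "expected_result"}  # assumed key names

example_test_case = {
    "title": "View alert details when an alert is active",
    "preconditions": ["User is logged in", "The alert dashboard is reachable"],
    "steps": [
        "User selects 'View Alert'.",
        "Verify condition 'Is Alert Active?' is true.",
        "System shows the alert details.",
    ],
    "expected_result": "The alert details are displayed to the user.",
}

def is_schema_compliant(tc: dict) -> bool:
    """Reject test cases with missing or empty required fields (one retry is then permitted)."""
    return all(tc.get(f) for f in REQUIRED_FIELDS)
```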
5. Quantitative and Practitioner Evaluation
LLMCFG-TGen's effectiveness is benchmarked by rigorous metrics:
- Node F1: $0.895$, Edge F1: $0.761$, nGED: $0.933$ on 42 use cases.
- Test-case coverage and redundancy, compared to LLM (Direct) and AGORA baselines:
| Metric | LLM (Direct) | AGORA | LLMCFG-TGen |
|---|---|---|---|
| Total test cases | 137 | 97 | 103 |
| DiscRate (%) | 57.0 | 33.3 | 2.38 |
| Avg | 1.12 | 0.45 | 0.02 |
- Discrepancy Rate (DiscRate) measures deviation from ground-truth path counts; LLMCFG-TGen achieves near-exact matching with only 2.38% discrepancy, outperforming both baselines.
- LLM model comparison (under identical prompt/schema):
| Model | Node F1 | Edge F1 | nGED | DiscRate (%) |
|---|---|---|---|---|
| GPT-4o | 0.895 | 0.761 | 0.933 | 2.30 |
| Gemini 2.5 | 0.862 | 0.683 | 0.912 | 16.67 |
| LLaMA4 | 0.865 | 0.696 | 0.909 | 11.90 |
- GPT-4o achieves the highest accuracy and speed for this workflow.
Practitioner assessment (1 QA manager and 3 senior test engineers, 20 use cases, four dimensions rated on a Likert scale):
| Dimension | AGORA | LLMCFG-TGen |
|---|---|---|
| Relevance | 4.25 | 4.75 |
| Completeness | 3.84 | 4.64 |
| Correctness | 3.74 | 4.51 |
| Clarity | 3.73 | 4.48 |
Practitioners reported that LLMCFG-TGen produced more comprehensive, less redundant, and easier-to-follow test cases.
6. Worked Example and Practitioner Feedback
The "View Alert" use case illustrates the CFG and path-based approach:
- CFG nodes include: Start, View Alert, Is Alert Active? (predicate), Show Details, Display "No Alert", End.
- Paths:
- Start → View Alert → Is Alert Active? (true) → Show Details → End
- Start → View Alert → Is Alert Active? (false) → Display "No Alert" → End
Abstract test cases correspond 1:1 with these unique control flow paths.
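Using the `CFG` and `enumerate_paths` sketches from earlier, this example can be reproduced as follows (node ids are assumed labels):

```python
view_alert = CFG(
    nodes={
        "n0": "Start",
        "n1": "View Alert",
        "n2": "Is Alert Active?",      # predicate node
        "n3": "Show Details",
        "n4": 'Display "No Alert"',
        "n5": "End",
    },
    edges=[
        ("n0", "n1", ""),
        ("n1", "n2", ""),
        ("n2", "n3", "true"),
        ("n2", "n4", "false"),
        ("n3", "n5", ""),
        ("n4", "n5", ""),
    ],
    entry="n0",
    exits={"n5"},
)

for path in enumerate_paths(view_alert):
    print(" -> ".join(view_alert.nodes[n] for n in path))
# Start -> View Alert -> Is Alert Active? -> Show Details -> End
# Start -> View Alert -> Is Alert Active? -> Display "No Alert" -> End
```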
Practitioner commentary emphasizes that LLMCFG-TGen eliminates both redundant and missing cases, contrasting with direct LLM prompting approaches, which produced overlapping tests (Yang et al., 6 Dec 2025).
7. Strengths, Limitations, and Future Directions
LLMCFG-TGen addresses the hallucination, case repetition, and incomplete coverage found in prior LLM-based test generation by imposing an explicit CFG-based structure. Its main limitations are:
- Processes one use case at a time—no batching or end-to-end prioritization.
- Output is abstract (does not generate concrete input data or executable scripts).
- No current support for test-case prioritization.
Planned future enhancements include batch/modular CFG construction, prioritization via path weighting, extension to executable scripts through linkage to code stubs, and support for human-in-the-loop resolution of ambiguous or multimodal requirements (Yang et al., 6 Dec 2025).
LLMCFG-TGen demonstrates that integrating LLM-driven semantic analysis with formalized program modeling delivers systematic, high-quality test suites directly from natural language requirements, effectively bridging NL requirements and systematic test generation.