KLPEG: Graph-Enhanced Incremental Game Testing
- KLPEG is a framework that integrates a structured knowledge graph with LLM-based parsing to generate targeted test cases for incremental game updates.
- It employs multi-hop graph reasoning and soft-attention mechanisms to localize update impacts and improve test precision.
- Evaluations in Overcooked and Minecraft show superior bug detection and test efficiency compared to baseline methods.
The KLPEG (Knowledge Graph-Enhanced LLM for Incremental Game PlayTesting) framework is a methodology that integrates persistent knowledge representation via a structured Knowledge Graph (KG) with LLM-driven parsing and test-case generation to enable efficient, update-focused playtesting for games subject to continuous, incremental change. The framework is designed to address the challenges of specificity, scalability, and knowledge accumulation in the context of modern live-service and sandbox games, where rapid iteration and frequent updates necessitate adaptive, targeted, and reusable testing workflows.
1. Formalization and Motivations
KLPEG directly addresses the limitations of conventional LLM-based playtesting pipelines, which lack the capacity for persistent, structured memory and thus struggle to efficiently localize and validate the precise consequences of incremental game updates. Given a knowledge graph $G_t$ representing game knowledge at version $t$, and a natural-language update log $L_{t+1}$ describing the changes that produce version $t+1$, KLPEG formalizes the incremental playtesting objective as inducing a function

$$f : (G_t, L_{t+1}) \mapsto \mathcal{T}_{t+1},$$

where $\mathcal{T}_{t+1}$ is a set of automatically constructed, update-tailored test cases. The strategy prioritizes efficiency by isolating the subgraph implicated by the update, enabling targeted test case generation rather than global regression.
2. Construction and Maintenance of the Knowledge Graph
The core of KLPEG is a directed, labeled knowledge graph $G = (V, R, E)$ with the following semantics:
- Nodes ($V$): Represent game elements (e.g., "Wooden Pickaxe"), tasks/quests (e.g., "Craft Iron Sword"), and update artifacts (e.g., "Added Diamond Pickaxe"). Each node is associated with a feature vector encoding categorical, textual, or metadata attributes.
- Relations ($R$): A finite set of relation types such as “mines,” “depends_on,” or “triggers.”
- Triples and Edges ($E$): Relation-labeled edges $(h, r, t)$ with $h, t \in V$ and $r \in R$, possibly carrying per-relation weights or attributes.
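The graph structure above can be sketched as follows. This is a minimal illustration; the class and method names (`Triple`, `KnowledgeGraph`, `neighbors`) are hypothetical and not drawn from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    head: str      # source node, e.g. "Iron Pickaxe"
    relation: str  # relation type from R, e.g. "mines"
    tail: str      # target node, e.g. "Diamond Ore"

@dataclass
class KnowledgeGraph:
    triples: set = field(default_factory=set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.triples.add(Triple(head, relation, tail))

    def neighbors(self, node: str) -> set:
        """Nodes reachable from `node` over any relation (one hop)."""
        return {t.tail for t in self.triples if t.head == node}

kg = KnowledgeGraph()
kg.add("Iron Pickaxe", "mines", "Diamond Ore")
kg.add("Craft Iron Sword", "depends_on", "Iron Ingot")
```

A production system would likely back this with a graph database rather than an in-memory set, but the triple-as-edge representation is the same.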
Population of the KG is accomplished through three extractor pipelines:
- Parsing recorded RL-agent trajectories.
- Analyzing state-change logs via scripts.
- Applying regular expression rules to textual prompts.
Each extractor emits triples of the form $(h, r, t)$, which are entered into the graph. Eight principal triple types are catalogued (e.g., "Game Element Interaction," "Task Dependency," "Scene Transition"), forming a systematic ontology for incremental test planning.
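The third extractor pipeline (regex rules over text) can be sketched as below. The patterns and log phrasing are assumptions for illustration; the paper states only that regular-expression rules map textual input to triples.

```python
import re

# Hypothetical (pattern, relation) rules; real rules depend on the game's log format.
PATTERNS = [
    # "Player crafted Iron Sword using Iron Ingot" -> (Iron Sword, crafted_from, Iron Ingot)
    (re.compile(r"crafted (?P<head>[\w ]+) using (?P<tail>[\w ]+)"), "crafted_from"),
    # "Iron Pickaxe mined Diamond Ore" -> (Iron Pickaxe, mines, Diamond Ore)
    (re.compile(r"(?P<head>[\w ]+) mined (?P<tail>[\w ]+)"), "mines"),
]

def extract_triples(log_line: str) -> list:
    """Apply each regex rule to a log line and emit (head, relation, tail) triples."""
    triples = []
    for pattern, relation in PATTERNS:
        m = pattern.search(log_line)
        if m:
            triples.append((m.group("head").strip(), relation, m.group("tail").strip()))
    return triples
```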
3. Multi-Hop Graph Reasoning and Update Localization
Upon receiving a new update log $L_{t+1}$, the update is parsed by an LLM-based extractor to yield $\Delta E_{t+1}$, the set of new or modified triples. The updated knowledge graph is then

$$G_{t+1} = G_t \cup \Delta E_{t+1}.$$
To localize the impact of the update, KLPEG employs bounded multi-hop traversal: for each updated node $v$, it computes the $k$-hop neighborhood $N_k(v)$ and collects the union $S = \bigcup_{v} N_k(v)$. An alternative, soft-attention-based scoring function is defined for more nuanced propagation via

$$\alpha_{uv} = \mathrm{softmax}_{v \in N_r(u)}\!\left(e_u^{\top} W_r\, e_v\right),$$

where $e_u, e_v$ denote node embeddings, $W_r$ is a learned relation-specific matrix, and $N_r(u)$ is the $r$-typed neighborhood of $u$. Propagation and attention mechanics permit ranking and message passing over affected subgraphs; embeddings are updated via graph neural network layers:

$$e_u^{(l+1)} = \sigma\!\left(\sum_{r \in R} \sum_{v \in N_r(u)} \alpha_{uv}\, W_r\, e_v^{(l)}\right),$$

with $\sigma$ denoting a nonlinearity.
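The bounded multi-hop traversal can be sketched as a breadth-first search to depth $k$ over the triple set. The traversal direction (head to tail) and the hop budget are assumptions of this sketch.

```python
from collections import deque

def k_hop_neighborhood(triples: list, start: str, k: int) -> set:
    """All nodes reachable from `start` within k hops over (head, relation, tail) edges."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # hop budget exhausted on this branch
        for h, _, t in triples:
            if h == node and t not in seen:
                seen.add(t)
                frontier.append((t, depth + 1))
    return seen - {start}

def impact_set(triples: list, updated_nodes: set, k: int = 2) -> set:
    """Union of k-hop neighborhoods over all updated nodes."""
    out = set()
    for v in updated_nodes:
        out |= k_hop_neighborhood(triples, v, k)
    return out
```

A linear scan per hop is fine for small graphs; a real deployment would index triples by head node first.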
4. LLM Integration and Prompting Mechanisms
LLMs execute four roles within KLPEG, coordinated via prompt templates that ensure structured, JSON output:
- Knowledge Extractor Prompt: Parses observations to knowledge triples (“You are a game data extractor”).
- Update Log Parser Prompt: Identifies new/modified triples from update logs (“You are an update log analyzer”).
- Impact Scope Inferencer Prompt: Invokes graph reasoning given a changed element to produce the affected subgraph $S$.
- Test Case Generator Prompt: Given affected content, generates stepwise test cases (“Input: {impact_description}”).
All LLM outputs are strictly in JSON, supporting pipeline automation and subsequent test orchestration.
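Enforcing this JSON-only contract might look like the sketch below, assuming a triple-extractor response shaped as a JSON list of objects with `head`, `relation`, and `tail` keys; the exact schema is an assumption, since the paper specifies only that outputs are structured JSON.

```python
import json

TRIPLE_KEYS = {"head", "relation", "tail"}  # assumed schema for extractor output

def parse_triple_response(raw: str) -> list:
    """Validate that an extractor response is a JSON list of well-formed triples."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    triples = []
    for item in data:
        if not TRIPLE_KEYS <= item.keys():
            raise ValueError(f"missing keys in {item}")
        triples.append((item["head"], item["relation"], item["tail"]))
    return triples
```

Rejecting malformed responses at this boundary is what lets the rest of the pipeline stay fully automated.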
5. End-to-End Test Case Generation Algorithm
KLPEG implements an incremental test case generation pipeline as follows:
- Input: Prior KG $G_t$ and update log $L_{t+1}$.
- Parse Update: Apply the LLM to extract $\Delta E_{t+1}$.
- Synchronize KG: Update $G_t$ to $G_{t+1} = G_t \cup \Delta E_{t+1}$.
- Impact Inference: For each updated node $v$, compute $N_k(v)$ and collect $S = \bigcup_{v} N_k(v)$.
- Task Selection: Find all tasks $\tau$ with dependency paths to nodes in $S$.
- Test Case Synthesis: For each $\tau$, prompt the LLM with relevant context to generate a JSON test case.
- Output: Return the set $\mathcal{T}_{t+1}$ of all generated test cases.
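The steps above can be condensed into one function. The two `llm_*` functions stand in for prompted LLM calls and are replaced by trivial stubs here; impact inference is single-hop for brevity, and all names and the example triple are illustrative.

```python
def llm_parse_update(update_log: str) -> list:
    # Stub for the Update Log Parser prompt: pretend the log adds one triple.
    return [("Diamond Pickaxe", "mines", "Obsidian")]

def llm_generate_test(task: str, context: list) -> dict:
    # Stub for the Test Case Generator prompt: emit a stepwise JSON-like test case.
    return {"task": task, "steps": [f"verify {h} {r} {t}" for h, r, t in context]}

def incremental_testgen(kg_triples: list, update_log: str) -> list:
    delta = llm_parse_update(update_log)                  # 1. parse update
    kg_triples = kg_triples + delta                       # 2. synchronize KG
    updated = {h for h, _, _ in delta}                    # 3. impact inference (1-hop)
    affected = {t for h, _, t in kg_triples if h in updated}
    tasks = [h for h, r, t in kg_triples                  # 4. task selection
             if r == "depends_on" and t in affected | updated]
    return [llm_generate_test(task, delta) for task in tasks]  # 5. synthesis
```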
This modular architecture enables adaptation to various underlying LLM back-ends and supports robust automation.
6. Experimental Evaluation and Benchmarks
KLPEG has been evaluated in the Overcooked (2D grid-world) and Minecraft (3D sandbox, high-level API) environments, simulating multiple update types (feature additions, bug fixes, rule changes) and the controlled injection of known bugs. The following metrics are utilized:
| Metric | Definition | Observed KLPEG Result |
|---|---|---|
| Targeted Element Coverage | Updated elements accessed by tests/total updated elements | 100% or near-100% |
| Targeted Interaction Ratio | Test interactions involving updated elements/total test interactions | |
| Bug Detection Ratio | Fraction of injected, update-related bugs triggered and flagged | 1.00 (Overcooked), 0.93–1.00 (Minecraft) |
| Average Steps | Mean actions per test case | 6–10 |
| Total Test Time | Time from reading the update log to test completion | 30–70 seconds |
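The two set-ratio metrics in the table reduce to simple computations over a test run; the sketch below uses illustrative set arguments, not data from the evaluation.

```python
def targeted_element_coverage(updated: set, tested: set) -> float:
    """Updated elements touched by tests / total updated elements."""
    return len(updated & tested) / len(updated)

def bug_detection_ratio(injected: set, flagged: set) -> float:
    """Injected update-related bugs that tests triggered and flagged / total injected."""
    return len(injected & flagged) / len(injected)
```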
KLPEG is benchmarked against RANDOM (random action sampling), GA (genetic algorithm), CD-PPO (curiosity-driven RL), and a naïve LLM baseline that generates tests from the update log alone, without a KG. KLPEG achieves strictly superior focus and efficiency, with ablation (removal of the KG) causing a 20–30% loss in interaction ratio and degraded bug detection performance.
7. Limitations, Extensions, and Research Directions
KLPEG’s primary advantage is a persistent, update-aware, causally structured memory, enabling fine-grained, reusable, and tailored playtesting. Limitations and avenues for further work include:
- Graph Completeness: Incomplete or erroneous triples degrade impact localization; graph completion or human-in-the-loop curation is warranted.
- Scalability: Large environments produce massive graphs; investigation into partitioned, hierarchical, or specialized graph databases is prioritized.
- Log Ambiguity: Terse or ambiguous update logs limit performance; enhanced log parsing or automated, code-driven summarization is a possible solution.
- Extraction Cost: Extractor scripts are implemented efficiently for standard logs, but visual or highly unstructured logs would require novel extraction mechanisms.
- Coverage of Indirect Regressions: Focusing primarily on update-adjacent regions may bypass regressions in legacy areas; hybrid approaches combining targeted and global regression testing may provide optimal breadth and depth.
- Broader Applicability: Prospective extensions include modeling of UI logic, physics, broader software engineering domains (web, enterprise), and standardization of update-log formats.
The KLPEG framework establishes that structured, graph-based knowledge integration with LLMs yields substantial gains in the practicality and scientific rigor of automated incremental playtesting, supporting both present-day industry needs and future research on adaptive, knowledge-driven software testing workflows (Mu et al., 4 Nov 2025).