
KLPEG: Graph-Enhanced Incremental Game Testing

Updated 12 November 2025
  • KLPEG is a framework that integrates a structured knowledge graph with LLM-based parsing to generate targeted test cases for incremental game updates.
  • It employs multi-hop graph reasoning and soft-attention mechanisms to localize update impacts and improve test precision.
  • Evaluations in Overcooked and Minecraft show superior bug detection and test efficiency compared to baseline methods.

The KLPEG (Knowledge Graph-Enhanced LLM for Incremental Game PlayTesting) framework is a methodology that integrates persistent knowledge representation via a structured Knowledge Graph (KG) with LLM-driven parsing and test-case generation to enable efficient, update-focused playtesting for games subject to continuous, incremental change. The framework is designed to address the challenges of specificity, scalability, and knowledge accumulation in the context of modern live-service and sandbox games, where rapid iteration and frequent updates necessitate adaptive, targeted, and reusable testing workflows.

1. Formalization and Motivations

KLPEG directly addresses the limitations of conventional LLM-based playtesting pipelines, which lack persistent, structured memory and thus struggle to efficiently localize and validate the precise consequences of incremental game updates. Given a knowledge graph $G_t$ representing game knowledge at version $t$, and a natural-language update log $U_{t \rightarrow t+1}$ describing the changes that produce version $t+1$, KLPEG formalizes the incremental playtesting objective as inducing a function

$$\mathcal{F} : (G_t, U_{t\rightarrow t+1}) \longmapsto T_{t+1}$$

where $T_{t+1}$ is a set of automatically constructed, update-tailored test cases. The strategy prioritizes efficiency by isolating the subgraph implicated by the update, enabling targeted test-case generation rather than global regression.
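
The objective $\mathcal{F}$ can be given a concrete shape as a typed interface. The class and function names below are hypothetical, sketching only the input/output contract rather than KLPEG's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    """G_t as a set of (head, relation, tail) triples at game version t."""
    triples: set = field(default_factory=set)

@dataclass
class TestCase:
    """One element of T_{t+1}: an objective plus ordered test steps."""
    objective: str
    steps: list

def generate_tests(g_t: KnowledgeGraph, update_log: str) -> list:
    """F : (G_t, U_{t->t+1}) -> T_{t+1}; this body stands in for the KLPEG pipeline."""
    raise NotImplementedError("placeholder for the full KLPEG pipeline")
```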

2. Construction and Maintenance of the Knowledge Graph

The core of KLPEG is a directed, labeled knowledge graph $G = (V, E, \mathcal{R})$ with the following semantics:

  • Nodes ($V$): Represent game elements (e.g., "Wooden Pickaxe"), tasks/quests (e.g., "Craft Iron Sword"), and update artifacts (e.g., "Added Diamond Pickaxe"). Each node $v$ is associated with a feature vector $\mathbf{f}_v \in \mathbb{R}^d$ encoding categorical, textual, or metadata attributes.
  • Relations ($\mathcal{R}$): A finite set of relation types such as “mines,” “depends_on,” or “triggers.”
  • Triples and Edges ($E$): Relation-labeled edges $(u, r, v)$, possibly carrying per-relation weights or attributes.
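
A minimal in-memory realization of this directed, labeled graph might look as follows; the entity names come from the examples above, but the class design is an assumption, not KLPEG's actual implementation:

```python
class KG:
    """Directed, labeled knowledge graph stored as (head, relation, tail) triples."""
    def __init__(self):
        self.nodes = set()
        self.edges = set()       # {(head, relation, tail)}
        self.features = {}       # node -> feature vector f_v (e.g., list of floats)

    def add_triple(self, head, relation, tail):
        self.nodes.update((head, tail))
        self.edges.add((head, relation, tail))

    def neighbors(self, node, relation=None):
        """Tails reachable from `node`, optionally restricted to one relation type."""
        return {t for (h, r, t) in self.edges
                if h == node and (relation is None or r == relation)}

kg = KG()
kg.add_triple("Wooden Pickaxe", "mines", "Stone")
kg.add_triple("Craft Iron Sword", "depends_on", "Iron Ingot")
```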

Population of the KG is accomplished through three extractor pipelines:

  1. Parsing recorded RL-agent trajectories.
  2. Analyzing state-change logs via scripts.
  3. Applying regular expression rules to textual prompts.

Each extractor emits triples of the form $(\mathrm{Head}, \mathrm{Relation}, \mathrm{Tail})$, which are entered as knowledge into the graph. Eight principal triple types are catalogued (e.g., "Game Element Interaction," "Task Dependency," "Scene Transition"), forming a systematic ontology for incremental test planning.
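
The third pipeline, applying regular-expression rules to text, can be sketched as below; the pattern and the input sentence are illustrative assumptions, not the paper's actual extraction rules:

```python
import re

# Matches "<head> <relation> <tail>" for a small, assumed relation vocabulary.
PATTERN = re.compile(
    r"(?P<head>\w[\w ]*?) (?P<rel>mines|depends_on|triggers) (?P<tail>\w[\w ]*)"
)

def extract_triples(text):
    """Emit (Head, Relation, Tail) triples found in free text."""
    return [(m["head"].strip(), m["rel"], m["tail"].strip())
            for m in PATTERN.finditer(text)]

extract_triples("Diamond Pickaxe mines Obsidian")
# -> [("Diamond Pickaxe", "mines", "Obsidian")]
```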

3. Multi-Hop Graph Reasoning and Update Localization

Upon receiving a new update log $U_{t\to t+1}$, an LLM-based extractor parses the update to yield $\Delta E$, the set of new or modified triples. The updated knowledge graph is then

$$G_{t+1} = (V_t, E_t \cup \Delta E, \mathcal{R}).$$

To localize the impact of the update, KLPEG employs bounded multi-hop traversal, collecting for each updated node $u$

$$\mathcal{I}_u = \left\{ v \in V_{t+1} \mid \exists\, \text{path}\; u \xrightarrow{r_1} v_1 \cdots \xrightarrow{r_k} v,\; k \le K \right\}$$

and taking the union $\bigcup_u \mathcal{I}_u$. An alternative, soft-attention-based scoring function supports more nuanced propagation:

$$s(u, v) = \frac{\exp(\mathbf{h}_u^\top W_r \mathbf{h}_v)}{\sum_{w\in\mathcal{N}_r(u)} \exp(\mathbf{h}_u^\top W_r \mathbf{h}_w)}$$

where $\mathbf{h}_u, \mathbf{h}_v$ denote node embeddings, $W_r$ is a learned relation-specific matrix, and $\mathcal{N}_r(u)$ is the $r$-typed neighborhood of $u$. These attention scores rank and propagate messages over the affected subgraph; embeddings are updated via graph neural network layers

$$\mathbf{h}_v^{(\ell+1)} = \sigma \left( \sum_{(u, r)\in \mathcal{N}(v)} \alpha_{ur} W_r \mathbf{h}_u^{(\ell)} \right), \quad \alpha_{ur} = s(u, v),$$

with $\sigma$ denoting a nonlinearity.
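
The hard (bounded-traversal) variant of $\mathcal{I}_u$ reduces to a depth-limited BFS over the triple set. A minimal sketch, with illustrative node names:

```python
from collections import deque

def impact_set(edges, u, K):
    """I_u: nodes reachable from updated node u within K relation hops."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == K:
            continue                          # hop budget exhausted on this branch
        for (h, r, t) in edges:
            if h == node and t not in seen:
                seen.add(t)
                frontier.append((t, depth + 1))
    return seen - {u}

edges = {("A", "triggers", "B"), ("B", "depends_on", "C"), ("C", "triggers", "D")}
impact_set(edges, "A", K=2)   # {"B", "C"}: D is three hops away
```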

4. LLM Integration and Prompting Mechanisms

LLMs execute four roles within KLPEG, coordinated via prompt templates that ensure structured, JSON output:

  • Knowledge Extractor Prompt: Parses observations to knowledge triples (“You are a game data extractor”).
  • Update Log Parser Prompt: Identifies new/modified triples from update logs (“You are an update log analyzer”).
  • Impact Scope Inferencer Prompt: Invokes graph reasoning given a changed element to produce $\mathcal{I}_u$.
  • Test Case Generator Prompt: Given affected content, generates stepwise test cases (“Input: {impact_description}”).

All LLM outputs are strictly in JSON, supporting pipeline automation and subsequent test orchestration.
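
The strict-JSON contract is what makes each role's output machine-checkable. The template wording and schema below are assumptions for illustration, not KLPEG's published prompts:

```python
import json

# Hypothetical template for the Update Log Parser role.
UPDATE_PARSER_PROMPT = (
    "You are an update log analyzer. Return ONLY JSON of the form "
    '{"new_triples": [["head", "relation", "tail"], ...]}.\n'
    "Update log: {log}"
)

def parse_llm_output(raw):
    """Validate the model's reply as JSON and unpack the triples; fails loudly otherwise."""
    data = json.loads(raw)
    return [tuple(t) for t in data["new_triples"]]

parse_llm_output('{"new_triples": [["Diamond Pickaxe", "mines", "Obsidian"]]}')
```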

5. End-to-End Test Case Generation Algorithm

KLPEG implements an incremental test case generation pipeline as follows:

  1. Input: Prior KG $G_t$ and update log $U$.
  2. Parse Update: Apply the LLM to extract $\Delta E$.
  3. Synchronize KG: Update to $G_{t+1} = G_t \cup \Delta E$.
  4. Impact Inference: For each updated node $u$, compute $\mathcal{I}_u$.
  5. Task Selection: Find all tasks $T \subseteq V$ with dependency paths to nodes in $\bigcup_u \mathcal{I}_u$.
  6. Test Case Synthesis: For each $t \in T$, prompt the LLM with relevant context to generate $(\text{Objective}, \text{Steps})$ as a JSON test case.
  7. Output: Return the set of all generated test cases.
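
The steps above can be wired together as follows, with the two LLM calls stubbed out; every helper name and the one-hop impact approximation are simplifying assumptions made for this sketch:

```python
def parse_update(log):
    """Step 2 stub: a real system would call the Update Log Parser LLM here."""
    return {("Diamond Pickaxe", "mines", "Obsidian")}

def impacted(edges, updated_nodes):
    """Step 4, simplified to a single hop for brevity."""
    return {t for (h, r, t) in edges if h in updated_nodes} | updated_nodes

def run_pipeline(g_t, log):
    delta = parse_update(log)                     # step 2
    g_next = g_t | delta                          # step 3
    touched = {h for h, _, _ in delta} | {t for _, _, t in delta}
    scope = impacted(g_next, touched)             # step 4
    tasks = {h for (h, r, t) in g_next            # step 5: tasks depending on scope
             if r == "depends_on" and t in scope}
    return [{"objective": f"Verify task '{task}' after update",   # step 6 stub
             "steps": ["(steps generated by the LLM)"]}
            for task in sorted(tasks)]            # step 7

g_t = {("Craft Diamond Sword", "depends_on", "Diamond Pickaxe")}
run_pipeline(g_t, "Added Diamond Pickaxe that mines Obsidian")
```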

This modular architecture enables adaptation to various underlying LLM back-ends and supports robust automation.

6. Experimental Evaluation and Benchmarks

KLPEG has been evaluated in the Overcooked (2D grid-world) and Minecraft (3D sandbox, high-level API) environments, simulating multiple update types (feature additions, bug fixes, rule changes) and the controlled injection of known bugs. The following metrics are utilized:

  • Targeted Element Coverage: updated elements accessed by tests / total updated elements; observed at or near 100%.
  • Targeted Interaction Ratio: $\mathrm{IntRatio} = \frac{\#\text{actions on updated elements}}{\#\text{total actions}}$; observed $\geq 0.90$.
  • Bug Detection Ratio: fraction of injected, update-related bugs triggered and flagged; 1.00 (Overcooked), 0.93–1.00 (Minecraft).
  • Average Steps: mean actions per test case; 6–10.
  • Total Test Time: time from reading $U$ to test completion; 30–70 seconds.
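
For instance, the Targeted Interaction Ratio can be computed directly from an action trace; the trace format used here is an illustrative assumption:

```python
def interaction_ratio(actions, updated_elements):
    """IntRatio = (# actions on updated elements) / (# total actions)."""
    hits = sum(1 for a in actions if a["target"] in updated_elements)
    return hits / len(actions)

trace = [{"target": "Diamond Pickaxe"}, {"target": "Obsidian"},
         {"target": "Oven"}, {"target": "Diamond Pickaxe"}]
interaction_ratio(trace, {"Diamond Pickaxe", "Obsidian"})  # 0.75
```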

KLPEG is benchmarked against RANDOM (random action sampling), GA (genetic algorithm), CD-PPO (curiosity-driven RL), and a naïve LLM baseline (generating tests from UU without a KG). KLPEG achieves strictly superior focus and efficiency, with ablation (removal of the KG) causing 20–30% loss in interaction ratio and degraded bug detection performance.

7. Limitations, Extensions, and Research Directions

KLPEG’s primary advantage is a persistent, update-aware, causally structured memory, enabling fine-grained, reusable, and tailored playtesting. Limitations and avenues for further work include:

  • Graph Completeness: Incomplete or erroneous triples degrade impact localization; graph completion or human-in-the-loop curation is warranted.
  • Scalability: Large environments produce massive graphs; investigation into partitioned, hierarchical, or specialized graph databases is prioritized.
  • Log Ambiguity: Terse or ambiguous update logs limit performance; enhanced log parsing or automated, code-driven summarization is a possible solution.
  • Extraction Cost: Extractor scripts are efficient for standard logs, but visual or highly unstructured logs would require novel extraction mechanisms.
  • Coverage of Indirect Regressions: Focusing primarily on update-adjacent regions may bypass regressions in legacy areas; hybrid approaches combining targeted and global regression testing may provide optimal breadth and depth.
  • Broader Applicability: Prospective extensions include modeling of UI logic, physics, broader software engineering domains (web, enterprise), and standardization of update-log formats.

The KLPEG framework establishes that structured, graph-based knowledge integration with LLMs yields substantial gains in the practicality and scientific rigor of automated incremental playtesting, supporting both present-day industry needs and future research on adaptive, knowledge-driven software testing workflows (Mu et al., 4 Nov 2025).
