This paper investigates the capability of LLMs to generate both game rules and levels simultaneously. Unlike previous work that primarily focused on level generation for fixed game rules, this research proposes a framework called LLMGG (Generating Games via LLMs) that leverages Video Game Description Language (VGDL) as a structured representation for both aspects of game design.
The core of the LLMGG framework is an LLM that receives a text-based prompt describing the desired game and is expected to output the game's rules and levels in VGDL format. This VGDL output can then be parsed by a compatible engine, such as GVGAI Gym, to create a playable game instance. The framework is designed to be general: it can interact with the LLM iteratively for refinement and can be used with different LLM backbones.
Video Game Description Language (VGDL) is chosen as the representation language due to its human-readable yet machine-parsable nature. A VGDL game description typically includes four main components (a minimal example follows the list):
- SpriteSet: Defines the types of objects (sprites) that exist in the game and their properties.
- LevelMapping: Maps characters used in the level text file to one or more sprites defined in the SpriteSet.
- InteractionSet: Defines what happens when two different types of sprites collide or interact.
- TerminationSet: Defines the conditions under which the game ends (win, lose, draw).
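For concreteness, a minimal Maze-style description might look like the following sketch. It loosely follows GVGAI-style VGDL conventions; the particular sprite classes, names, and layout are illustrative assumptions, not an example taken from the paper.

```
BasicGame
    SpriteSet
        wall   > Immovable
        goal   > Immovable color=GREEN
        avatar > MovingAvatar
    LevelMapping
        W > wall
        G > goal
        A > avatar
    InteractionSet
        avatar wall > stepBack
        goal avatar > killSprite
    TerminationSet
        SpriteCounter stype=goal limit=0 win=True
```

A matching level is a plain text grid that uses the LevelMapping characters directly:

```
WWWWWWW
WA    W
W WWW W
W    GW
WWWWWWW
```

Here `goal avatar > killSprite` removes the goal on contact with the avatar, which drives the goal count to zero and triggers the win condition.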
The paper explores the impact of prompt design on the LLM's ability to generate correct VGDL. Prompts consist of a basic instruction (requesting a VGDL game and level) and optional context; a sample prompt is sketched after this list. The context can include:
- Level notation mapping (e.g., 'W' for wall).
- VGDL grammar descriptions (Base rules, Type Constraints C1 and C2 specifying allowed sprite classes, interaction methods such as `killSprite` or `removeSprite`, and termination classes).
- Examples of complete VGDL games.
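Assembled, a full-context prompt might read roughly as follows. This is a hypothetical paraphrase for illustration, not the paper's verbatim prompt:

```
Generate a game and one level in VGDL.
Level notation: 'W' denotes a wall, 'A' the avatar, and 'G' the goal.
Grammar: a VGDL game contains a SpriteSet, LevelMapping, InteractionSet,
and TerminationSet. Use only the allowed sprite classes (C1), only the
allowed interaction methods, e.g. killSprite or removeSprite (C2), and
only the allowed termination classes.
Example game:
<a complete VGDL example game would be included here>
```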
Experiments were conducted using GPT-3.5, GPT-4, and Gemma 7B with seven prompt variations (P1–P7) that combine these context elements. Each prompt was tested over 10 trials on the task of generating a simple Maze game.
To evaluate the generated output, the paper defines rule-based text validation metrics (an illustrative failure case follows the list):
- Parsable: The VGDL syntax must be valid and recognizable by a VGDL engine.
- Logical: The generated VGDL must define all mandatory components (SpriteSet, LevelMapping, InteractionSet, TerminationSet) and ensure basic game logic completeness (e.g., defining interactions for avatar-wall and avatar-goal, having a win condition).
- Mappable: Characters used in the level must have correct mappings to sprites defined in the rules, and essential sprites must be present in the level.
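To see how these checks differ, consider a hypothetical generated output (abbreviated), paired with a level containing the characters 'W', 'A', and 'G':

```
BasicGame
    SpriteSet
        wall   > Immovable
        avatar > MovingAvatar
    LevelMapping
        W > wall
        A > avatar
    InteractionSet
        avatar wall > stepBack
```

By the paper's definitions this output is Parsable (the syntax is valid), but it is not Logical (no TerminationSet or win condition, no avatar-goal interaction, no goal sprite) and not Mappable (the level character 'G' has no mapping to any defined sprite).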
The experimental results highlight the critical role of context. Without sufficient context, the models struggled to generate valid VGDL. Adding VGDL grammar descriptions and examples significantly improved the parsability and logical correctness of the generated output, especially for GPT-4. Gemma 7B, in contrast, failed to produce parsable VGDL in most trials.
A key finding relates to LLM hallucination, particularly concerning game logic. Even when generating parsable VGDL, LLMs sometimes created illogical rules. For instance, `avatar goal > killSprite` in the InteractionSet means in VGDL that the avatar is removed upon collision with the goal, yet LLMs often misinterpreted it as removing the goal. This demonstrates a mismatch between the LLM's natural-language understanding of word order and the specific syntax conventions of VGDL. The paper found that aligning the VGDL syntax with the LLM's likely interpretation, such as writing `goal avatar > killSprite`, or introducing a custom interaction `removeSprite` in which the second sprite (the goal) is removed (`avatar goal > removeSprite`), could mitigate this hallucination. Prompts incorporating this syntactic alignment (P5 and P7) produced logically correct and playable games at a markedly higher rate.
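Under the proposed alignment, a Maze InteractionSet would read as in this sketch (the avatar-wall `stepBack` rule is a standard VGDL method, assumed here for completeness):

```
InteractionSet
    avatar wall > stepBack
    avatar goal > removeSprite
```

Here `removeSprite` deletes the second-listed sprite (the goal), matching the natural-language reading that the avatar removes the goal, whereas the standard `killSprite` in the same position would delete the avatar itself.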
GPT-4 with the most comprehensive context (P7), which included the level notation, the full grammar description, the proposed `removeSprite` interaction constraint, and a complete VGDL example, achieved a 100% success rate: all 10 trials produced games that were Parsable, Logical, Mappable, and ultimately Correct (playable with the expected behavior). Prompts with less context, or less capable LLMs, produced errors at a higher frequency, including syntax errors, missing components, illogical interactions, and mapping issues (see Appendix Table 3 for the error breakdown).
The paper concludes that LLMs hold significant potential for generating game rules and levels simultaneously using VGDL, enabling non-experts to prototype games via natural-language prompts. However, it also emphasizes that hallucination around domain-specific syntax (such as VGDL interaction semantics) remains a limitation. Providing rich context, and adapting the domain language's syntax to better align with LLM priors, can improve performance. Even so, human intervention remains necessary, especially for more complex and diverse games, to correct errors and guide the generation process. Future work could extend this approach to 3D game generation.