Word2World: LLM-Driven World Generation
- Word2World is a framework that converts text prompts into interactive worlds using multi-stage LLM pipelines.
- It employs a modular workflow that includes story generation, information extraction, iterative world assembly, and tile retrieval for feedback-driven refinement.
- Empirical evaluations highlight high playability and accuracy, emphasizing the necessity of multi-round assembly and agent-based feedback loops.
Word2World denotes a class of systems, architectures, and benchmarks that operationalize LLMs to generate, from textual prompts, symbolic or concrete representations of worlds in games, planning environments, or visual domains. These frameworks address both procedural content generation and symbolic planning, with variants that build coherent, playable environments or domain models exclusively via LLM-driven zero- or few-shot methodologies. At the technical level, Word2World systems deploy multi-module LLM-based pipelines for extraction, translation, world assembly, and retrieval; they are evaluated on playability, coherence, and execution-based metrics, with ablation studies underpinning the empirical findings (Nasir et al., 2024, Hu et al., 18 Feb 2025).
1. Pipeline Architecture and Modular Workflow
Word2World systems are founded on a multi-stage pipeline facilitating the translation of text into interactive world layouts or symbolic planning domains. The canonical procedure comprises four principal modules:
- Story Generation (LLM₁): Given a natural language prompt, an LLM constructs a multiline adventure narrative with explicit protagonists, antagonists, and an enumerated list of objectives.
- Information Extraction (LLM₂–LLM₇): Successive LLM calls parse the story to extract:
  - Characters and their descriptions
  - A tileset mapping narrative elements to environmental/platform tiles and interactive objects
  - Explicit objectives
  - Classification of tiles as important, walkable, or object tiles
- World Assembly (LLM₈–LLM₉, Multi-round): Iteratively generates alphanumeric grid layouts (terrain only, then placement of actors, objectives, and key objects) with post-processing for uniformity and singleton enforcement; coherence and playability are scored at each round, enabling feedback-driven refinement.
- Tile Retrieval: Embedding-based matching (using DistilBERT + cosine similarity) to convert narrative tile concepts into concrete sprites for playable map generation.
Optional extensions include agent-based playability evaluation and automatic coherence rating via an LLM judge (Nasir et al., 2024).
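The four-stage workflow above can be sketched as a single orchestration loop. The following is a minimal, hedged illustration with a stubbed LLM call; all function and field names are hypothetical, not the paper's actual API:

```python
# Hedged sketch of the Word2World multi-stage pipeline; `call_llm` is a stub
# standing in for a real LLM API, and all names here are illustrative.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; a real system would query a chat model here."""
    return f"[LLM output for: {prompt[:30]}...]"

@dataclass
class WorldState:
    story: str = ""
    characters: str = ""
    tileset: str = ""
    objectives: str = ""
    grids: list = field(default_factory=list)

def word2world(prompt: str, rounds: int = 3) -> WorldState:
    state = WorldState()
    # Stage 1: story generation (LLM_1)
    state.story = call_llm(f"Write an adventure story for: {prompt}")
    # Stage 2: information extraction (LLM_2..LLM_7), one call per facet
    state.characters = call_llm(f"Extract characters from: {state.story}")
    state.tileset = call_llm(f"Derive a tileset legend from: {state.story}")
    state.objectives = call_llm(f"List explicit objectives in: {state.story}")
    # Stage 3: multi-round world assembly (LLM_8, LLM_9) with feedback
    feedback = ""
    for r in range(rounds):
        grid = call_llm(f"Assemble round-{r} grid. Legend: {state.tileset}. "
                        f"Previous feedback: {feedback}")
        feedback = call_llm(f"Score coherence/playability of: {grid}")
        state.grids.append(grid)
    return state

state = word2world("a frozen mountain fortress")
print(len(state.grids))  # → 3, one grid per assembly round
```

The key design point is that each stage consumes the outputs of earlier stages, so the pipeline degrades gracefully when any single extraction is weak.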
2. Algorithmic Procedures and Mathematical Formulations
Key algorithms in Word2World, set out in formal pseudocode and LaTeX, include:
- Cosine Similarity (Tile Retrieval): $\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$, where $\mathbf{u}$ embeds a narrative tile concept and $\mathbf{v}$ a candidate sprite.
- Accuracy Metrics: per-category placement accuracy, $\mathrm{Acc} = \frac{|\text{correctly placed elements}|}{|\text{elements}|}$, reported separately for characters and important tiles.
- Novelty (Grid Distance): a normalized distance between tile grids, e.g. $d(G_1, G_2) = \frac{1}{|G|} \sum_{i,j} \mathbb{1}\!\left[ G_1(i,j) \neq G_2(i,j) \right]$, with a world counted as novel when $d$ exceeds a threshold.
- Agent Reward Function: a per-episode reward for the playtesting agent, aggregating objective completion and path progress.
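The tile-retrieval step reduces to nearest-neighbor search under cosine similarity. A minimal sketch follows, using toy 4-d vectors in place of the DistilBERT embeddings the paper employs (the sprite names and vectors are illustrative):

```python
# Tile retrieval by cosine similarity; the real system embeds concepts and
# sprites with DistilBERT, while the toy vectors below are stand-ins.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve_sprite(concept_vec: np.ndarray, sprite_embeddings: dict) -> str:
    """Return the sprite whose embedding best matches the tile concept."""
    return max(sprite_embeddings,
               key=lambda name: cosine_similarity(concept_vec, sprite_embeddings[name]))

# Mock embeddings; in practice these come from a sentence encoder.
sprites = {
    "grass": np.array([0.9, 0.1, 0.0, 0.1]),
    "lava":  np.array([0.0, 0.9, 0.4, 0.0]),
    "water": np.array([0.1, 0.2, 0.9, 0.3]),
}
concept = np.array([0.8, 0.2, 0.1, 0.1])  # e.g., embedding of "meadow tile"
print(retrieve_sprite(concept, sprites))  # → grass
```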
The system can be instantiated for symbolic domains (e.g., Text2World), where an input natural-language planning-domain description $d$ is converted to a valid PDDL domain, formally $f: d \mapsto \mathcal{D} = \langle \mathcal{F}, \mathcal{A} \rangle$, with $\mathcal{F}$ the fluents and $\mathcal{A}$ the actions. Evaluation is multi-criteria, including executability, Levenshtein structural similarity, and component-wise F1 scores (Hu et al., 18 Feb 2025).
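The component-wise F1 score compares sets of extracted components (predicates, parameters, preconditions, effects) against a reference domain. A simplified sketch over predicate sets (the example predicates are illustrative, not taken from the benchmark):

```python
# Component-wise F1 between predicted and reference PDDL components,
# illustrated on predicate sets; a simplification of Text2World's scoring.
def f1(pred: set, gold: set) -> float:
    tp = len(pred & gold)          # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_predicates = {"(at ?x ?loc)", "(holding ?x)", "(clear ?loc)"}
pred_predicates = {"(at ?x ?loc)", "(holding ?x)", "(empty ?loc)"}
print(round(f1(pred_predicates, gold_predicates), 3))  # → 0.667
```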
3. Prompt Engineering, Parsing, and Feedback Mechanisms
Each module in the pipeline relies on standardized zero- or few-shot system/assistant prompts instructing the LLM to produce structured outputs (JSON, bullet lists, grid arrays, legends). Inputs cascade prior context (e.g., story, extracted character/object sets, previously generated grids, evaluation scores) across rounds. The system enables iterative feedback loops: after each assembly round, previously generated world grid and evaluation metrics (e.g., coherence rating, playability status, path lengths) are appended to the next prompt, fostering layout refinement and increased objective/task alignment (Nasir et al., 2024).
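The cascading-context mechanism amounts to prompt construction that embeds the prior grid and its evaluation into the next round's input. A hedged sketch (field names and phrasing are hypothetical, not the paper's actual prompts):

```python
# Sketch of cascading context across assembly rounds: each refinement prompt
# embeds the previous grid and its evaluation metrics. Names are illustrative.
import json

def build_round_prompt(story, legend, prev_grid=None, prev_eval=None):
    parts = [f"Story:\n{story}", f"Tile legend:\n{json.dumps(legend)}"]
    if prev_grid is not None:
        parts.append(f"Previous grid:\n{prev_grid}")
        parts.append(f"Previous evaluation:\n{json.dumps(prev_eval)}")
        parts.append("Refine the grid to improve coherence and playability.")
    else:
        parts.append("Produce an initial alphanumeric grid using the legend.")
    return "\n\n".join(parts)

legend = {"#": "wall", ".": "floor", "P": "protagonist", "G": "goal"}
p1 = build_round_prompt("A hero seeks a relic.", legend)
p2 = build_round_prompt("A hero seeks a relic.", legend,
                        prev_grid="###\n#P.\n..G",
                        prev_eval={"coherence": 74, "playable": True, "path_len": 3})
print("Previous grid" in p2 and "Previous grid" not in p1)  # → True
```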
In Text2World, chain-of-thought prompting enhances PDDL generation at temperature = 0, and parser error-looping allows correction over a bounded number of retries, improving executability and F1 scores (Hu et al., 18 Feb 2025).
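The error-looping pattern can be sketched as a retry loop that feeds parser diagnostics back into the next generation attempt; the toy generator and parser below are stand-ins for an LLM and a PDDL parser:

```python
# Hedged sketch of Text2World-style error looping: parse the generated output,
# feed parser errors back, and retry up to a bounded number of attempts.
def generate_with_repair(generate, parse, max_retries: int = 3):
    """`generate(error)` returns candidate text; `parse` raises ValueError on failure."""
    error = None
    for _ in range(max_retries + 1):
        candidate = generate(error)
        try:
            parse(candidate)
            return candidate
        except ValueError as e:
            error = str(e)  # appended to the next prompt in a real system
    return None

# Toy stand-ins: the "LLM" fixes its output once it sees an error message.
def fake_llm(error):
    return "(define (domain d))" if error else "(define (domain d)"  # unbalanced first try

def fake_parser(text):
    if text.count("(") != text.count(")"):
        raise ValueError("unbalanced parentheses")

print(generate_with_repair(fake_llm, fake_parser))  # → (define (domain d))
```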
4. Experimental Framework and Quantitative Evaluation
Evaluation spans both LLM-based and conventional PCG/PL metrics:
- LLM-Based: Coherence rating (0–100), LLM-Agent reward, completion rate.
- PCG Metrics: Playability (A* search for objective paths), path length, novelty (grid distance against a threshold), successful completion within ≤10 tries.
- Planning Domains: Executability, Levenshtein similarity, F1 for predicates/parameters/preconditions/effects.
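The playability metric reduces to checking that every objective is reachable from the start tile. The paper uses A*; on a uniform-cost grid, BFS (A* with a zero heuristic) suffices to illustrate the check. The tile symbols below are illustrative:

```python
# Playability check via shortest-path search on an alphanumeric grid.
# BFS is equivalent to A* with a zero heuristic on uniform-cost grids.
from collections import deque

def shortest_path_len(grid, start, goal, walkable={".", "P", "G"}):
    rows, cols = len(grid), len(grid[0])
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), d = frontier.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] in walkable and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), d + 1))
    return None  # goal unreachable → world not playable

world = ["P..#",
         ".#.#",
         "...G"]
print(shortest_path_len(world, (0, 0), (2, 3)))  # → 5
```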
Representative results from (Nasir et al., 2024):
| Ablation Variant | Playability (%) | Character Acc. (%) | Important-Tile Acc. (%) |
|---|---|---|---|
| Word2World (full, R=3) | ≈ 90 | ≈ 85 | ≈ 90 |
| No goals/important tiles | ≈ 20 | ≈ 50 | --- |
| One round (R=1) | ≈ 60 | --- | --- |
| Direct-generation | ≈ 10 | --- | --- |
In Text2World (Hu et al., 18 Feb 2025), an RL-trained DeepSeek reasoning model achieves the strongest executability, predicate F1, and effect F1 when allowed three correction attempts, exceeding non-reasoning LLM baselines such as GPT-4o on both executability and predicate F1. Ablations reveal that fully decomposed pipelines (multiple rounds + extractors) dominate direct generation in both playability and coherence.
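The Levenshtein-based structural similarity used in the Text2World evaluation normalizes edit distance by the longer string's length. A standard DP implementation (the normalization choice here is one plausible convention):

```python
# Normalized Levenshtein similarity between generated and reference domain text,
# one of Text2World's structural metrics. Standard two-row DP implementation.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("(move ?x ?from ?to)", "(move ?x ?a ?b)"))
```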
5. Connections to Symbolic and Visual World Models
Word2World interfaces with broader world modeling paradigms, including:
- WorldDreamer (“Words-to-World”): Employs VQGAN tokenization, masked visual token prediction, spatial-temporal Transformer (STPT), and multimodal language/action prompt fusion for visual video synthesis from text or image context (Wang et al., 2024). This paradigm similarly embodies a sequential token-mapping process, parallel decoding, classifier-free guidance, and supports text-to-video, image-to-video, video editing, and action-conditioned synthesis.
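The masked-token-prediction mechanism underlying WorldDreamer can be illustrated in miniature: mask a fraction of a discrete token sequence, then fill all masks in one pass. The trivial frequency-based "predictor" below is only a stand-in for the spatial-temporal Transformer over VQGAN tokens:

```python
# Toy illustration of masked visual-token prediction. The frequency-based
# filler is a stand-in; WorldDreamer predicts masks with an STPT in parallel.
from collections import Counter
import random

MASK = -1

def mask_tokens(tokens, ratio, rng):
    out = tokens[:]
    for i in rng.sample(range(len(tokens)), int(ratio * len(tokens))):
        out[i] = MASK
    return out

def predict_masked(tokens):
    """Fill every mask in parallel with the most frequent visible token."""
    visible = [t for t in tokens if t != MASK]
    fill = Counter(visible).most_common(1)[0][0]
    return [fill if t == MASK else t for t in tokens]

rng = random.Random(0)
seq = [3, 3, 7, 3, 5, 3, 7, 3]          # stand-in for VQGAN token ids
masked = mask_tokens(seq, 0.25, rng)    # hide 25% of tokens
print(predict_masked(masked))           # all masks filled in one pass
```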
- Symbolic Planning Models (Text2World): Benchmarks structured language→PDDL conversion, highlighting abstraction and dynamic modeling as core challenges. RL-based reasoning models currently surpass autoregressive LLMs in extracting and encoding action dynamics, types, and preconditions/effects from text (Hu et al., 18 Feb 2025).
6. Limitations, Ablation Insights, and Future Directions
Empirical studies demonstrate that omitting goal or important-tile extraction steps in Word2World leads to catastrophic degradation (playability ≈20%), and reducing multi-round assembly to a single round also substantially impairs world correctness. In symbolic domains, precondition and effect extraction remain bottlenecks for LLMs, with even the best models falling short on F1 for these components.
Ablation studies recommend full decomposed extraction, multi-round iterative assembly, and feedback loops for optimal performance. Test-time scaling (increasing correction attempts) and agent-based fine-tuning yield monotonic gains, implying resource allocation strategies merit further research. Across both visual and symbolic pipelines, end-to-end prompt orchestration and embedding retrieval are essential, and modularity enables transfer and adaptation.
Suggested future works include scaling context lengths, integrating temporal consistency losses, joint audio modeling, multimodal extension, and diffusion prior combinations for diversity in visual world generation (Wang et al., 2024); in symbolic domains, more concrete NL descriptions and agent-centric supervised datasets can further improve semantic fidelity (Hu et al., 18 Feb 2025).
7. Representative Example
A prototypical run in Word2World initiates with the prompt-driven generation of a story set in fantastical environments, enumerating clear objectives. Extractors parse tiles, characters, and objectives. Through 3 iterative rounds, a grid world is assembled, placing the starting position, antagonists, and important objects. Each step is algorithmically post-processed, and grid symbols are mapped to sprites via semantic embedding. Evaluations yield coherence 82/100, playable path lengths (e.g., 42), and agent rewards (e.g., 0.28 per episode). The final world integrates visual and narrative coherence, playable objectives, and aligns environmental features with story dynamics (Nasir et al., 2024).
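The novelty score reported for such runs can be sketched as a normalized grid distance between two assembled worlds; this Hamming-style formulation is one plausible instantiation, and the small grids are illustrative:

```python
# Hedged sketch of grid novelty as a normalized cell-wise (Hamming) distance
# between equally-sized tile grids; novel if the distance exceeds a threshold.
def grid_distance(a, b):
    cells = sum(len(row) for row in a)
    diff = sum(ca != cb
               for ra, rb in zip(a, b)
               for ca, cb in zip(ra, rb))
    return diff / cells

g1 = ["###", "#P.", "..G"]
g2 = ["###", ".P#", "..G"]
print(round(grid_distance(g1, g2), 3))  # → 0.222, i.e. 2 of 9 cells differ
```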
A plausible implication is that the Word2World paradigm constitutes a generalizable protocol for text-driven world assembly that can be adapted across game content generation, planning domain synthesis, and multimodal video modeling, contingent on the pipeline’s modular structure and zero-shot LLM adaptability.