GenEscape: Hierarchical Puzzle Generator

Updated 23 February 2026

GenEscape is a hierarchical, multi-agent framework that generates photorealistic 2D escape-room puzzles ensuring solvability and logical coherence.
It decomposes the generation process into stages like functional design, scene graph reasoning, layout synthesis, and local image editing.
Iterative agent feedback refines spatial layouts and affordance cues, substantially improving solvability and shortcut avoidance compared to baselines.

GenEscape is a hierarchical, multi-agent framework designed to generate photorealistic 2D escape-room puzzle images that are both visually coherent and logically solvable. Addressing the limitations of traditional text-to-image systems—particularly with respect to spatial relationships and affordance reasoning—GenEscape decomposes the generation process into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents interact through iterative feedback, ensuring that outputs meet strict criteria for solvability, shortcut avoidance, and affordance clarity while maintaining visual quality (Shan et al., 27 Jun 2025).

1. Problem Definition and Constraints

The objective in GenEscape is to create an escape-room image $I$ and an intended solution sequence $S = [a_1, \ldots, a_k]$ ( $k \leq \ell$ ), given (i) a scene-type keyword $T$ (e.g., "classroom"), (ii) a set of key objects $O = \{o_1, \dots, o_m\}$ , and (iii) an optional maximum solution length $\ell$ . The system must guarantee two conditions:

Solvability (C1): There exists a valid player action sequence of length at most $\ell$ involving the specified objects that unlocks the room's exit.
No Shortcuts (C2): Any alternative exploitation of scene affordances yields either the official solution $S$ or logically equivalent variants, preventing unintended bypasses.

Formally, if $S^*(R)$ denotes the solution inferred by a player from representation $R$ (text, graph, layout, image), GenEscape seeks $(I, S)$ such that:

$|S| \leq \ell$
$S^*(I) = S$
For all $S'$ reachable under $I$'s affordances, $S'$ is invalid or equivalent to $S$ in order and logic.
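The two conditions C1 and C2 can be illustrated with a small check. This is only a sketch under simplifying assumptions: the function name `satisfies_constraints`, the `is_valid` callback, and the toy action strings are hypothetical, and "logically equivalent" is reduced to exact equality of the action order.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Solution = Sequence[str]


def satisfies_constraints(
    official: Solution,
    candidates: Iterable[Solution],
    is_valid: Callable[[Solution], bool],
    max_len: Optional[int] = None,
) -> bool:
    """Return True when C1 and C2 both hold for the official solution."""
    official_list: List[str] = list(official)
    # C1: the official solution must be valid and within the length cap.
    if not is_valid(official_list):
        return False
    if max_len is not None and len(official_list) > max_len:
        return False
    # C2: every other valid candidate must coincide with the official one.
    # Assumption: equivalence is simplified to identical action order.
    return all(list(c) == official_list for c in candidates if is_valid(c))


# Toy room (hypothetical): the window offers an unintended way out.
official = ["take key", "unlock door"]
valid_sequences = {("take key", "unlock door"), ("climb out window",)}


def is_valid(seq: Solution) -> bool:
    return tuple(seq) in valid_sequences


shortcut_free = satisfies_constraints(
    official, [["climb out window"]], is_valid, max_len=3
)
```

Here `shortcut_free` is `False`, because the window sequence is valid yet differs from the official solution.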

2. Hierarchical Multi-Agent Architecture

GenEscape operationalizes puzzle generation via a four-stage pipeline, each stage handled by specialized GPT-4o-powered agents passing scene representations and critiques. The architecture comprises:

Stage	Agent(s)	Representation	Output
Functional Design	Designer	Text + Symbolic Scene Graph	Scene description $D$, scene graph $G$, solution $S$
Scene Graph Reasoning	Player, Examiner	Scene Graph $G$	Refined scene graph $G'$
Layout Synthesis	Builder, Player, Examiner	Iconic 2D Layout $L$	Verified sketch and placement refinement
Local Image Editing	Builder, Player, Examiner	Photorealistic Image $I$	Final image with local affordance edits

In sequence:

Functional Design: The Designer agent produces a textual room description, a candidate solution sequence, and an initial scene graph $G$ capturing objects and attachment relations.
Symbolic Scene Graph Reasoning: The Player agent solves $G$ to infer $S^*(G)$; the Examiner compares $S^*(G)$ to the official $S$ and, if discrepancies exist ( $S^*(G) \neq S$ ), instructs graph refinements to block shortcuts or enforce intended logic.
Layout Synthesis: The Builder generates a schematic side-view, icon-labeled sketch $L$ from $G$; Player/Examiner iteratively verify that this layout encodes $S$ and not unintended alternatives.
Local Image Editing: The Builder renders $L$ and $G$ to a 1024×1024 full-color image; Player/Examiner detect and correct weak affordances using localized diffusion-based or cross-attention edits, iterating until discrepancies are exhausted.

A shared algorithmic loop, realized in Algorithm 1 (LaTeX pseudocode), governs refinement cycles at every stage.
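The shape of such a critique-refinement loop can be sketched as follows. This is not the paper's Algorithm 1; it is a minimal illustration in which the agents are plain callables, and the names `refine`, `player`, `examiner`, `reviser`, and `max_rounds` are all assumptions made for the example. In the real system each role would be an LLM call.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple, TypeVar

R = TypeVar("R")  # any representation: text, graph, layout, or image
Solution = List[str]


def refine(
    rep: R,
    official: Solution,
    player: Callable[[R], Solution],
    examiner: Callable[[R, Solution, Solution], Optional[str]],
    reviser: Callable[[R, str], R],
    max_rounds: int = 5,
) -> Tuple[R, bool]:
    """Revise `rep` until the Player's inferred solution passes the Examiner.

    Returns the final representation and whether it was accepted.
    """
    for round_idx in range(max_rounds + 1):
        inferred = player(rep)
        critique = examiner(rep, inferred, official)
        if critique is None:  # no discrepancy: accept this representation
            return rep, True
        if round_idx < max_rounds:  # revision budget not yet exhausted
            rep = reviser(rep, critique)
    return rep, False


# Toy stand-ins (hypothetical): the representation is a set of constraints.
def toy_player(rep: FrozenSet[str]) -> Solution:
    if "window is barred" in rep:
        return ["take key", "unlock door"]
    return ["climb out window"]  # the shortcut a player would find first


def toy_examiner(rep, inferred, official) -> Optional[str]:
    return None if inferred == official else "window is barred"


def toy_reviser(rep: FrozenSet[str], critique: str) -> FrozenSet[str]:
    return rep | {critique}


final_rep, accepted = refine(
    frozenset(), ["take key", "unlock door"], toy_player, toy_examiner, toy_reviser
)
```

In this toy run the first round exposes the window shortcut, the reviser adds a blocking constraint, and the second round is accepted.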

3. Symbolic Scene Graph Reasoning

Scene graphs in GenEscape are formalized as $G = (V, E)$, where $V = \{r\} \cup O$ ( $r$: room root, $O$: objects), and edges $E \subseteq V \times V$ represent spatial or containment relationships (e.g., "key hanging from hook"). Graph refinement targets two constraints:

Reachability: Paths in $G$ must enable the intended action sequence $S$.
Shortcut Elimination: Alternate graph paths should not permit unintended solution sequences $S' \neq S$.

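The multi-agent refinement follows the loop outlined in Section 2: the Player infers a solution from the graph, the Examiner compares it with the official solution, and any discrepancy triggers a graph edit. Typical graph edits include adding/removing edges, adjusting parent-child relations, or annotating affordance constraints (e.g., "hook is firmly attached," "desk cannot be climbed").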
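A scene graph of this kind can be represented as a list of labeled edges, and the reachability constraint checked with a breadth-first traversal from the room root. The sketch below is an assumption-laden illustration: the edge triple layout, the relation labels, and the function names are hypothetical. It covers only structural reachability and does not model shortcut elimination.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (parent, relation, child)


def reachable(root: str, edges: Iterable[Edge]) -> Set[str]:
    """Collect every node reachable from the room root via parent-child edges."""
    children: Dict[str, List[str]] = {}
    for parent, _relation, child in edges:
        children.setdefault(parent, []).append(child)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def unreachable_objects(
    root: str, edges: Iterable[Edge], required: Iterable[str]
) -> List[str]:
    """List the required objects that the graph never attaches to the room."""
    found = reachable(root, edges)
    return [obj for obj in required if obj not in found]


# Toy graph (hypothetical): the ladder is needed but never placed in the room.
edges = [
    ("room", "contains", "wall"),
    ("wall", "holds", "hook"),
    ("hook", "holds", "key"),
    ("room", "contains", "door"),
]
missing = unreachable_objects("room", edges, ["key", "door", "ladder"])
```

A non-empty `missing` list is the kind of discrepancy that would prompt an edge to be added during refinement.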

4. Layout Synthesis and Visual Refinement

Layout synthesis tasks the Builder with producing a minimalist, 2D side-view sketch ( $L$ ) using textual prompts to GPT-4o; no low-level diffusion model is directly invoked. Layouts employ ASCII- or emoji-style icons, spatially arranged as dictated by $G$. Iterative Player/Examiner loops verify that $L$ encodes only the intended solution.
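To give a feel for what an icon-based side-view layout looks like, the sketch below renders object placements onto a character grid. It is a hand-written stand-in, not the Builder: in GenEscape the layout comes from GPT-4o prompts, and the function name, the icon letters, and the coordinate convention (column, row, with row 0 at the top) are assumptions for illustration.

```python
from typing import Dict, Tuple


def render_layout(
    placements: Dict[str, Tuple[int, int]],
    icons: Dict[str, str],
    width: int,
    height: int,
) -> str:
    """Draw each object's one-character icon at its (column, row) cell."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for name, (col, row) in placements.items():
        if not (0 <= col < width and 0 <= row < height):
            raise ValueError(f"{name} is placed outside the layout")
        grid[row][col] = icons[name]
    return "\n".join("".join(cells) for cells in grid)


# Toy layout (hypothetical): key hangs directly below the hook, door at floor level.
icons = {"hook": "H", "key": "K", "door": "D"}
placements = {"hook": (1, 0), "key": (1, 1), "door": (3, 2)}
sketch = render_layout(placements, icons, width=4, height=3)
```

Placing the key in the cell beneath the hook is how a spatial relation from the scene graph ("key hanging from hook") would show up in such a sketch.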

In image rendering, the Builder leverages GPT-4o’s "text-to-image" multimodal API to synthesize a photorealistic 1024×1024 image from $L$ and $G$. If Player/Examiner identify ambiguous or insufficient affordance cues (e.g., key shape unclear, color mismatches), the Examiner issues local edit prompts (e.g., "enlarge the hook’s curve"), and the Builder performs region-specific updates via diffusion or cross-attention.

No explicit loss function is used; optimization is governed by the multi-agent critique-refinement loop at every abstraction level.

5. Quantitative Evaluation

GenEscape’s efficacy is assessed using the following metrics (mean percentages over 15 test scenes, averaged over 10 human annotators):

Solvability Rate (Solv): Fraction of cases where players infer the official $S$.
Shortcut Avoidance (Short): Fraction of cases where unintended solutions are blocked.
Spatial Alignment (Align): Fidelity to the input graph.
LongCLIP Score (CLIP): Semantic image-text similarity.
#Gen Calls: Average number of image generation API calls per scene.
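The rate-style metrics above can be computed from per-annotator judgments as in the sketch below. The aggregation order (average over annotators within a scene, then over scenes) is an assumption, as are the function name and the toy data; the source states only that results are mean percentages over scenes and annotators.

```python
from statistics import mean
from typing import Sequence


def rate_percent(judgments: Sequence[Sequence[bool]]) -> float:
    """Mean success rate in percent, where judgments[scene][annotator] is a bool."""
    per_scene = [
        mean(1.0 if passed else 0.0 for passed in scene) for scene in judgments
    ]
    return 100.0 * mean(per_scene)


# Toy data (hypothetical): two scenes, two annotators each.
solv = rate_percent([[True, False], [True, True]])
```

With equal numbers of annotators per scene, this gives the same value as pooling all judgments into a single average.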

Selected results (Table 1):

System Pipeline	Solv (%)	Short (%)	Align (%)	CLIP	#Gen
Vanilla GPT-4o	3.3	0.0	N/A	—	—
+Description only	6.7	3.3	N/A	—	—
+Description+Scene Graph	6.7	13.3	26.7	—	—
+…+Layout	10.0	20.0	13.3	—	13.2
+…+Image editing	20.0	16.7	23.3	—	15.8
GenEscape (all stages)	53.3	46.6	36.7	0.32	4.5

GenEscape achieves substantial improvements in both solvability and shortcut avoidance versus baseline approaches, while reducing the mean number of image generation calls due to early-stage logical verification (Shan et al., 27 Jun 2025).

6. Contributions, Advantages, and Limitations

Hierarchical decomposition allows GenEscape to enforce puzzle solvability, visual coherence, and affordance clarity via stage-wise specialization and agent-based critique-refinement. The iterative Player-Examiner feedback loop at each abstraction layer ensures solution chains are uniquely encoded by scene affordances prior to computationally expensive rendering.

Key limitations include:

Support is restricted to "fully visible" objects; hidden-object puzzles (e.g., requiring interaction with drawers) are not represented.
Solution sequences longer than eight steps or involving more than eight objects challenge convergence, with scene graph and layout refinement slowing or failing.
Dynamic scene progression (e.g., showing intermediate states after actions) is not supported, limited by GPT-4o’s object-level editing fidelity.

These factors delimit the system’s current scope, but the structured, multi-agent, stage-wise approach demonstrates the viability of bridging text-to-image pipelines with logically coherent, functionally valid escape-room puzzle generation (Shan et al., 27 Jun 2025).


References (1)

GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles (2025)
