GenEscape: Hierarchical Puzzle Generator

Updated 23 February 2026
  • GenEscape is a hierarchical, multi-agent framework that generates photorealistic 2D escape-room puzzles ensuring solvability and logical coherence.
  • It decomposes the generation process into stages like functional design, scene graph reasoning, layout synthesis, and local image editing.
  • Iterative agent feedback refines spatial layouts and affordance cues, substantially improving solvability and shortcut avoidance compared to baselines.

GenEscape is a hierarchical, multi-agent framework designed to generate photorealistic 2D escape-room puzzle images that are both visually coherent and logically solvable. Addressing the limitations of traditional text-to-image systems—particularly with respect to spatial relationships and affordance reasoning—GenEscape decomposes the generation process into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents interact through iterative feedback, ensuring that outputs meet strict criteria for solvability, shortcut avoidance, and affordance clarity while maintaining visual quality (Shan et al., 27 Jun 2025).

1. Problem Definition and Constraints

The objective in GenEscape is to create an escape-room image $I$ and an intended solution sequence $S = [a_1, \ldots, a_k]$ ($k \leq \ell$), given (i) a scene-type keyword $T$ (e.g., "classroom"), (ii) a set of key objects $O = \{o_1, \dots, o_m\}$, and (iii) an optional maximum solution length $\ell$. The system must guarantee two conditions:

  • Solvability (C1): There exists a valid player action sequence of length at most $\ell$ involving the specified objects that unlocks the room's exit.
  • No Shortcuts (C2): Any alternative exploitation of scene affordances yields either the official solution $S$ or logically equivalent variants, preventing unintended bypasses.

Formally, if $S^*(R)$ denotes the solution inferred by a player from representation $R$ (text, graph, layout, image), GenEscape seeks $(I, S)$ such that:

  • $S^*(I) = S$
  • $|S| \leq \ell$
  • For all $S'$ reachable under $I$'s affordances, $S'$ is invalid or equivalent to $S$ in order and logic.
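
Conditions C1 and C2 can be read as a check over candidate action sequences. The following is a minimal sketch under stated assumptions, not the paper's implementation: `is_valid` and `is_equivalent` are hypothetical callbacks standing in for the agents' judgments, and the candidate sequences are assumed to be enumerated elsewhere.

```python
from typing import Callable, Iterable, List

Solution = List[str]  # a sequence of action names (hypothetical encoding)


def satisfies_constraints(
    official: Solution,
    candidates: Iterable[Solution],
    max_len: int,
    is_valid: Callable[[Solution], bool],
    is_equivalent: Callable[[Solution, Solution], bool],
) -> bool:
    """Check C1 (solvability) and C2 (no shortcuts) for one puzzle."""
    # C1: the official solution must work and fit the length budget.
    if len(official) > max_len or not is_valid(official):
        return False
    # C2: every other working sequence must be equivalent to the official one.
    for cand in candidates:
        if is_valid(cand) and not is_equivalent(cand, official):
            return False
    return True


# Toy data, made up for illustration: two sequences open the exit.
WORKING = {("take key", "unlock door"), ("break window",)}


def valid(seq: Solution) -> bool:
    return tuple(seq) in WORKING


def same(a: Solution, b: Solution) -> bool:
    return list(a) == list(b)


official = ["take key", "unlock door"]
print(satisfies_constraints(official, [], 3, valid, same))                    # True
print(satisfies_constraints(official, [["break window"]], 3, valid, same))    # False
```

The second call fails because "break window" is a working sequence that is not equivalent to the official one, which is exactly the kind of bypass C2 rules out.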

2. Hierarchical Multi-Agent Architecture

GenEscape operationalizes puzzle generation via a four-stage pipeline, each stage handled by specialized GPT-4o-powered agents passing scene representations and critiques. The architecture comprises:

| Stage | Agent(s) | Representation | Output |
|---|---|---|---|
| Functional Design | Designer | Text + Symbolic Scene Graph | Scene description, initial scene graph $G$, solution $S$ |
| Scene Graph Reasoning | Player, Examiner | Scene Graph $G$ | Refined scene graph $G$ |
| Layout Synthesis | Builder, Player, Examiner | Iconic 2D Layout $L$ | Verified sketch and placement refinement |
| Local Image Editing | Builder, Player, Examiner | Photorealistic Image $I$ | Final image with local affordance edits |

In sequence:

  1. Functional Design: The Designer agent produces a textual room description, a candidate solution sequence $S$, and an initial scene graph $G$ capturing objects and attachment relations.
  2. Symbolic Scene Graph Reasoning: The Player agent solves $G$ to infer $S^*(G)$; the Examiner compares $S^*(G)$ to the official $S$ and, if discrepancies exist ($S^*(G) \neq S$), instructs graph refinements to block shortcuts or enforce intended logic.
  3. Layout Synthesis: The Builder generates a schematic side-view, icon-labeled sketch $L$ from $G$; Player/Examiner iteratively verify that this layout encodes $S$ and not unintended alternatives.
  4. Local Image Editing: The Builder renders $G$ and $L$ to a 1024×1024 full-color image; Player/Examiner detect and correct weak affordances using localized diffusion-based or cross-attention edits, iterating until discrepancies are exhausted.

A shared refinement loop, given as Algorithm 1 in the paper, governs the critique-and-revise cycles at every stage.
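
The shared loop can be sketched generically over any representation (graph, layout, or image). This is an illustrative reading of the description above, not a transcription of Algorithm 1: the `player`, `examiner`, and `apply_edit` callables are hypothetical stand-ins for the GPT-4o agents, and the `max_rounds` cap is an assumption added so the sketch always terminates.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

R = TypeVar("R")  # any representation: scene graph, layout, or image


def refine(
    rep: R,
    official: List[str],
    player: Callable[[R], List[str]],
    examiner: Callable[[R, List[str], List[str]], str],
    apply_edit: Callable[[R, str], R],
    max_rounds: int = 5,  # assumed cap, not taken from the source
) -> Tuple[R, bool]:
    """Critique-and-revise until the inferred solution matches the official one."""
    for _ in range(max_rounds):
        inferred = player(rep)                       # Player solves the representation
        if inferred == official:
            return rep, True                         # no discrepancy left
        critique = examiner(rep, inferred, official) # Examiner explains the mismatch
        rep = apply_edit(rep, critique)              # revise and try again
    return rep, player(rep) == official


# Toy agents, made up for illustration.
OFFICIAL = ["take key", "unlock door"]


def toy_player(graph: Dict[str, bool]) -> List[str]:
    # The toy player takes a shortcut whenever the desk can be climbed.
    return ["climb desk", "exit via vent"] if graph["desk_climbable"] else OFFICIAL


def toy_examiner(graph: Dict[str, bool], inferred: List[str], official: List[str]) -> str:
    return "desk cannot be climbed"


def toy_apply_edit(graph: Dict[str, bool], critique: str) -> Dict[str, bool]:
    if critique == "desk cannot be climbed":
        return {**graph, "desk_climbable": False}
    return graph


final, ok = refine({"desk_climbable": True}, OFFICIAL, toy_player, toy_examiner, toy_apply_edit)
print(final, ok)  # {'desk_climbable': False} True
```

Keeping the loop independent of the representation type mirrors the idea that the same Player/Examiner cycle is reused at each stage, with only the representation and the edit operation changing.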

3. Symbolic Scene Graph Reasoning

Scene graphs in GenEscape are formalized as $G = (V, E)$, where $V = \{r\} \cup O$ ($r$: room root, $O$: objects), and edges $E \subseteq V \times V$ represent spatial or containment relationships (e.g., "key hanging from hook"). Graph refinement targets two constraints:

  • Reachability: Paths in $G$ must enable the intended action sequence $S$.
  • Shortcut Elimination: Alternate graph paths should not permit unintended solution sequences $S' \neq S$.

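The multi-agent refinement follows the shared Player/Examiner loop: the Player infers a solution from the graph, the Examiner compares it with the official solution, and the graph is edited until the two agree. Typical graph edits include adding/removing edges, adjusting parent-child relations, or annotating affordance constraints (e.g., "hook is firmly attached," "desk cannot be climbed").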
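
A rooted scene graph with the edit operations listed above can be sketched as a small data structure. This is a minimal illustration under assumptions: the class name `SceneGraph`, its methods, and the reading of "reachability" as "every object the solution uses is connected to the room root" are hypothetical simplifications, not the paper's definitions.

```python
from collections import deque
from typing import Dict, Iterable, Set


class SceneGraph:
    """Rooted graph: edges point from a parent to the objects attached to it."""

    def __init__(self, root: str = "room") -> None:
        self.root = root
        self.children: Dict[str, Set[str]] = {}
        self.notes: Dict[str, str] = {}  # affordance annotations per object

    def attach(self, parent: str, child: str) -> None:
        self.children.setdefault(parent, set()).add(child)

    def detach(self, parent: str, child: str) -> None:
        self.children.get(parent, set()).discard(child)

    def annotate(self, obj: str, note: str) -> None:
        self.notes[obj] = note

    def reachable(self) -> Set[str]:
        # Breadth-first search from the room root.
        seen, queue = {self.root}, deque([self.root])
        while queue:
            node = queue.popleft()
            for child in self.children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    def supports(self, needed: Iterable[str]) -> bool:
        # Simplified reachability test for the objects a solution uses.
        return set(needed) <= self.reachable()


g = SceneGraph()
g.attach("room", "wall")
g.attach("wall", "hook")
g.attach("hook", "key")   # "key hanging from hook"
g.attach("room", "door")
g.annotate("desk", "cannot be climbed")

print(g.supports(["key", "door"]))  # True
g.detach("wall", "hook")            # an edit that breaks the intended path
print(g.supports(["key", "door"]))  # False
```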

4. Layout Synthesis and Visual Refinement

Layout synthesis tasks the Builder with producing a minimalist 2D side-view sketch $L$ using textual prompts to GPT-4o; no low-level diffusion model is directly invoked. Layouts employ ASCII- or emoji-style icons, spatially arranged as dictated by $G$. Iterative Player/Examiner loops verify that $L$ encodes only the intended solution.
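
To make the idea of an icon-based side-view layout concrete, the sketch below renders a small ASCII grid from object placements. It only illustrates what such a layout might look like: in GenEscape the sketch is produced by prompting GPT-4o, and the function `render_layout`, the single-letter icons, and the grid coordinates are all hypothetical.

```python
from typing import Dict, Tuple


def render_layout(
    placements: Dict[str, Tuple[int, int]],
    width: int,
    height: int,
    empty: str = ".",
) -> str:
    """Draw single-character icons on a grid; row 0 is the top of the room."""
    grid = [[empty] * width for _ in range(height)]
    for icon, (col, row) in placements.items():
        if not (0 <= col < width and 0 <= row < height):
            raise ValueError(f"{icon!r} placed outside the {width}x{height} grid")
        grid[row][col] = icon
    return "\n".join("".join(line) for line in grid)


# H = hook, K = key (hanging below the hook), T = table, D = door.
layout = render_layout({"H": (1, 0), "K": (1, 1), "T": (0, 2), "D": (4, 2)}, width=5, height=3)
print(layout)
# .H...
# .K...
# T...D
```

Placing the key directly beneath the hook is one way a layout could encode the "hanging from" relation spatially, so that a Player reading the sketch can infer the same affordance as from the graph.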

In image rendering, the Builder leverages GPT-4o’s "text-to-image" multimodal API to synthesize a photorealistic 1024×1024 image from $G$ and $L$. If Player/Examiner identify ambiguous or insufficient affordance cues (e.g., key shape unclear, color mismatches), the Examiner issues local edit prompts (e.g., "enlarge the hook’s curve"), and the Builder performs region-specific updates via diffusion or cross-attention.

No explicit loss function is used; optimization is governed by the multi-agent critique-refinement loop at every abstraction level.

5. Quantitative Evaluation

GenEscape’s efficacy is assessed using the following metrics (mean percentages over 15 test scenes, averaged by 10 human annotators):

  • Solvability Rate (Solv): Fraction of scenes where players infer the official $S$.
  • Shortcut Avoidance (Short): Fraction where unintended solutions are blocked.
  • Spatial Alignment (Align): Fidelity to the input graph.
  • LongCLIP Score (CLIP): Semantic image-text similarity.
  • #Gen Calls: Average number of image generation API calls per scene.
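
The percentage metrics above are simple fractions over binary judgments. The sketch below shows that arithmetic; the helper `rate_percent` and the judgment lists are hypothetical and made up for illustration, not data from the paper.

```python
from typing import Sequence


def rate_percent(judgments: Sequence[bool]) -> float:
    """Share of positive judgments, as a percentage rounded to one decimal."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return round(100.0 * sum(judgments) / len(judgments), 1)


# Hypothetical judgments: True means the player inferred the official solution.
solved = [True] * 16 + [False] * 14
print(rate_percent(solved))                  # 53.3
print(rate_percent([True] + [False] * 29))   # 3.3
```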

Selected results (Table 1):

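| System Pipeline | Solv (%) | Short (%) | Align (%) | CLIP | #Gen |
|---|---|---|---|---|---|
| Vanilla GPT-4o | 3.3 | 0.0 | N/A | - | - |
| +Description only | 6.7 | 3.3 | N/A | - | - |
| +Description+Scene Graph | 6.7 | 13.3 | 26.7 | - | - |
| +…+Layout | 10.0 | 20.0 | 13.3 | - | 13.2 |
| +…+Image editing | 20.0 | 16.7 | 23.3 | - | 15.8 |
| GenEscape (all stages) | 53.3 | 46.6 | 36.7 | 0.32 | 4.5 |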

GenEscape achieves substantial improvements in both solvability and shortcut avoidance versus baseline approaches, while reducing the mean number of image generation calls due to early-stage logical verification (Shan et al., 27 Jun 2025).

6. Contributions, Advantages, and Limitations

Hierarchical decomposition allows GenEscape to enforce puzzle solvability, visual coherence, and affordance clarity via stage-wise specialization and agent-based critique-refinement. The iterative Player-Examiner feedback loop at each abstraction layer ensures solution chains are uniquely encoded by scene affordances prior to computationally expensive rendering.

Key limitations include:

  • Support is restricted to "fully visible" objects; hidden-object puzzles (e.g., requiring interaction with drawers) are not represented.
  • Solution sequences longer than eight steps or involving more than eight objects challenge convergence, with scene graph and layout refinement slowing or failing.
  • Dynamic scene progression (e.g., showing intermediate states after actions) is not supported, limited by GPT-4o’s object-level editing fidelity.

These factors delimit the system’s current scope, but the structured, multi-agent, stage-wise approach demonstrates the viability of bridging text-to-image pipelines with logically coherent, functionally valid escape-room puzzle generation (Shan et al., 27 Jun 2025).
