Auto-Regressive Tile Generation
- Auto-regressive tile generation is a sequential method that builds grid-structured content one tile at a time using local or global context, enabling high diversity and procedural flexibility.
- Recent advances integrate pretrained diffusion models and supervised convolutional networks to perform boundary inpainting, achieving improved metrics such as higher CLIP-IQA scores and semantic consistency.
- Methodologies span continuous image synthesis and discrete game level repair (PoD), where trade-offs between generation speed and boundary fidelity are critical for optimal performance.
Auto-regressive tile generation encompasses a class of methodologies for synthesizing grid-structured content—such as 2D images or game levels—by generating tiles or cells one at a time, with each new tile conditioned on a local or global context of previously generated tiles. This paradigm enables flexible procedural generation, high diversity, and content-aware coherence, and has become central in texture synthesis, level generation, and inpainting tasks. Recent advances leverage pretrained diffusion models and supervised convolutional networks within rigorously defined auto-regressive factorization frameworks, supporting sophisticated tiling schemes and diversity control in both continuous (images) and discrete (tile types) domains (Sartor et al., 2024, Siper et al., 2022).
1. Formalization and General Principles
Auto-regressive tile generation models the synthesis of a grid as a sequential process in which the content of each tile is generated conditioned on the current partial state of the grid. Denoting the set of possible tile values or types by $\mathcal{V}$, and the partial grid at step $t$ by $G_t$, tile generation over the $T$ tiles of the grid is factored as
$$p(G_T) = \prod_{t=1}^{T} p\bigl(v_t \mid c(G_{t-1}, \ell_t)\bigr), \qquad v_t \in \mathcal{V},$$
where $\ell_t$ is the location of the tile to be generated, and $c(G_{t-1}, \ell_t)$ extracts the relevant context around location $\ell_t$ (local patch, boundary, or other conditioning information). This factorization underpins both image-based and discrete tile-based instantiations (Sartor et al., 2024, Siper et al., 2022).
In image-to-image applications, the context typically includes the boundaries or neighboring pixel strips from previously filled tiles, potentially padded by a context width $w$. For discrete tiles, context is usually a $k \times k$ window centered on the target tile, one-hot encoded.
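The factorization can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: `extract_context` plays the role of the context extractor, `generate_grid` runs a row-major scan, and `prefer_left_neighbor` is a toy conditional distribution, not a model from either cited paper.

```python
import math
import random

def extract_context(grid, row, col, radius, pad=None):
    """Return the (2*radius+1)^2 window around (row, col), flattened row by row.
    Cells outside the grid use `pad`; cells not generated yet are still None."""
    h, w = len(grid), len(grid[0])
    return tuple(
        grid[r][c] if 0 <= r < h and 0 <= c < w else pad
        for r in range(row - radius, row + radius + 1)
        for c in range(col - radius, col + radius + 1)
    )

def generate_grid(height, width, tile_values, conditional, radius=1, seed=0):
    """Fill the grid in row-major order, sampling each tile from conditional(context)."""
    rng = random.Random(seed)
    grid = [[None] * width for _ in range(height)]  # None marks "not yet generated"
    for row in range(height):
        for col in range(width):
            context = extract_context(grid, row, col, radius)
            weights = conditional(context, tile_values)
            grid[row][col] = rng.choices(tile_values, weights=weights, k=1)[0]
    return grid

def prefer_left_neighbor(context, tile_values):
    """Toy conditional (an assumption): favour repeating the left neighbour's value."""
    side = math.isqrt(len(context))
    center = (side // 2) * side + side // 2
    left = context[center - 1]
    return [5.0 if value == left else 1.0 for value in tile_values]

grid = generate_grid(4, 6, ["grass", "water", "rock"], prefer_left_neighbor)
for row in grid:
    print(" ".join(row))
```

Swapping `prefer_left_neighbor` for a learned model changes the conditional but leaves the sequential loop untouched, which is the modularity the factorization expresses.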
2. Diffusion-Based Auto-Regressive Image Tile Generation
The content-aware tile generation framework introduced by "Content-aware Tile Generation using Exterior Boundary Inpainting" (Sartor et al., 2024) realizes auto-regressive tile synthesis as a sequence of masked inpainting problems, leveraging pretrained, text-conditioned diffusion models without retraining. An $m \times n$ grid of $S \times S$ pixel tiles is synthesized in a fixed scan order (e.g., row by row). For each step $t$:
- Known boundary pixels from synthesized neighbors (left and above) are assembled.
- The boundary context is extended outward by a width $w$.
- A binary mask $M$ is set to select the $S \times S$ interior of the target tile.
- A pretrained diffusion inpainting model (e.g., Stable Diffusion Inpainting U-Net) is called with the tuple $(x_B, M, y)$ of boundary image, mask, and text prompt to synthesize the interior.
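Formally, per-tile generation samples from the conditional distribution
$$x_I \sim p_\theta(x_I \mid x_B, M, y),$$
where $x_I$ is the unknown interior, $x_B$ is the fixed boundary, $M$ is the binary interior mask, and $y$ is the text prompt. The generation objective is
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, \tau}\bigl[\lVert \epsilon - \epsilon_\theta(x_\tau, \tau, M, x_B, y) \rVert_2^2\bigr],$$
where $\epsilon \sim \mathcal{N}(0, I)$ and $x_\tau$ is the noised sample at diffusion step $\tau$, and the classifier-free guidance term on the prompt $y$ is imposed via the U-Net. Sampling proceeds along a reverse diffusion chain, with boundary pixels hard-clipped at every denoising step. No finetuning of the pretrained model is performed.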
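The hard-clipping of boundary pixels can be sketched in plain Python. This is a toy under stated assumptions: `smooth_step` is a hypothetical stand-in for one reverse-diffusion step (it only averages neighbours), and the sketch overwrites known pixels with their clean values, whereas a real pipeline would operate in latent space and may re-noise the known region to the current noise level.

```python
import random

def smooth_step(x, step):
    """Toy stand-in for one denoising step: average each pixel with its 4-neighbours."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            values = [x[r][c]]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    values.append(x[rr][cc])
            out[r][c] = sum(values) / len(values)
    return out

def inpaint_with_hard_clipping(known, mask, denoise_step, num_steps=40, seed=0):
    """mask[r][c] == 1 marks pixels to synthesize; 0 marks fixed boundary pixels."""
    rng = random.Random(seed)
    h, w = len(known), len(known[0])
    x = [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]  # start from noise
    for step in reversed(range(num_steps)):
        x = denoise_step(x, step)
        # Hard clip: every known pixel is reset to its given value after each step.
        x = [[x[r][c] if mask[r][c] else known[r][c] for c in range(w)]
             for r in range(h)]
    return x

# 6x6 window: boundary strips of width 2 on the top and left, 4x4 unknown interior.
size, width = 6, 2
mask = [[1 if r >= width and c >= width else 0 for c in range(size)] for r in range(size)]
known = [[0.0 if mask[r][c] else 1.0 for c in range(size)] for r in range(size)]
result = inpaint_with_hard_clipping(known, mask, smooth_step)
```

Because the clip runs after every step, the interior is repeatedly pulled toward values consistent with the fixed boundary, while the boundary itself is never altered.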
3. Model-Free Approaches for Discrete Tile Generation
The Path of Destruction (PoD) method described in "Path of Destruction: Learning an Iterative Level Generator Using a Small Dataset" (Siper et al., 2022) adapts auto-regressive tile generation to discrete 2D level layouts. Here, a convolutional network is trained on (observation, action) pairs produced by iteratively "destroying" designer levels (mutating one tile at a time toward random noise) and supervising the model to "repair" each tile based on a $k \times k$ patch centered on it.
Formally, levels are represented as grids $L \in \mathcal{V}^{H \times W}$ of discrete tile types from the tile set $\mathcal{V}$. Training proceeds by generating destruction trajectories:
- For each (goal, start) pair, repeatedly copy StartLevel tiles into GoalLevel, at each step recording the resulting (observation, action) pair.
- The network is trained to predict the correct tile at a given location, conditional on its local context.
At generation time, PoD samples a random starting level and repairs it tile by tile, sampling from the predicted distribution $p_\theta(v \mid c)$ or using its $\arg\max$. Output levels are optionally filtered for domain-specific constraints such as playability.
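The destruction and repair procedure can be sketched as follows. The helper names (`patch`, `destruction_trajectory`, `repair`) are hypothetical. Two details are assumptions rather than facts from the paper: the observation is taken after the tile has been destroyed, and repair visits cells in row-major order. The `predict` callable stands in for the trained network.

```python
import random

def patch(level, row, col, radius, pad="#"):
    """Flattened (2*radius+1)^2 window around (row, col); `pad` fills out-of-level cells."""
    h, w = len(level), len(level[0])
    return tuple(
        level[r][c] if 0 <= r < h and 0 <= c < w else pad
        for r in range(row - radius, row + radius + 1)
        for c in range(col - radius, col + radius + 1)
    )

def destruction_trajectory(goal, start, radius=1, seed=0):
    """Destroy `goal` one tile at a time toward `start`, recording training pairs."""
    rng = random.Random(seed)
    level = [list(row) for row in goal]
    cells = [(r, c) for r in range(len(goal)) for c in range(len(goal[0]))]
    rng.shuffle(cells)
    pairs = []
    for r, c in cells:
        action = level[r][c]       # the goal tile is the correct repair for this cell
        level[r][c] = start[r][c]  # destroy: copy the start (noise) tile in
        pairs.append((patch(level, r, c, radius), action))
    return pairs, level

def repair(start, predict, radius=1):
    """Repair a level tile by tile; `predict` maps a patch to a tile type."""
    level = [list(row) for row in start]
    for r in range(len(level)):
        for c in range(len(level[0])):
            level[r][c] = predict(patch(level, r, c, radius))
    return level

goal = ["--X", "-X-", "X--"]
start = ["XX-", "-XX", "X-X"]
pairs, destroyed = destruction_trajectory(goal, start)
```

Each level yields one training pair per cell, and different shuffles yield different trajectories, which is how a small set of designer levels can produce a much larger supervised dataset.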
4. Tiling Schemes and Boundary Conditioning
Auto-regressive tile generation enables a variety of advanced compositional schemes:
- Self-tiling: Horizontal and vertical boundary strips from a synthetic or exemplar image are split and attached to opposite edges, ensuring seamless tiling along both axes.
- Stochastic self-tiling: Multiple interiors are generated per boundary condition using random seeds, increasing diversity and reducing repetition at tiling time.
- Escher-shaped tiles: Instead of axis-aligned strips, arbitrary jagged boundary cuts are allowed, with inpainting handling boundary complexity.
- Wang tiles (C colors): Tiles indexed by four edge labels $(n, e, s, w) \in \{1, \dots, C\}^4$, with each boundary constructed from corresponding patch halves, enabling $C^4$ mutually compatible tiles.
- Dual Wang tiles: Constructs a set of diamond- and cross-shaped tiles via inpainting around diamond-edge contexts, yielding higher texture continuity and greater diversity.
Boundary assembly is performed per tile before invoking the inpainting or repair model. In all cases, no pixels from the exemplar or boundary source are copied to the final generated tile.
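The combinatorics of the Wang scheme can be checked with a short sketch. It covers only the edge-label bookkeeping, not image synthesis; the (north, east, south, west) ordering and the function names are assumptions made for illustration.

```python
import itertools
import random

def wang_tile_set(num_colors):
    """All tiles as (north, east, south, west) edge labels: num_colors**4 tiles."""
    return list(itertools.product(range(num_colors), repeat=4))

def random_wang_tiling(rows, cols, num_colors, seed=0):
    """Place tiles in scan order so that shared edges carry the same label."""
    rng = random.Random(seed)
    tiles = wang_tile_set(num_colors)
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            candidates = [
                t for t in tiles
                if (r == 0 or t[0] == grid[r - 1][c][2])    # north matches south of tile above
                and (c == 0 or t[3] == grid[r][c - 1][1])   # west matches east of tile to the left
            ]
            grid[r][c] = rng.choice(candidates)
    return grid

tiles = wang_tile_set(3)            # 3 colours on 4 edges
tiling = random_wang_tiling(4, 5, 3)
print(len(tiles))
```

With the complete set, every combination of north and west labels leaves several valid tiles to choose from, so the scan never dead-ends and the random choice among candidates is what breaks up visible repetition.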
5. Training, Inference, and Algorithmic Pipelines
In image-based pipelines, pretrained diffusion models (e.g., Stable Diffusion Inpainting U-Net) are used "as is" and operated in latent space. The mask $M$ is appended as an additional input channel; the text prompt is embedded via CLIP and injected into cross-attention layers. Each tile generation typically uses 40 denoising steps (guidance scale ≈7.5), with known pixels hard-clipped at every step.
Tile quality can be further improved by generating multiple candidate interiors per boundary condition (using distinct seeds), scoring each candidate with a no-reference image quality metric (e.g., CLIP-IQA), and keeping the highest-scoring one.
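Candidate rejection amounts to a best-of-N selection, sketched below. `toy_generate` and `toy_quality` are hypothetical placeholders: the first stands in for one seeded inpainting run, the second for a no-reference score such as CLIP-IQA (here simply negative variance, which has no perceptual meaning).

```python
import random

def best_of_n(generate, score, seeds):
    """Generate one candidate per seed and return (best_candidate, best_score)."""
    best, best_score = None, float("-inf")
    for seed in seeds:
        candidate = generate(seed)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def toy_generate(seed):
    """Placeholder for a seeded tile synthesis call: 16 random 'pixels'."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(16)]

def toy_quality(pixels):
    """Placeholder quality score (an assumption): higher for flatter tiles."""
    mean = sum(pixels) / len(pixels)
    return -sum((p - mean) ** 2 for p in pixels) / len(pixels)

tile, quality = best_of_n(toy_generate, toy_quality, seeds=range(4))
```

The cost grows linearly with the number of seeds, which is the source of the longer generation times reported for candidate rejection.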
PoD-style networks for discrete tiles are trained with categorical cross-entropy. Architecture includes 2–3 convolutional layers with ReLU, optional max pooling, and a final softmax. No dropout or weight decay is used.
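A general high-level outline of auto-regressive image tile generation is as follows (Sartor et al., 2024):
- For each tile position in the fixed scan order, assemble the known boundary pixels from already synthesized neighbors and extend the context outward by the context width.
- Build the binary mask that selects the tile interior.
- Run the pretrained inpainting diffusion model on the boundary context, mask, and text prompt, hard-clipping known pixels at every denoising step.
- Optionally repeat with distinct seeds, keep the candidate with the best no-reference quality score, and write the chosen interior into the grid.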
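A minimal Python sketch of this per-tile loop is given below, written from the steps described in this article rather than taken from the paper. The function `generate_tiled_image` and the stub `mean_fill_inpaint` are hypothetical; the stub only fills the masked region with the mean of the known pixels and stands in for a call to a pretrained inpainting model.

```python
def mean_fill_inpaint(window, mask, prompt):
    """Stub inpainter (an assumption): fill masked pixels with the mean of known ones."""
    known = [v for row, mrow in zip(window, mask) for v, m in zip(row, mrow) if not m]
    fill = sum(known) / len(known) if known else 0.5
    return [[fill if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(window, mask)]

def generate_tiled_image(rows, cols, tile, width, inpaint, prompt):
    """Synthesize a rows x cols grid of tile x tile pixel tiles in row-major order."""
    canvas = [[None] * (cols * tile) for _ in range(rows * tile)]  # None = not synthesized
    for i in range(rows):
        for j in range(cols):
            top, left = i * tile, j * tile
            # Extend the window by `width` pixels into the neighbours above and to the left.
            r0, c0 = max(top - width, 0), max(left - width, 0)
            r1, c1 = top + tile, left + tile
            window = [row[c0:c1] for row in canvas[r0:r1]]
            # Mask is 1 on the tile interior and 0 on the known boundary context.
            mask = [[1 if r >= top and c >= left else 0 for c in range(c0, c1)]
                    for r in range(r0, r1)]
            result = inpaint(window, mask, prompt)
            # Write back only the interior; boundary pixels stay as they were.
            for r in range(top, r1):
                for c in range(left, c1):
                    canvas[r][c] = result[r - r0][c - c0]
    return canvas

canvas = generate_tiled_image(rows=2, cols=3, tile=4, width=2,
                              inpaint=mean_fill_inpaint, prompt="mossy stone")
```

Replacing `mean_fill_inpaint` with a diffusion inpainting call keeps the loop unchanged, since the loop only relies on the (window, mask, prompt) interface.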
6. Empirical Performance, Metrics, and Trade-offs
Image-based approaches report per-tile generation times of ≈1.7 s (40 denoise steps) on an NVIDIA A5000. Complete Wang tile sets (C=3, 81 tiles) require ≈140 s (or ≈12 min with 4× candidate rejection); Dual Wang sets (243 tiles) require ≈7 min for direct synthesis or ≈36 min with candidate rejection. Projected speed improvements scale with GPU parallelization and faster diffusion samplers.
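The reported set-level times can be sanity-checked against the per-tile figure with a few lines of arithmetic. The sketch counts pure sampling time only; the function name is hypothetical.

```python
SECONDS_PER_TILE = 1.7  # reported per-tile time at 40 denoising steps

def sampling_time_seconds(num_tiles, candidates_per_tile=1):
    """Pure sampling time, ignoring scoring and any other overhead."""
    return num_tiles * candidates_per_tile * SECONDS_PER_TILE

wang_tiles = 3 ** 4      # C = 3 colours on 4 edges
dual_wang_tiles = 243    # as reported for the Dual Wang set

for name, count in (("Wang", wang_tiles), ("Dual Wang", dual_wang_tiles)):
    direct = sampling_time_seconds(count)
    rejected = sampling_time_seconds(count, candidates_per_tile=4)
    print(f"{name}: {direct:.0f} s direct, {rejected / 60:.1f} min with 4 candidates")
```

Direct synthesis comes out at roughly 138 s and 6.9 min, in line with the reported ≈140 s and ≈7 min. The 4-candidate estimates (about 9.2 min and 27.5 min) fall below the reported ≈12 min and ≈36 min, so the reported figures presumably include candidate scoring and other overhead; that attribution is an assumption, not something stated in the source.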
Quality and diversity are measured by CLIPScore (semantic consistency), CLIP-IQA (perceptual quality), and pairwise Inception-feature correlation (diversity). Inpainting-based Wang sets outperform previous graph-cut approaches (CLIP-IQA +0.02, feature correlation −0.8 %), with Dual Wang variants achieving maximal diversity and visual quality.
For PoD, evaluation on game levels uses Playability % (fraction of generated levels satisfying game constraints) and Uniqueness % (based on Hamming distance to any known solution). In one reported configuration, results are Playable ≈38.0 % and Unique ≈37.8 %. Other configurations and baselines are reported in Siper et al. (2022).
Trade-offs are observed between speed (smaller context, fewer steps) and boundary-continuity/feature preservation. For example, small inpainting masks are faster but degrade edge continuity, while exterior-boundary inpainting achieves sharp, semantically consistent features up to tile edges.
| Approach | Task Domain | Generation Time | Diversity | Semantic Quality |
|---|---|---|---|---|
| Diffusion inpainting (Wang) | Image (Tiling) | ≈1.7 s/tile | Highest | Highest |
| PoD | Game Level (Tiles) | dataset/model dependent | Contextual | Playability-based |
7. Relationship to Other Methods and Implications
Auto-regressive tile generation generalizes traditional procedural and example-based methods—including Markov chain tile predictors, WaveFunctionCollapse, and GAN-based (CESAGAN) models—by offering principled modularity and context-conditioning. In image synthesis, it leverages pretrained generative priors and cross-modal (text/image) guidance for flexible, semantic, and diverse generation (Sartor et al., 2024). For discrete domains, supervised auto-regression enables few-shot generalization from small datasets and strong contextual localization (Siper et al., 2022).
A plausible implication is that, by decoupling structure from content and employing explicit boundary guidance, auto-regressive pipelines can support arbitrary and novel tiling schemes (such as Dual Wang), while avoiding explicit copying and preserving maximum diversity. This approach immediately generalizes to new domains and supports extension to higher-order or irregular tiling geometries.
Further, the modularity of per-tile generation admits mixed-initiative content creation, direct quality control, and ease of parallelization. This suggests broad applicability in both artistic and programmatic content generation, with controllable trade-offs among speed, diversity, and boundary fidelity.