CityCraft Framework: 3D City Generation
- CityCraft is a modular framework for 3D city generation integrating diffusion transformer layout synthesis, LLM-driven land-use planning, and procedural scene assembly.
- It leverages novel datasets, CityCraft-OSM and CityCraft-Buildings, to achieve state-of-the-art performance in urban layout fidelity and scene realism.
- The framework employs multi-stage generative modeling and optimization to overcome limitations in existing systems and enhance diversity and controllability.
CityCraft is a modular framework for realistic, diverse, and controllable 3D city generation, integrating a diffusion transformer layout generator, LLM-driven land-use planning, and high-quality asset-based 3D scene construction. The framework is designed to address limitations in existing city generation systems—including lack of diversity, realism, and planning—by leveraging multi-stage generative modeling, strategic LLM reasoning, and precise procedural scene assembly. CityCraft introduces new datasets (CityCraft-OSM and CityCraft-Buildings) and demonstrates state-of-the-art performance on both image-based and 3D urban realism metrics, significantly improving over prior approaches in controllability and scene fidelity (Deng et al., 2024).
1. System Structure and Workflow
The CityCraft pipeline is composed of three sequential stages, each targeting a major component in synthetic city generation (see Fig. 1 and Algorithm 1 in the supplementary):
- 2D Layout Generation: Inputs are user specifications $(c, r)$, where $c$ is a text prompt and $r$ is a class-ratio vector over the semantic classes. The system produces a semantic layout mask $L$ using a diffusion transformer (DiT) model conditioned on both $c$ and $r$.
- LLM-Based Land-Use Planning: The layout is analyzed to isolate connected components (instances), each with spatial and contextual metadata. For each instance $i$, an LLM (GPT-4V or GPT-4) generates a plan specifying land-use function, architectural style, building height, and planner reasoning. The planning process is iterative and multi-round, with re-querying and convergence based on change thresholds.
- Asset Retrieval & Scene Assembly: For each planned instance, the framework retrieves the most semantically relevant 3D model from the CityCraft-Buildings library, using a cross-modal similarity function based on CLIP and SBERT embeddings. Assets are placed precisely in Blender via optimization (Powell’s method) and procedural rules, constructing a 3D scene graph with hierarchy from zones down to furniture.
A condensed pseudo-code outlining the end-to-end pipeline is provided in the supplementary material (Sec. A.1, Alg. 1), reflecting the above stages.
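As a rough illustration of this control flow (not the authors' released code), the following Python sketch stubs out the three stages; `generate_layout`, `plan_instances`, and `assemble_scene` are hypothetical stand-ins for the DiT sampler, LLM planner, and Blender assembler:

```python
# Hypothetical end-to-end control flow; all three functions are stubs
# standing in for CityCraft's real modules.
import numpy as np

def generate_layout(prompt: str, ratios: np.ndarray, size: int = 768) -> np.ndarray:
    """Stage 1 stub: would sample the text/ratio-conditioned DiT."""
    rng = np.random.default_rng(0)
    return rng.choice(len(ratios), size=(size, size), p=ratios)  # semantic class map

def plan_instances(layout: np.ndarray) -> list:
    """Stage 2 stub: would extract connected components and query the LLM."""
    classes, counts = np.unique(layout, return_counts=True)
    return [{"class": int(c), "area_px": int(n), "plan": "TBD by LLM"}
            for c, n in zip(classes, counts)]

def assemble_scene(layout: np.ndarray, plans: list) -> dict:
    """Stage 3 stub: would retrieve assets and place them in Blender."""
    return {"layout": layout, "plans": plans, "meshes": []}

ratios = np.array([0.35, 0.25, 0.20, 0.10, 0.05, 0.03, 0.02])  # 7 classes, sums to 1
layout = generate_layout("dense downtown with a riverside park", ratios)
scene = assemble_scene(layout, plan_instances(layout))
```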
2. Diffusion Transformer for Layout Synthesis
The layout generator is based on the DiT-B/2 architecture, a transformer denoiser that operates in VAE-encoded latent space (replacing the conventional U-Net backbone). Layout masks of size 768×768 are encoded via SDXL-VAE, denoised through DDPM steps, and decoded for reconstruction.
- Conditioning: The class-ratio vector $r$, normalized so that $\sum_k r_k = 1$, and a text prompt (captioned from satellite imagery via GPT-4V) are mapped through an MLP to FiLM layer-norm parameters in each transformer block (see the sketch after this list).
- Diffusion Objective: The process follows the DDPM formulation,
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right),$$
with the noise-prediction loss
$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, c, r) \right\|^2\right],$$
using a cosine variance schedule.
- Infinite Expansion: The outpainting mechanism, based on Blended Diffusion, enables iterative expansion of layouts, allowing the generation of unbounded cities with seamless region stitching.
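A minimal PyTorch sketch of the conditioning and objective described above, assuming illustrative dimensions and a linear stand-in for the cosine schedule; `FiLMBlock` and the `model(x_t, t, cond)` signature are hypothetical, not the paper's exact code:

```python
# FiLM-style conditioning: an MLP maps the condition vector to per-block
# scale/shift parameters applied after a parameter-free layer norm.
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_film = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 2 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_film(cond).chunk(2, dim=-1)        # (B, dim) each
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + self.attn(h, h, h)[0]                          # residual attention

def ddpm_loss(model, x0, cond, alphas_cumprod):
    """Epsilon-prediction loss: noise x0 to a random timestep, predict the noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, *[1] * (x0.dim() - 1))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t, cond), eps)

block = FiLMBlock(dim=256, cond_dim=64)
out = block(torch.randn(2, 16, 256), torch.randn(2, 64))          # (B, tokens, dim)
schedule = torch.linspace(0.999, 0.01, 1000)                      # linear stand-in
loss = ddpm_loss(lambda x_t, t, c: torch.zeros_like(x_t),         # dummy model
                 torch.randn(4, 8), torch.randn(4, 64), schedule)
```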
3. LLM-Driven Urban Planning
For each instance (e.g., a building footprint) detected in the semantic layout $L$, an information dictionary is constructed, containing attributes such as area, bounding box, centroid, distance to the nearest roads, and identified neighbors.
- Prompting Template: The LLM receives descriptive prompts structured by instance metadata, requesting predictions for primary/secondary function, style, height, and justifications.
- Multi-Round Refinement: Because initial LLM outputs may lack context when neighbor plans are not yet available, refinement rounds are performed. The proportion of changed plans is monitored, and iteration terminates once it falls below a predefined threshold (e.g., $0.05$); a sketch of this loop follows the list.
- Constraint Enforcement: Post-processing applies spatial constraints, including non-overlap, bounding-box snapping, and enforcement of minimum spacing rules for plausible urban layouts.
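A minimal sketch of that convergence loop, with a stubbed `query_llm` standing in for the GPT-4/GPT-4V call:

```python
# Re-query the planner until the fraction of instances whose plan
# changed in a round drops below the convergence threshold.

def query_llm(instance: dict, neighbor_plans: dict) -> dict:
    """Stub: would prompt the LLM with instance metadata plus neighbor context."""
    return {"function": "residential", "style": "modern", "height_m": 18}

def refine_plans(instances: list, threshold: float = 0.05, max_rounds: int = 5) -> dict:
    plans: dict = {}
    for _ in range(max_rounds):
        changed = 0
        for inst in instances:
            neighbors = {n: plans.get(n) for n in inst["neighbors"]}
            new_plan = query_llm(inst, neighbors)
            if plans.get(inst["id"]) != new_plan:
                changed += 1
            plans[inst["id"]] = new_plan
        if changed / max(len(instances), 1) < threshold:   # convergence check
            break
    return plans

instances = [{"id": 1, "neighbors": [2]}, {"id": 2, "neighbors": [1]}]
plans = refine_plans(instances)
```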
4. Asset Retrieval, Placement, and Scene Construction
Asset selection draws on the CityCraft-Buildings dataset, in which each entry is annotated with 12-view CLIP embeddings and SBERT-encoded metadata (style, floors, function).
- Retrieval Pipeline: Candidates are pre-filtered by function and approximate scale. The multimodal similarity score is
$$s(a, q) = w_v \cos\left(f_v^{a}, f_v^{q}\right) + w_t \cos\left(f_t^{a}, f_t^{q}\right),$$
where $f_v$ and $f_t$ are the CLIP and SBERT embeddings of asset $a$ and query $q$, and $w_v$, $w_t$ are modality weights (Eq. 1; see the sketch after this list).
- Placement Optimization: 2D mask regions map to Blender world coordinates. For each asset, scale and rotation are chosen to minimize the discrepancy between the projected mesh footprint and the semantic mask region, using Powell’s optimization method.
- Scene Assembly: Hierarchical scene graphs are built, incorporating adjacency snapping (for block coherence), asset placement, texturing, and integration of environmental features (vegetation, furniture). The result is a rendered 3D city tile.
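The two numeric steps above can be sketched as follows, assuming toy embeddings, a rectangular footprint, and a simplified bounding-box discrepancy in place of the actual mesh-vs-mask objective; only the use of Powell's derivative-free method (via SciPy) reflects the stated approach:

```python
# (i) Weighted multimodal similarity over precomputed CLIP/SBERT embeddings;
# (ii) fitting an asset's scale and rotation to a mask region with Powell's method.
import numpy as np
from scipy.optimize import minimize

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score(q_img, q_txt, a_img, a_txt, w_v=0.5, w_t=0.5) -> float:
    # Eq. 1: weighted sum of visual (CLIP) and textual (SBERT) similarities.
    return w_v * cosine(q_img, a_img) + w_t * cosine(q_txt, a_txt)

rng = np.random.default_rng(0)
q_img, q_txt = rng.random(512), rng.random(384)             # toy query embeddings
assets = [(rng.random(512), rng.random(384)) for _ in range(5)]
best = max(range(len(assets)), key=lambda i: score(q_img, q_txt, *assets[i]))

def footprint_fit_loss(params, footprint, target_wh):
    # Toy discrepancy: rotated, scaled footprint's extents vs. the mask
    # region's width/height (a stand-in for mesh-vs-mask overlap).
    scale, theta = params
    c, s = np.cos(theta), np.sin(theta)
    pts = footprint * scale @ np.array([[c, -s], [s, c]]).T
    return float(np.sum((pts.max(axis=0) - pts.min(axis=0) - target_wh) ** 2))

footprint = np.array([[0.0, 0.0], [12.0, 0.0], [12.0, 8.0], [0.0, 8.0]])  # metres
result = minimize(footprint_fit_loss, x0=[1.0, 0.0],
                  args=(footprint, np.array([20.0, 10.0])),
                  method="Powell")                          # derivative-free
fitted_scale, fitted_rotation = result.x                    # fed to Blender placement
```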
5. Dataset Contributions
CityCraft introduces two open datasets to the community:
| Dataset | Content and Use | Key Properties |
|---|---|---|
| CityCraft-OSM | 2D semantic layouts from OSM, paired satellite images, human-corrected captions | 100,000 patches (768×768 at 0.5 m/pixel); 7 classes; spans America, Europe, Asia |
| CityCraft-Buildings | 3D building assets with multi-view renders, text annotations | 2,000 Sketchfab models, 12 views each |
CityCraft-OSM is used for training and evaluation of the DiT layout module, while CityCraft-Buildings supports retrieval and compositional realism during scene generation.
6. Training Protocols and Inference Bottlenecks
- Layout Generation: The DiT-B/2 model is trained on 8× RTX 4090 GPUs. Key hyperparameters include the AdamW optimizer (weight decay $0.01$), linear learning-rate warmup followed by cosine decay, batch size 256, and $1$M steps (a configuration sketch follows this list). Layout sampling uses 20 DDIM steps (~12 s per tile).
- LLM Planning: GPT-4V and GPT-4 are used zero-shot (prompting only); convergence is typically reached in 3–5 rounds, at roughly 0.5 s per instance per round.
- Asset Retrieval: OpenCLIP (ViT-L/14) and SBERT embeddings are indexed with Faiss for ~1 s nearest-neighbor lookup.
- Scene Construction: Placement optimization averages $0.2$ s per instance and Blender scripting about 2 s per patch, yielding roughly 30 s end-to-end per city tile.
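A sketch of that optimization recipe in PyTorch; the peak learning rate and warmup length are not given in the text, so the values below are placeholders:

```python
# AdamW with weight decay 0.01, linear warmup then cosine decay to zero.
import math
import torch

model = torch.nn.Linear(256, 256)                    # stand-in for DiT-B/2
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # lr assumed

warmup_steps, total_steps = 10_000, 1_000_000        # 1M steps per the text; warmup assumed

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                          # linear warmup phase
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay phase

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Inside the training loop: loss.backward(); opt.step(); sched.step()
```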
7. Evaluation, Ablations, and Limitations
Quantitative and qualitative benchmarks (see Tables 2–5, §5.2–5.4, §A.4) establish that CityCraft produces layouts and scenes of substantially higher fidelity and diversity than prior city generators:
- 2D Layout Quality: FID/KID for CityCraft (27.60/0.022) improve substantially over CityGen (88.38/0.089). User preference scores (1–10) also favor CityCraft (8.6 vs. 7.5). A sketch of the metric computation follows this list.
- 3D Scene Realism: Multi-view consistency shows zero depth/camera error for CityCraft, outperforming CityDreamer, with higher user-rated preference (9.2 vs. 7.6).
- Ablation Results: Ratio control yields best observed diversity and accuracy (ACE); qualitative steering and semantic compliance are verified.
- User Study: 22 participants rated fidelity, style, controllability, clarity, sharpness, and overall quality.
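For reference, FID and KID can be computed as below with torchmetrics; this is an illustrative recipe, not necessarily the authors' exact evaluation code, and the random tensors stand in for real and generated layout renders:

```python
# FID/KID over uint8 image batches; torchmetrics wraps the standard
# InceptionV3 feature extractor used for both metrics.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real = torch.randint(0, 255, (64, 3, 299, 299), dtype=torch.uint8)  # real layouts
fake = torch.randint(0, 255, (64, 3, 299, 299), dtype=torch.uint8)  # generated layouts

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)    # subset size must fit the sample count
for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```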
Documented limitations include insufficient diversity of asset types (no vehicles or street furniture), occasional reliance on heuristics when LLM plans are physically implausible, visible seams from outpainting, and the lack of dynamic elements (e.g., traffic simulation). Proposed future directions include asset library expansion, dynamic simulation, regulatory constraints, holistic road network optimization, and an end-to-end differentiable pipeline.
CityCraft constitutes a substantive advance in synthetic city generation, providing a rigorously benchmarked, dataset-driven, and highly modular system for 3D urban content creation (Deng et al., 2024).