CityCraft Framework: 3D City Generation
- CityCraft is a modular framework for 3D city generation integrating diffusion transformer layout synthesis, LLM-driven land-use planning, and procedural scene assembly.
- It leverages novel datasets, CityCraft-OSM and CityCraft-Buildings, to achieve state-of-the-art performance in urban layout fidelity and scene realism.
- The framework employs multi-stage generative modeling and optimization to overcome limitations in existing systems and enhance diversity and controllability.
CityCraft is a modular framework for realistic, diverse, and controllable 3D city generation, integrating a diffusion transformer layout generator, LLM-driven land-use planning, and high-quality asset-based 3D scene construction. The framework is designed to address limitations in existing city generation systems—including lack of diversity, realism, and planning—by leveraging multi-stage generative modeling, strategic LLM reasoning, and precise procedural scene assembly. CityCraft introduces new datasets (CityCraft-OSM and CityCraft-Buildings) and demonstrates state-of-the-art performance on both image-based and 3D urban realism metrics, significantly improving over prior approaches in controllability and scene fidelity (Deng et al., 2024).
1. System Structure and Workflow
The CityCraft pipeline is composed of three sequential stages, each targeting a major component in synthetic city generation (see Fig. 1 and Algorithm 1 in the supplementary):
- 2D Layout Generation: Inputs are user specifications $(c, r)$, where $c$ is a text prompt and $r$ is a class-ratio vector over the semantic classes. The system produces a semantic layout mask $L$ using a diffusion transformer (DiT) model conditioned on both $c$ and $r$.
- LLM-Based Land-Use Planning: The layout is analyzed to isolate connected components (instances), each with spatial and contextual metadata. For each instance $i$, an LLM (GPT-4V or GPT-4) generates a plan specifying land-use function, architectural style, building height, and planner reasoning. The planning process is iterative and multi-round, with re-querying and convergence based on change thresholds.
- Asset Retrieval & Scene Assembly: For each planned instance, the framework retrieves the most semantically relevant 3D model from the CityCraft-Buildings library, using a cross-modal similarity function based on CLIP and SBERT embeddings. Assets are placed precisely in Blender via optimization (Powell’s method) and procedural rules, constructing a 3D scene graph with hierarchy from zones down to furniture.
A condensed pseudo-code outlining the end-to-end pipeline is provided in the supplementary material (Sec. A.1, Alg. 1), reflecting the above stages.
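As a rough illustration of this control flow (not the authors' released code), the following Python sketch stubs out the three stages; `generate_layout`, `plan_instances`, and `assemble_scene` are hypothetical stand-ins for the DiT sampler, LLM planner, and Blender assembler:

```python
# Hypothetical end-to-end control flow; all three functions are stubs
# standing in for CityCraft's real modules.
import numpy as np

def generate_layout(prompt: str, ratios: np.ndarray, size: int = 768) -> np.ndarray:
    """Stage 1 stub: would sample the text/ratio-conditioned DiT."""
    rng = np.random.default_rng(0)
    return rng.choice(len(ratios), size=(size, size), p=ratios)  # semantic class map

def plan_instances(layout: np.ndarray) -> list:
    """Stage 2 stub: would extract connected components and query the LLM."""
    classes, counts = np.unique(layout, return_counts=True)
    return [{"class": int(c), "area_px": int(n), "plan": "TBD by LLM"}
            for c, n in zip(classes, counts)]

def assemble_scene(layout: np.ndarray, plans: list) -> dict:
    """Stage 3 stub: would retrieve assets and place them in Blender."""
    return {"layout": layout, "plans": plans, "meshes": []}

ratios = np.array([0.35, 0.25, 0.20, 0.10, 0.05, 0.03, 0.02])  # 7 classes, sums to 1
layout = generate_layout("dense downtown with a riverside park", ratios)
scene = assemble_scene(layout, plan_instances(layout))
```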
2. Diffusion Transformer for Layout Synthesis
The layout generator is based on the DiT-B/2 architecture, a transformer denoiser that operates in VAE-encoded latent space (replacing the conventional U-Net backbone). Layout masks of size 768×768 are encoded via SDXL-VAE, denoised through DDPM steps, and decoded for reconstruction.
- Conditioning: The class-ratio vector $r$, normalized so that $\sum_k r_k = 1$, and a text prompt (captioned from satellite imagery via GPT-4V) are mapped through an MLP to FiLM layer-norm parameters in each transformer block (see the sketch after this list).
- Diffusion Objective: The process follows the DDPM formulation,
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right),$$
with the noise-prediction loss
$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, c, r) \right\|^2\right],$$
using a cosine variance schedule.
- Infinite Expansion: The outpainting mechanism, based on Blended Diffusion, enables iterative expansion of layouts, allowing the generation of unbounded cities with seamless region stitching.
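A minimal PyTorch sketch of the conditioning and objective described above, assuming illustrative dimensions and a linear stand-in for the cosine schedule; `FiLMBlock` and the `model(x_t, t, cond)` signature are hypothetical, not the paper's exact code:

```python
# FiLM-style conditioning: an MLP maps the condition vector to per-block
# scale/shift parameters applied after a parameter-free layer norm.
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_film = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 2 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_film(cond).chunk(2, dim=-1)        # (B, dim) each
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + self.attn(h, h, h)[0]                          # residual attention

def ddpm_loss(model, x0, cond, alphas_cumprod):
    """Epsilon-prediction loss: noise x0 to a random timestep, predict the noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, *[1] * (x0.dim() - 1))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t, cond), eps)

block = FiLMBlock(dim=256, cond_dim=64)
out = block(torch.randn(2, 16, 256), torch.randn(2, 64))          # (B, tokens, dim)
schedule = torch.linspace(0.999, 0.01, 1000)                      # linear stand-in
loss = ddpm_loss(lambda x_t, t, c: torch.zeros_like(x_t),         # dummy model
                 torch.randn(4, 8), torch.randn(4, 64), schedule)
```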
3. LLM-Driven Urban Planning
For each instance (e.g., a building footprint) detected in the semantic layout $L$, an information dictionary is constructed, containing attributes such as area, bounding box, centroid, distance to the nearest roads, and identified neighbors.
- Prompting Template: The LLM receives descriptive prompts structured by instance metadata, requesting predictions for primary/secondary function, style, height, and justifications.
- Multi-Round Refinement: Because initial LLM outputs may lack context when neighbor plans are not yet available, refinement rounds are performed. The proportion of changed plans is monitored, and iteration terminates once it falls below a predefined threshold (e.g., $0.05$); a sketch of this loop follows the list.
- Constraint Enforcement: Post-processing applies spatial constraints, including non-overlap, bounding-box snapping, and enforcement of minimum spacing rules for plausible urban layouts.
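A minimal sketch of that convergence loop, with a stubbed `query_llm` standing in for the GPT-4/GPT-4V call:

```python
# Re-query the planner until the fraction of instances whose plan
# changed in a round drops below the convergence threshold.

def query_llm(instance: dict, neighbor_plans: dict) -> dict:
    """Stub: would prompt the LLM with instance metadata plus neighbor context."""
    return {"function": "residential", "style": "modern", "height_m": 18}

def refine_plans(instances: list, threshold: float = 0.05, max_rounds: int = 5) -> dict:
    plans: dict = {}
    for _ in range(max_rounds):
        changed = 0
        for inst in instances:
            neighbors = {n: plans.get(n) for n in inst["neighbors"]}
            new_plan = query_llm(inst, neighbors)
            if plans.get(inst["id"]) != new_plan:
                changed += 1
            plans[inst["id"]] = new_plan
        if changed / max(len(instances), 1) < threshold:   # convergence check
            break
    return plans

instances = [{"id": 1, "neighbors": [2]}, {"id": 2, "neighbors": [1]}]
plans = refine_plans(instances)
```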
4. Asset Retrieval, Placement, and Scene Construction
Asset selection draws on the CityCraft-Buildings dataset, in which each entry is annotated with 12-view CLIP embeddings and SBERT-encoded metadata (style, floors, function).
- Retrieval Pipeline: Candidates are pre-filtered by function and approximate scale. The multimodal similarity score is
$$s(a, q) = w_v \cos\left(f_v^{a}, f_v^{q}\right) + w_t \cos\left(f_t^{a}, f_t^{q}\right),$$
where $f_v$ and $f_t$ are the CLIP and SBERT embeddings of asset $a$ and query $q$, and $w_v$, $w_t$ are modality weights (Eq. 1; see the sketch after this list).
- Placement Optimization: 2D mask regions map to Blender world coordinates. For each asset, scale and rotation are chosen to minimize the discrepancy between the projected mesh footprint and the semantic mask region, using Powell’s optimization method.
- Scene Assembly: Hierarchical scene graphs are built, incorporating adjacency snapping (for block coherence), asset placement, texturing, and integration of environmental features (vegetation, furniture). The result is a rendered 3D city tile.
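The two numeric steps above can be sketched as follows, assuming toy embeddings, a rectangular footprint, and a simplified bounding-box discrepancy in place of the actual mesh-vs-mask objective; only the use of Powell's derivative-free method (via SciPy) reflects the stated approach:

```python
# (i) Weighted multimodal similarity over precomputed CLIP/SBERT embeddings;
# (ii) fitting an asset's scale and rotation to a mask region with Powell's method.
import numpy as np
from scipy.optimize import minimize

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score(q_img, q_txt, a_img, a_txt, w_v=0.5, w_t=0.5) -> float:
    # Eq. 1: weighted sum of visual (CLIP) and textual (SBERT) similarities.
    return w_v * cosine(q_img, a_img) + w_t * cosine(q_txt, a_txt)

rng = np.random.default_rng(0)
q_img, q_txt = rng.random(512), rng.random(384)             # toy query embeddings
assets = [(rng.random(512), rng.random(384)) for _ in range(5)]
best = max(range(len(assets)), key=lambda i: score(q_img, q_txt, *assets[i]))

def footprint_fit_loss(params, footprint, target_wh):
    # Toy discrepancy: rotated, scaled footprint's extents vs. the mask
    # region's width/height (a stand-in for mesh-vs-mask overlap).
    scale, theta = params
    c, s = np.cos(theta), np.sin(theta)
    pts = footprint * scale @ np.array([[c, -s], [s, c]]).T
    return float(np.sum((pts.max(axis=0) - pts.min(axis=0) - target_wh) ** 2))

footprint = np.array([[0.0, 0.0], [12.0, 0.0], [12.0, 8.0], [0.0, 8.0]])  # metres
result = minimize(footprint_fit_loss, x0=[1.0, 0.0],
                  args=(footprint, np.array([20.0, 10.0])),
                  method="Powell")                          # derivative-free
fitted_scale, fitted_rotation = result.x                    # fed to Blender placement
```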
5. Dataset Contributions
CityCraft introduces two open datasets to the community:
| Dataset | Content and Use | Key Properties |
|---|---|---|
| CityCraft-OSM | 2D semantic layouts from OSM, paired satellite images, human-corrected captions | 100,000 patches (768×768 at 0.5 m/pixel); 7 classes; spans America, Europe, Asia |
| CityCraft-Buildings | 3D building assets with multi-view renders, text annotations | 2,000 Sketchfab models, 12 views each |
CityCraft-OSM is used for training and evaluation of the DiT layout module, while CityCraft-Buildings supports retrieval and compositional realism during scene generation.
6. Training Protocols and Inference Bottlenecks
- Layout Generation: The DiT-B/2 model is trained on 8× RTX 4090 GPUs. Key hyperparameters include the AdamW optimizer (weight decay $0.01$), linear learning-rate warmup followed by cosine decay, batch size 256, and $1$M steps (a configuration sketch follows this list). Layout sampling uses 20 DDIM steps (~12 s per tile).
- LLM Planning: GPT-4V and GPT-4 are used zero-shot (prompting only); convergence is typically reached in 3–5 rounds, at roughly 0.5 s per instance per round.
- Asset Retrieval: OpenCLIP (ViT-L/14) and SBERT embeddings are indexed with Faiss for ~1 s nearest-neighbor lookup.
- Scene Construction: Placement optimization averages $0.2$ s per instance and Blender scripting about 2 s per patch, yielding roughly 30 s end-to-end per city tile.
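A sketch of that optimization recipe in PyTorch; the peak learning rate and warmup length are not given in the text, so the values below are placeholders:

```python
# AdamW with weight decay 0.01, linear warmup then cosine decay to zero.
import math
import torch

model = torch.nn.Linear(256, 256)                    # stand-in for DiT-B/2
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # lr assumed

warmup_steps, total_steps = 10_000, 1_000_000        # 1M steps per the text; warmup assumed

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                          # linear warmup phase
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay phase

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Inside the training loop: loss.backward(); opt.step(); sched.step()
```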
7. Evaluation, Ablations, and Limitations
Quantitative and qualitative benchmarks (see Tables 2–5, §5.2–5.4, §A.4) establish that CityCraft produces layouts and scenes of substantially higher fidelity and diversity than prior city generators:
- 2D Layout Quality: FID/KID for CityCraft (27.60/0.022) improve substantially over CityGen (88.38/0.089). User preference scores (1–10) also favor CityCraft (8.6 vs. 7.5). A sketch of the metric computation follows this list.
- 3D Scene Realism: Multi-view consistency shows zero depth/camera error for CityCraft, outperforming CityDreamer, with higher user-rated preference (9.2 vs. 7.6).
- Ablation Results: Ratio control yields best observed diversity and accuracy (ACE); qualitative steering and semantic compliance are verified.
- User Study: 22 participants rated fidelity, style, controllability, clarity, sharpness, and overall quality.
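For reference, FID and KID can be computed as below with torchmetrics; this is an illustrative recipe, not necessarily the authors' exact evaluation code, and the random tensors stand in for real and generated layout renders:

```python
# FID/KID over uint8 image batches; torchmetrics wraps the standard
# InceptionV3 feature extractor used for both metrics.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real = torch.randint(0, 255, (64, 3, 299, 299), dtype=torch.uint8)  # real layouts
fake = torch.randint(0, 255, (64, 3, 299, 299), dtype=torch.uint8)  # generated layouts

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)    # subset size must fit the sample count
for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```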
Documented limitations include insufficient diversity of asset types (no vehicles or street furniture), occasional reliance on heuristics when LLM plans are physically implausible, visible seams from outpainting, and the lack of dynamic elements (e.g., traffic simulation). Proposed future directions include asset library expansion, dynamic simulation, regulatory constraints, holistic road network optimization, and an end-to-end differentiable pipeline.
CityCraft constitutes a substantive advance in synthetic city generation, providing a rigorously benchmarked, dataset-driven, and highly modular system for 3D urban content creation (Deng et al., 2024).