CityCraft: A Real Crafter for 3D City Generation (2406.04983v1)

Published 7 Jun 2024 in cs.CV

Abstract: City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model (LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) the CityCraft-OSM dataset, including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations; and 2) the CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

Citations (7)

Summary

  • The paper introduces CityCraft, a multi-stage framework for generating diverse and high-quality 3D urban environments using a Diffusion Transformer for layout, an LLM for planning, and asset integration.
  • The methodology involves three stages: DiT for controllable 2D layout synthesis, LLM for strategic land-use planning based on text prompts, and Blender with optimized asset placement for 3D scene construction.
  • Evaluations show state-of-the-art layout generation performance (FID 27.60) and high user preference for generated layouts (8.6) and final 3D scenes (9.2), supported by novel CityCraft-OSM and CityCraft-Buildings datasets.

CityCraft introduces a multi-stage framework for generating diverse and high-quality 3D urban environments, addressing limitations of existing two-stage approaches, which often suffer from a lack of diversity, rendering artifacts, and insufficient planning capabilities (2406.04983). The framework integrates layout generation using a Diffusion Transformer (DiT), land-use planning via an LLM, and scene construction utilizing Blender and a dedicated asset library.

Methodology

The CityCraft framework decomposes the complex task of 3D city generation into three distinct, sequential stages:

Stage 1: Layout Generation with Diffusion Transformer (DiT)

This initial stage focuses on generating 2D semantic city layouts. A Diffusion Transformer (DiT) model is employed, diverging from common VAE, GAN, or standard Transformer approaches. The DiT architecture is leveraged for its known strength in generating high-fidelity and diverse outputs. Key aspects include:

  • Conditional Generation: The DiT model is designed to accept user controls, specifically class-ratios (e.g., percentage of residential vs. commercial areas) and textual prompts, enabling controllable synthesis of layouts that align with specific requirements.
  • Infinite Expansion: The generation process supports the creation of arbitrarily large layouts by iteratively expanding the generated map, facilitating the construction of large-scale urban environments.
  • Input/Output: The model typically takes noise and optional conditioning variables (class labels encoded via adaptive layer normalization or text embeddings) as input and iteratively denoises it to produce a 2D semantic map where different pixel values correspond to different land-use types (e.g., roads, buildings, parks).
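
As a rough illustration of what class-ratio control means in practice, the sketch below measures the class ratios of a 2D semantic map and compares them with a requested target. The label ids, class names, and the `ratio_error` helper are hypothetical and chosen for illustration only; they are not taken from the paper or its datasets.

```python
from collections import Counter

# Hypothetical label ids; the actual CityCraft-OSM class list may differ.
CLASSES = {0: "road", 1: "building", 2: "park"}

def class_ratios(layout):
    """Return the fraction of pixels assigned to each class in a 2D label grid."""
    counts = Counter(label for row in layout for label in row)
    total = sum(counts.values())
    return {name: counts.get(idx, 0) / total for idx, name in CLASSES.items()}

def ratio_error(layout, target):
    """Largest absolute gap between measured and requested class ratios."""
    measured = class_ratios(layout)
    return max(abs(measured[name] - target.get(name, 0.0)) for name in measured)

# A toy 4x4 semantic map standing in for a generated layout.
layout = [
    [0, 0, 0, 0],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 0, 0],
]
target = {"road": 0.5, "building": 0.25, "park": 0.25}
print(class_ratios(layout), ratio_error(layout, target))
```

A check of this kind could be used to judge how closely a generated layout follows the requested ratios, independently of how the conditioning is implemented inside the model.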

Stage 2: Strategic Urban Planning with LLM

Given a generated 2D layout, the second stage utilizes an LLM for strategic land-use planning. This moves beyond simple semantic segmentation to impart functional logic and coherence to the urban space based on textual guidance.

  • Prompt-Based Planning: The LLM takes the 2D layout and user-provided prompts or language guidelines (e.g., "designate a downtown area with high-rise commercial buildings near the main intersection" or "place residential zones away from industrial areas") as input.
  • Spatial-Semantic Analysis: The LLM analyzes the spatial configuration and semantic context of the input layout to propose optimal space usage and functional zoning. It effectively translates high-level textual requirements into specific land-use assignments within the layout structure.
  • Iterative Refinement: An iterative process is employed where the LLM's plan can be refined over multiple rounds. This ensures the coherence, stability, and feasibility of the generated urban plan, resolving potential conflicts or inconsistencies. The necessity of this multi-round refinement was validated through ablation studies.
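
The multi-round refinement described above can be pictured as a check-and-revise loop. The following is a minimal sketch under stated assumptions: the block ids, the adjacency list, the forbidden-pair rule, and `stub_planner` (a stand-in for an actual LLM call) are all hypothetical, and the paper's real prompts and stopping criteria are not reproduced here.

```python
def find_conflicts(plan, adjacency, forbidden):
    """Return pairs of adjacent blocks whose land uses must not touch."""
    conflicts = []
    for a, b in adjacency:
        if frozenset((plan[a], plan[b])) in forbidden:
            conflicts.append((a, b))
    return conflicts

def refine_plan(plan, adjacency, forbidden, propose_fix, max_rounds=5):
    """Re-query the planner until no conflicts remain or rounds run out."""
    plan = dict(plan)
    for round_idx in range(max_rounds):
        conflicts = find_conflicts(plan, adjacency, forbidden)
        if not conflicts:
            return plan, round_idx
        plan = propose_fix(plan, conflicts)
    return plan, max_rounds

def stub_planner(plan, conflicts):
    """Hypothetical stand-in for an LLM call: turn the residential block
    of each conflicting pair into a park buffer."""
    fixed = dict(plan)
    for a, b in conflicts:
        target = a if fixed[a] == "residential" else b
        fixed[target] = "park"
    return fixed

plan = {"A": "industrial", "B": "residential", "C": "commercial"}
adjacency = [("A", "B"), ("B", "C")]
forbidden = {frozenset(("industrial", "residential"))}

final_plan, rounds = refine_plan(plan, adjacency, forbidden, stub_planner)
print(final_plan, rounds)
```

In a real system the `propose_fix` callback would presumably send the current plan and the detected conflicts back to the LLM as text; capping the number of rounds keeps the loop from running indefinitely if the planner cannot resolve a conflict.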

Stage 3: Asset Integration and Scene Construction

The final stage translates the 2D layout and the corresponding LLM-generated urban plan into a full 3D scene.

  • Asset Retrieval: Based on the planned land use for different regions in the layout, an asset retrieval module selects appropriate 3D models from the purpose-built CityCraft-Buildings dataset. This dataset contains thousands of diverse, high-quality 3D building assets.
  • Precise Placement: Blender is utilized as the 3D modeling and rendering environment. For placing retrieved assets onto the layout, the Powell optimization algorithm is employed. This algorithm determines the optimal scale and rotation for each asset to ensure it fits appropriately within its designated plot boundaries on the 2D layout while maintaining realistic proportions and orientations.
  • Rendering: Once assets are placed, Blender's rendering engine is used to generate the final 3D visualization of the city scene.
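
To make the placement step more concrete, the sketch below fits a rectangular asset footprint to a rectangular plot by searching over scale and rotation. Two assumptions should be stated loudly: the cost function (squared mismatch between the rotated footprint's bounding box and the plot) is invented for illustration, and the optimizer is a plain cyclic coordinate search with golden-section line searches, which is only the first ingredient of Powell's method (no direction updates). With SciPy available, `scipy.optimize.minimize(cost, x0, method="Powell")` would be the more faithful choice.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618

def golden_section(f, lo, hi, tol=1e-9):
    """Minimise a unimodal 1-D function on [lo, hi] without derivatives."""
    while hi - lo > tol:
        c = hi - INV_PHI * (hi - lo)
        d = lo + INV_PHI * (hi - lo)
        if f(c) < f(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
    return (lo + hi) / 2

def fit_cost(scale, theta, asset_wh, plot_wh):
    """Assumed objective: squared mismatch between the bounding box of the
    scaled, rotated footprint and the plot (theta restricted to [0, pi/2])."""
    w, h = asset_wh
    plot_w, plot_h = plot_wh
    box_w = scale * (w * math.cos(theta) + h * math.sin(theta))
    box_h = scale * (w * math.sin(theta) + h * math.cos(theta))
    return (box_w - plot_w) ** 2 + (box_h - plot_h) ** 2

def fit_asset(asset_wh, plot_wh, scale=1.0, theta=0.3, sweeps=100):
    """Alternate 1-D line searches over scale and rotation."""
    for _ in range(sweeps):
        scale = golden_section(
            lambda s: fit_cost(s, theta, asset_wh, plot_wh), 0.1, 10.0)
        theta = golden_section(
            lambda t: fit_cost(scale, t, asset_wh, plot_wh), 0.0, math.pi / 2)
    return scale, theta

# A 4x2 footprint placed on a 20x10 plot with the same aspect ratio.
scale, theta = fit_asset((4.0, 2.0), (20.0, 10.0))
print(round(scale, 3), round(theta, 3))
```

Because the toy footprint and plot share an aspect ratio, the search should settle near a scale of 5 with no rotation. Real plots are generally not axis-aligned rectangles, so an actual objective would need to account for polygonal boundaries.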

Datasets

The development of CityCraft was supported by the creation of two novel datasets:

  1. CityCraft-OSM: This dataset comprises 2D semantic layouts derived from OpenStreetMap (OSM) data, corresponding satellite imagery, and detailed annotations. It serves as the primary training data for the DiT-based layout generator.
  2. CityCraft-Buildings: A collection featuring thousands of diverse, high-quality 3D building assets. This library provides the necessary 3D models for the scene construction stage, enabling the generation of visually rich and varied urban environments.

Experimental Results and Evaluation

CityCraft was evaluated on its layout generation and final 3D scene generation capabilities against existing methods.

  • Layout Generation: The DiT-based layout generator achieved state-of-the-art performance on standard image generation metrics, reporting an FID score of 27.60 and a KID score of 0.022. These results indicate a significant improvement in the realism and diversity of generated layouts compared to prior techniques. User preference studies also heavily favored CityCraft layouts, yielding a score of 8.6.
  • Scene Generation: For geometric accuracy of the final 3D scenes, CityCraft reported a Depth Error (DE) and Camera Error (CE) of 0. This suggests high fidelity in the reconstruction pipeline, but the evaluation setup behind these metrics (e.g., comparison baselines, reference data) is not described here and should be checked against the paper. Qualitative assessments highlighted superior architectural diversity and overall realism. User preference for the final rendered scenes was high, reaching 9.2.
  • Ablation Studies: Ablations confirmed the value of conditional generation capabilities (text and ratio control), with users preferring the control offered despite similar FID/KID scores to unconditional models. Ratio control was particularly favored for its precise influence. The studies also validated the necessity of the multi-round refinement process in the LLM planning stage for achieving coherent urban plans.
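
For readers unfamiliar with the FID metric quoted above, its standard definition compares Gaussian fits to feature embeddings of real and generated images; lower values indicate closer distributions. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the real and generated sets. This is the general definition of the metric, not a detail specific to the paper's evaluation setup.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```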

Improvements Over Existing Methods

CityCraft offers several advantages compared to previous city generation techniques:

  • Enhanced Diversity and Quality: The use of DiT for layout generation and a large, diverse 3D asset library leads to more varied and realistic outputs compared to methods often constrained by the diversity of their training data or limited asset pools.
  • Strategic Planning: Incorporating an LLM for land-use planning introduces a level of semantic understanding and logical organization absent in methods that rely solely on generative models for layout creation without explicit planning.
  • Controllability: The framework provides multiple control points – class-ratios and text prompts for layout generation, and language guidelines for the LLM planner – offering users significant influence over the generated city's characteristics.
  • Integrated Pipeline: It presents an end-to-end solution, from conditional 2D layout generation through planned 3D asset placement and rendering.

In conclusion, CityCraft presents a comprehensive framework for 3D city generation that integrates diffusion models for layout synthesis, LLMs for strategic planning, and procedural techniques with a rich asset library for scene construction. Its quantitative results and user evaluations suggest significant advancements in terms of diversity, realism, and user control compared to prior art, supported by the contribution of two substantial datasets for training and asset provision (2406.04983).
