
UrbanWorld: An Urban World Model for 3D City Generation (2407.11965v2)

Published 16 Jul 2024 in cs.CV

Abstract: Cities, as the essential environment of human life, encompass diverse physical elements such as buildings, roads and vegetation, which continuously interact with dynamic entities like people and vehicles. Crafting realistic, interactive 3D urban environments is essential for nurturing AGI systems and constructing AI agents capable of perceiving, decision-making, and acting like humans in real-world environments. However, creating high-fidelity 3D urban environments usually entails extensive manual labor from designers, involving intricate detailing and representation of complex urban elements. Therefore, accomplishing this automatically remains a longstanding challenge. Toward this problem, we propose UrbanWorld, the first generative urban world model that can automatically create a customized, realistic and interactive 3D urban world with flexible control conditions. UrbanWorld incorporates four key stages in the generation pipeline: flexible 3D layout generation from OSM data or urban layout with semantic and height maps, urban scene design with Urban MLLM, controllable urban asset rendering via progressive 3D diffusion, and MLLM-assisted scene refinement. We conduct extensive quantitative analysis on five visual metrics, demonstrating that UrbanWorld achieves SOTA generation realism. Next, we provide qualitative results about the controllable generation capabilities of UrbanWorld using both textual and image-based prompts. Lastly, we verify the interactive nature of these environments by showcasing the agent perception and navigation within the created environments. We contribute UrbanWorld as an open-source tool available at https://github.com/Urban-World/UrbanWorld.

Citations (2)

Summary

  • The paper introduces UrbanWorld, the first generative urban world model capable of automatically creating realistic, customizable, and interactive embodied 3D urban environments with flexible controls for training AI agents.
  • UrbanWorld employs a four-stage pipeline using OSM data, fine-tuned Multimodal Large Language Models (MLLMs) for design and refinement, and controllable diffusion models for high-fidelity texture generation.
  • Qualitative and quantitative evaluations show UrbanWorld generates more diverse and realistic 3D urban scenes than existing methods, achieving improved metrics like 39.5% lower depth error and 11.8% higher realistic score, and is provided as an open-source tool.

UrbanWorld: A Generative Model for 3D Urban Environments

UrbanWorld addresses the need for realistic, interactive 3D urban environments for training AI agents by introducing a generative urban world model (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The model aims to reduce the manual labor involved in creating high-fidelity urban environments, which is traditionally a significant bottleneck. UrbanWorld is designed to generate customizable, controllable urban environments suitable for embodied agent learning, thereby facilitating advancements in embodied intelligence and AGI.

Methodology of UrbanWorld

The methodology is structured around a four-stage pipeline: OSM-guided urban layout generation, MLLM-empowered urban scene design, controllable diffusion-based urban asset texture rendering, and MLLM-assisted urban scene refinement.

OSM-Guided Urban Layout Generation

This stage leverages OpenStreetMap (OSM) data to automate the generation of 3D urban layouts (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). OSM data, which includes geographic locations and attributes of urban elements such as roads, buildings, and vegetation, is processed using Blender to create independent 3D objects. The center locations of these objects are recorded for subsequent asset reorganization, which enhances the efficiency of embodied environment construction.
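The bookkeeping in this stage can be sketched in plain Python. The snippet below extrudes OSM-style building footprints into simple block layouts and records each object's center location for later asset reorganization; the dict keys (`"footprint"`, `"height"`, `"name"`) are illustrative stand-ins, not the paper's actual schema, and the real pipeline does this inside Blender.

```python
# Sketch: turn OSM-style building footprints into a simple 3D block layout.
# Assumes each element is a dict with a polygonal footprint (list of (x, y)
# vertices) and a height attribute -- illustrative, not the paper's schema.

def polygon_centroid(points):
    """Area-weighted centroid of a simple (non-self-intersecting) polygon."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3 * area2), cy / (3 * area2))

def build_layout(elements):
    """Extrude each footprint to its height and record the center location,
    mirroring the asset-reorganisation bookkeeping described above."""
    layout = []
    for elem in elements:
        cx, cy = polygon_centroid(elem["footprint"])
        layout.append({
            "name": elem["name"],
            "center": (cx, cy),
            "height": elem.get("height", 10.0),  # fallback if OSM lacks height
        })
    return layout

buildings = [{"name": "b1", "footprint": [(0, 0), (4, 0), (4, 2), (0, 2)],
              "height": 30.0}]
print(build_layout(buildings)[0]["center"])  # (2.0, 1.0)
```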

MLLM-Empowered Urban Scene Design

A fine-tuned, urban-specific Multimodal LLM (Urban MLLM) is employed to plan and design urban scenes (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). Trained on a large dataset of urban street-view imagery with corresponding textual descriptions, the Urban MLLM generates detailed textual descriptions of urban elements based on user instructions and the OSM layout. This mimics the cognitive processes of human designers, ensuring visually coherent urban scenes. The MLLM is based on LLaVA-1.5 and fine-tuned on approximately 100,000 image-text pairs.
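The per-asset design step can be sketched as a prompt-assembly loop. The prompt wording and the `ask_mllm` callable below are hypothetical stand-ins (the paper does not publish its exact prompts); only the control flow — one design query per layout asset, conditioned on the user instruction — mirrors the stage described above.

```python
# Sketch: assembling per-asset design queries for the Urban MLLM.
# Prompt text and the ask_mllm interface are illustrative assumptions.

def design_prompt(user_instruction, asset):
    return (
        f"You are an urban designer. Overall scene requirement: {user_instruction}\n"
        f"Asset: a {asset['category']} at layout position {asset['center']}.\n"
        "Describe its appearance (materials, colours, style) in one sentence."
    )

def design_scene(user_instruction, layout, ask_mllm):
    """Return one textual description per asset, querying the MLLM for each."""
    return {a["name"]: ask_mllm(design_prompt(user_instruction, a))
            for a in layout}
```

A stubbed `ask_mllm` (e.g. one that echoes the prompt) is enough to exercise the loop; in the actual system the call goes to the fine-tuned LLaVA-1.5 model.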

Controllable Diffusion-Based Urban Asset Texture Rendering

This stage focuses on generating high-quality textures for the 3D assets using a controllable diffusion-based method involving UV texture generation and texture refinement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The method utilizes depth-aware ControlNet to control a 2D diffusion model, enabling the generation of images from different perspectives based on textual and visual prompts. UV position-aware texture refinement, also based on a diffusion model and ControlNet, is used to inpaint untextured areas, ensuring complete and natural textures.
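The refinement pass needs to know which UV texels were never covered by any rendered view. A minimal sketch of that mask computation, assuming the UV texture is an (H, W, 3) float array in which untouched texels still carry a sentinel fill colour (here black) — both assumptions are illustrative, not the paper's exact convention:

```python
import numpy as np

# Sketch: build the inpainting mask consumed by the UV position-aware
# refinement pass. Sentinel-colour convention is an assumption.

def untextured_mask(uv_texture, sentinel=(0.0, 0.0, 0.0), tol=1e-6):
    """True where the UV map is still untextured and needs inpainting."""
    return np.all(np.abs(uv_texture - np.asarray(sentinel)) < tol, axis=-1)

tex = np.ones((4, 4, 3))
tex[0, 0] = 0.0          # one texel missed by all rendered views
mask = untextured_mask(tex)
print(mask.sum())        # 1
```

In the full pipeline this mask, together with the UV position information, conditions the diffusion model (via ControlNet) so that only the uncovered regions are inpainted.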

MLLM-Assisted Urban Scene Refinement

In the final stage, the Urban MLLM scrutinizes the generated 3D urban scenes to identify inconsistencies and areas for improvement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The MLLM provides suggestions for refinement, and the rendering module is reactivated to update the scene based on the refined prompts. This iterative refinement process mimics the standard operation of human designers, ensuring alignment with real-world urban environments.
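The loop structure of this stage can be sketched as follows. `render`, `critique`, and `revise_prompts` are hypothetical stand-ins for the rendering module and the Urban MLLM calls; only the control flow — inspect, suggest, re-render until no issues remain — mirrors the stage described above.

```python
# Sketch of the MLLM-assisted refinement loop; the three callables are
# illustrative stand-ins, not the paper's API.

def refine_scene(prompts, render, critique, revise_prompts, max_rounds=3):
    scene = render(prompts)
    for _ in range(max_rounds):
        feedback = critique(scene)          # MLLM inspects the rendered scene
        if not feedback:                    # no remaining inconsistencies
            break
        prompts = revise_prompts(prompts, feedback)
        scene = render(prompts)             # re-render with refined prompts
    return scene
```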

Key Contributions of UrbanWorld

UrbanWorld makes several key contributions:

  • It is the first generative urban world model able to automatically create customized, realistic, and interactive 3D urban environments under flexible control conditions.
  • It combines OSM-guided layout generation, an urban-specific MLLM for scene design and iterative refinement, and controllable diffusion-based texture rendering in a single four-stage pipeline.
  • It demonstrates state-of-the-art generation realism across five visual metrics and supports controllable generation from both textual and image-based prompts.
  • It is released as an open-source tool at https://github.com/Urban-World/UrbanWorld.

Results and Performance Metrics

The paper presents both qualitative and quantitative results to validate the effectiveness of UrbanWorld. Visual comparisons with existing methods such as Infinicity, CityGen, and CityDreamer show that UrbanWorld generates more diverse 3D urban scenes with higher fidelity textures and better adherence to user instructions (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The paper asserts that UrbanWorld overcomes the limitations of existing methods, such as unclear textures, homogeneous styles, geometric distortions, and the lack of distinct urban functional characteristics.

Quantitative Metrics

The paper introduces three metrics for quantitative evaluation (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024):

  • Depth Error (DE): Measures the 3D geometry accuracy. UrbanWorld achieves a lower depth error compared to baselines.
  • Homogeneity Index: Measures the diversity of generated scenes. UrbanWorld achieves a lower homogeneity index, indicating higher diversity.
  • Realistic Score: Evaluates the realness and quality of generated urban elements using GPT-4o. UrbanWorld achieves a higher realistic score compared to baselines.

Compared with baselines, UrbanWorld reduced depth error by 39.5%, reduced the homogeneity index by 8.3%, and increased the realistic score by 11.8% (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024).
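The depth-error comparison can be made concrete with a short sketch. The exact DE formulation in the paper may differ; mean absolute error between predicted and reference depth maps is the usual choice, and the sample values below are illustrative, chosen only to reproduce the quoted 39.5% reduction.

```python
import numpy as np

# Sketch: a mean-absolute-error depth metric and the relative-improvement
# arithmetic quoted above. Sample values are illustrative assumptions.

def depth_error(pred, ref):
    return float(np.mean(np.abs(pred - ref)))

def relative_improvement(ours, baseline):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline

print(relative_improvement(ours=1.21, baseline=2.0))  # ~39.5 here
```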

Urban Generation Using Multistage Generation and MLLM-Assisted Refinement

UrbanWorld leverages multistage generation and MLLM-assisted refinement to overcome the limitations of existing methods for 3D urban generation. The four-stage pipeline allows for a modular and controllable approach to urban scene generation, with each stage focusing on a specific aspect of the process, from layout generation to texture rendering and refinement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). This enables more fine-grained control over the final result.

MLLM Integration

The integration of the Urban MLLM is critical due to its training on urban street-view imagery, which enables it to understand and reason about urban scenes effectively (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The MLLM generates detailed textual descriptions of urban elements, ensuring visual coherence and adherence to user instructions, functioning as an AI designer. Additionally, the MLLM's ability to scrutinize generated scenes and provide suggestions for improvement allows for an iterative refinement process that mimics the human design process.

On the rendering side, the controllable nature of diffusion models is exploited to create assets guided by textual and visual prompts. The two-stage approach to texture rendering and the use of ControlNet allow for greater control over the appearance of the assets.

In conclusion, UrbanWorld introduces a novel approach to 3D urban generation by combining OSM data, MLLMs, and diffusion models within a structured, multistage pipeline. The MLLM-assisted refinement process enhances the quality and realism of the generated urban environments, facilitating realistic and interactive simulations for AI research and applications.
