- The paper introduces UrbanWorld, the first generative urban world model capable of automatically creating realistic, customizable, and interactive embodied 3D urban environments with flexible controls for training AI agents.
- UrbanWorld employs a four-stage pipeline using OSM data, fine-tuned Multimodal Large Language Models (MLLMs) for design and refinement, and controllable diffusion models for high-fidelity texture generation.
- Qualitative and quantitative evaluations show that UrbanWorld generates more diverse and realistic 3D urban scenes than existing methods, including a 39.5% lower depth error and an 11.8% higher realistic score, and it is released as an open-source tool.
UrbanWorld: A Generative Model for 3D Urban Environments
UrbanWorld addresses the need for realistic, interactive 3D urban environments for training AI agents by introducing a generative urban world model (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The model aims to reduce the manual labor involved in creating high-fidelity urban environments, which is traditionally a significant bottleneck. UrbanWorld is designed to generate customizable, controllable urban environments suitable for embodied agent learning, thereby facilitating advancements in embodied intelligence and AGI.
Methodology of UrbanWorld
The methodology is structured around a four-stage pipeline: OSM-guided urban layout generation, MLLM-empowered urban scene design, controllable diffusion-based urban asset texture rendering, and MLLM-assisted urban scene refinement.
OSM-Guided Urban Layout Generation
This stage leverages OpenStreetMap (OSM) data to automate the generation of 3D urban layouts (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). OSM data, which includes geographic locations and attributes of urban elements such as roads, buildings, and vegetation, is processed using Blender to create independent 3D objects. The center locations of these objects are recorded for subsequent asset reorganization, which enhances the efficiency of embodied environment construction.
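The paper performs this step in Blender; the sketch below is a rough, illustrative stand-in (not the authors' code) that extrudes OSM building footprints into independent 3D meshes with osmnx and trimesh, recording each centroid for later asset reorganization. The place name, default building height, and output path are assumptions.

```python
# Illustrative stand-in for OSM-guided layout generation (the paper uses Blender).
import osmnx as ox
import trimesh
from shapely.geometry import Polygon

# Fetch building footprints and project to a metric CRS so heights are in meters.
buildings = ox.features_from_place("Greenwich Village, New York", tags={"building": True})
buildings = ox.projection.project_gdf(buildings)

meshes, centers = [], []
for _, row in buildings.iterrows():
    geom = row.geometry
    if not isinstance(geom, Polygon):
        continue  # this sketch skips multipolygons and point features
    try:
        height = float(row.get("height"))
    except (TypeError, ValueError):
        height = 10.0  # assumed default for buildings without a height tag
    meshes.append(trimesh.creation.extrude_polygon(geom, height))
    centers.append(geom.centroid.coords[0])  # center recorded for reorganization

# Each mesh remains an independent object in the exported scene.
trimesh.Scene(meshes).export("urban_layout.glb")
```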
MLLM-Empowered Urban Scene Design
A fine-tuned, urban-specific Multimodal LLM (Urban MLLM) is employed to plan and design urban scenes (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). Trained on a large dataset of urban street-view imagery with corresponding textual descriptions, the Urban MLLM generates detailed textual descriptions of urban elements based on user instructions and the OSM layout. This mimics the cognitive processes of human designers, ensuring visually coherent urban scenes. The MLLM is based on LLaVA-1.5 and fine-tuned on approximately 100,000 image-text pairs.
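A minimal sketch of this design step, assuming the publicly available base LLaVA-1.5 checkpoint in place of the paper's fine-tuned Urban MLLM (which would be a drop-in replacement); the prompt wording and file name are illustrative.

```python
# Query a LLaVA-1.5-style model for urban element descriptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # swap in the fine-tuned Urban MLLM here
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

layout_render = Image.open("osm_layout_render.png")  # render of the OSM layout
prompt = (
    "USER: <image>\nThis is the layout of a commercial district. For each building, "
    "write a detailed appearance description (materials, colors, architectural "
    "style) suitable as a text prompt for texture generation. ASSISTANT:"
)
inputs = processor(images=layout_render, text=prompt, return_tensors="pt")
inputs = inputs.to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```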
Controllable Diffusion-Based Urban Asset Texture Rendering
This stage focuses on generating high-quality textures for the 3D assets using a controllable diffusion-based method involving UV texture generation and texture refinement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The method utilizes depth-aware ControlNet to control a 2D diffusion model, enabling the generation of images from different perspectives based on textual and visual prompts. UV position-aware texture refinement, also based on a diffusion model and ControlNet, is used to inpaint untextured areas, ensuring complete and natural textures.
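The core of this stage can be sketched with the diffusers library: a depth map rendered from the untextured asset conditions a depth-aware ControlNet, while the Urban MLLM's description serves as the text prompt. The model IDs below are standard public checkpoints and the file names are assumptions; the paper's UV unwrapping, multi-view fusion, and UV-position-aware inpainting are omitted here.

```python
# Depth-conditioned texture image generation for one viewpoint of an asset.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth = Image.open("asset_depth_view0.png")  # depth map of the untextured asset
prompt = "a modern glass office tower, photorealistic facade, daytime"  # from the MLLM

image = pipe(prompt, image=depth, num_inference_steps=30).images[0]
image.save("asset_texture_view0.png")  # later projected back onto the UV map
```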
MLLM-Assisted Urban Scene Refinement
In the final stage, the Urban MLLM scrutinizes the generated 3D urban scenes to identify inconsistencies and areas for improvement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The MLLM provides suggestions for refinement, and the rendering module is reactivated to update the scene based on the refined prompts. This iterative refinement process mimics the standard operation of human designers, ensuring alignment with real-world urban environments.
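The loop structure of this stage can be summarized as below; the callables passed in (rendering, MLLM critique, retexturing) are hypothetical placeholders standing in for the paper's rendering module and Urban MLLM.

```python
# Skeleton of the iterative MLLM-assisted refinement loop (interfaces assumed).
def refine_scene(scene, render_fn, critique_fn, retexture_fn, max_rounds=3):
    """render_fn(scene) -> list of view images;
    critique_fn(views) -> (is_satisfactory: bool, refined_prompts: dict);
    retexture_fn(scene, refined_prompts) -> updated scene."""
    for _ in range(max_rounds):
        views = render_fn(scene)                      # render the current scene
        is_ok, refined_prompts = critique_fn(views)   # MLLM flags inconsistencies
        if is_ok:
            break
        scene = retexture_fn(scene, refined_prompts)  # re-run the rendering stage
    return scene
```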
Key Contributions of UrbanWorld
UrbanWorld makes several key contributions:
- It is presented as the first generative urban world model capable of automatically creating realistic, customizable, and interactive embodied 3D urban environments with flexible controls.
- It introduces a four-stage automatic pipeline combining OSM data, a fine-tuned urban-specific MLLM for design and refinement, and controllable diffusion models for high-fidelity texture generation.
- It outperforms existing methods such as Infinicity, CityGen, and CityDreamer in both qualitative and quantitative evaluations.
- It is released as an open-source tool for constructing embodied environments for AI agents.
Results and Performance Metrics
The paper presents both qualitative and quantitative results to validate the effectiveness of UrbanWorld. Visual comparisons with existing methods such as Infinicity, CityGen, and CityDreamer show that UrbanWorld generates more diverse 3D urban scenes with higher fidelity textures and better adherence to user instructions (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). The paper asserts that UrbanWorld overcomes the limitations of existing methods, such as unclear textures, homogeneous styles, geometric distortions, and the lack of distinct urban functional characteristics.
Quantitative Metrics
The paper introduces three metrics for quantitative evaluation (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024):
- Depth Error (DE): Measures 3D geometric accuracy. UrbanWorld achieves a lower depth error than the baselines.
- Homogeneity Index: Measures the diversity of generated scenes. UrbanWorld achieves a lower homogeneity index, indicating higher diversity.
- Realistic Score: Evaluates the realism and quality of generated urban elements, as judged by GPT-4o. UrbanWorld achieves a higher realistic score than the baselines.
Overall, UrbanWorld achieves a 39.5% reduction in depth error, an 8.3% reduction in the homogeneity index, and an 11.8% increase in realistic score relative to the baselines (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024).
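The paper states that GPT-4o judges realism, but the exact prompt and scale are not given in this summary; a plausible sketch of such a judge, with an assumed 0-10 scale and prompt wording, might look like this:

```python
# Hypothetical GPT-4o-based realism scoring of a rendered urban scene.
import base64
from openai import OpenAI

client = OpenAI()

def realistic_score(image_path: str) -> float:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Rate the realism of this rendered urban scene on a "
                         "0-10 scale. Reply with only the number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return float(resp.choices[0].message.content.strip())
```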
Multistage Generation and MLLM-Assisted Refinement
UrbanWorld leverages multistage generation and MLLM-assisted refinement to overcome the limitations of existing methods for 3D urban generation. The four-stage pipeline allows for a modular and controllable approach to urban scene generation, with each stage focusing on a specific aspect of the process, from layout generation to texture rendering and refinement (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). This enables more fine-grained control over the final result.
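This modularity is easiest to see in a top-level driver; every function below is an illustrative placeholder for the corresponding stage rather than the paper's actual API.

```python
# Hypothetical top-level view of the four-stage pipeline.
def generate_urban_world(osm_file, instruction,
                         build_layout, design_scene, render_textures, refine):
    scene = build_layout(osm_file)              # Stage 1: OSM-guided layout
    prompts = design_scene(scene, instruction)  # Stage 2: Urban MLLM design
    scene = render_textures(scene, prompts)     # Stage 3: diffusion texturing
    return refine(scene)                        # Stage 4: MLLM refinement
```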
MLLM Integration
The integration of the Urban MLLM is critical: its training on urban street-view imagery enables it to understand and reason about urban scenes effectively (UrbanWorld: An Urban World Model for 3D City Generation, 16 Jul 2024). Functioning as an AI designer, the MLLM generates detailed textual descriptions of urban elements, ensuring visual coherence and adherence to user instructions. Its ability to scrutinize generated scenes and suggest improvements also enables an iterative refinement process that mimics the human design workflow.
On the rendering side, the controllable nature of diffusion models is exploited to create assets guided by textual and visual prompts. The two-stage approach to texture rendering and the use of ControlNet allow for fine-grained control over the appearance of each asset.
In conclusion, UrbanWorld introduces a novel approach to 3D urban generation by combining OSM data, MLLMs, and diffusion models within a structured, multistage pipeline. The MLLM-assisted refinement process enhances the quality and realism of the generated urban environments, facilitating realistic and interactive simulations for AI research and applications.