Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion (2407.13759v2)

Published 18 Jul 2024 in cs.CV and cs.GR

Abstract: We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.

Citations (5)

Summary

  • The paper presents an autoregressive video diffusion approach that maintains visual coherence in long urban street view sequences.
  • It leverages conditional generation with overhead layout data and temporal imputation techniques to control scene configurations accurately.
  • Evaluations show significantly lower FID and KID scores than competing methods, indicating promise for virtual urban planning and AR applications.

Overview of Streetscapes Paper

The paper "Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion" introduces a novel approach for generating coherent and realistic sequences of street views across extensive urban scenes. Leveraging a combination of video diffusion models, control mechanisms based on scene layouts, and innovative autoregressive techniques, the proposed system, termed "Streetscapes," is capable of synthesizing street views that maintain visual quality and consistency over several city blocks.

Core Contributions

  1. Long-range Urban Scene Generation: Existing generative models often falter when tasked with producing extensive, coherent outputs. The Streetscapes framework addresses this limitation by synthesizing consistent sequences of street views along user-defined pathways, maintaining high quality and visual coherence across extended trajectories.
  2. Autoregressive Video Diffusion (AVD): The authors introduce a new temporal imputation technique within an autoregressive framework, enhancing the model's capability to preserve consistency and realism over long sequences. This imputation mechanism is critical in preventing the generated imagery from drifting away from the manifold of realistic city imagery.
  3. Conditional Generation: Streetscapes is conditioned on overhead scene layouts, including street maps and corresponding height maps, allowing fine-grained control over the generated sequences. This conditioning lets users specify aspects of the output such as city layout, camera trajectory, and even characteristics like geographic style and weather via text prompts (a sketch of these conditioning inputs follows this list).
  4. Comprehensive Dataset: The model is trained on a substantial dataset comprising 1.5 million images from Google Street View, spanning four major cities. This extensive training data includes both posed imagery and corresponding map data, enabling the generation of diverse urban environments.
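To make the conditioning interface concrete, the sketch below bundles the inputs described above into a single structure. This is a minimal illustration assuming a PyTorch pipeline; the StreetscapeCondition type and its field names and shapes are assumptions for exposition, not the authors' actual interface.

```python
from dataclasses import dataclass

import torch


@dataclass
class StreetscapeCondition:
    """Hypothetical bundle of the conditioning signals described above;
    field names and shapes are assumptions, not the paper's interface."""
    street_map: torch.Tensor    # (H, W) overhead layout, e.g. road/building mask
    height_map: torch.Tensor    # (H, W) per-cell building height
    camera_poses: torch.Tensor  # (N, 4, 4) camera-to-world matrices along the trajectory
    text_prompt: str            # e.g. "London, overcast, light rain"
```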

Methodology

The core of the Streetscapes system is a video diffusion model integrated with a ControlNet for layout and camera pose control. The ControlNet processes scene layout information, transforming it from map space into screen space through G-buffers. This transformation ensures that the generated images align with the desired scene configurations. Additionally, the model incorporates a two-frame generation module adapted from AnimateDiff to facilitate joint generation of consecutive frames, ensuring temporal coherence.
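The paper's renderer is not spelled out here, but the map-to-screen transformation can be illustrated with a minimal depth-only G-buffer pass: lift each map cell to a 3D point using the height map, project it through a pinhole camera, and keep the nearest depth per pixel. Everything below (the function name, the point-splat renderer, the intrinsics K) is an illustrative assumption; a real G-buffer would carry additional channels such as semantics or normals.

```python
import torch


def render_depth_gbuffer(height_map, K, cam_to_world, hw=(256, 256)):
    """Hypothetical map-space -> screen-space pass: splat one 3D point per
    map cell into a per-pixel depth buffer (depth-only G-buffer)."""
    H_map, W_map = height_map.shape
    xs, zs = torch.meshgrid(
        torch.arange(W_map, dtype=torch.float32),
        torch.arange(H_map, dtype=torch.float32),
        indexing="xy",
    )
    # Ground-plane grid lifted by the height map: world points (x, y=h, z).
    pts_world = torch.stack([xs, height_map.float(), zs], dim=-1).reshape(-1, 3)

    # World -> camera, then pinhole projection with intrinsics K.
    world_to_cam = torch.linalg.inv(cam_to_world)
    ones = torch.ones(pts_world.shape[0], 1)
    pts_cam = (world_to_cam @ torch.cat([pts_world, ones], dim=1).T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]  # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Z-buffer splat: the nearest depth wins at each pixel.
    H, W = hw
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = torch.full((H * W,), float("inf"))
    depth.scatter_reduce_(0, v[inside] * W + u[inside], pts_cam[inside, 2], reduce="amin")
    depth = depth.view(H, W)
    depth[torch.isinf(depth)] = 0.0  # empty pixels become background
    return depth
```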

The paper introduces a robust autoregressive mechanism to enable long-range video generation. By utilizing a temporal imputation method, the model iteratively generates frames while conditioning on preceding frames. This method ensures that the generated sequence remains coherent and realistic over extended trajectories. The temporal imputation approach is further enhanced by techniques such as Resample and WarpInit, which help to mitigate inconsistencies and enforce continuity.
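A minimal sketch of this loop appears below, assuming a diffusers-style scheduler (add_noise, step) and a denoiser that jointly processes a two-frame window, matching the two-frame module described above. Resample and WarpInit are only noted in comments, and all names here are illustrative rather than the authors' code.

```python
import torch


@torch.no_grad()
def generate_streetscape(denoiser, scheduler, first_frame, n_frames):
    """Illustrative autoregressive sampling with temporal imputation.

    Each iteration denoises a two-frame window [known previous frame, new
    frame]. At every diffusion step the known slot is overwritten
    ("imputed") with a copy of the previous frame noised to the current
    level, anchoring the new frame to the sequence and limiting drift.
    """
    frames = [first_frame]
    for _ in range(n_frames - 1):
        prev = frames[-1]
        # WarpInit (not shown): the paper initializes the new frame by
        # warping the previous frame into the new view; pure noise here.
        x = torch.randn(2, *prev.shape)
        for t in scheduler.timesteps:
            # Temporal imputation: re-noise the known frame to level t and
            # substitute it into its slot before each denoising step.
            x[0] = scheduler.add_noise(prev, torch.randn_like(prev), t)
            eps = denoiser(x, t)  # joint prediction over both frames
            x = scheduler.step(eps, t, x).prev_sample
            # Resample (not shown): repeating each step with a fresh
            # imputation further harmonizes the two frames.
        frames.append(x[1])
    return torch.stack(frames)
```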

Results and Evaluation

The authors present a comprehensive evaluation of the Streetscapes system, demonstrating its superiority over state-of-the-art methods like InfiniCity and InfiniteNature-Zero (InfNat0) in generating realistic and consistent street views. Both qualitative and quantitative assessments highlight the system's ability to maintain high visual fidelity across long sequences.

Quantitative results are particularly compelling, with the model achieving significantly lower FID and KID scores compared to competing methods, indicating a closer match to the distribution of real street view images. The evaluation also includes a thorough analysis of near-range accuracy and long-range quality, underscoring the effectiveness of the temporal imputation technique in preserving realism over extended video sequences.
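As a side note, FID and KID can be reproduced with off-the-shelf tooling. The snippet below uses torchmetrics on random stand-in tensors; it illustrates the metric computation only, not the paper's actual data or evaluation protocol.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Stand-ins for real Street View frames and generated frames (uint8, NCHW).
real = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)    # needs the torch-fidelity extra
kid = KernelInceptionDistance(subset_size=100)  # subset_size <= sample count

for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print(f"FID: {fid.compute().item():.2f}")  # lower = closer to the real distribution
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean.item():.4f} ± {kid_std.item():.4f}")
```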

Implications and Future Directions

The Streetscapes system represents a significant advancement in the field of large-scale scene generation. By successfully addressing the challenges of maintaining coherence over long trajectories and enabling detailed control over generated content, this work paves the way for numerous practical applications. Potential use cases include virtual urban exploration, digital city planning, and augmented reality (AR) experiences.

Future developments in this domain could explore several exciting directions:

  • Enhanced Object Modeling: Incorporating explicit modeling of transient objects, such as vehicles and pedestrians, to further improve the realism of the generated scenes.
  • Longer Sequence Generation: Extending the autoregressive capabilities to handle even longer sequences, possibly leveraging recent video generation advances such as WALT and Lumiere.
  • Multimodal Integration: Combining street view data with additional modalities like LiDAR or other 3D sensing technologies to enhance spatial accuracy and consistency.

In conclusion, the Streetscapes framework sets a new benchmark in the field of urban scene generation, demonstrating that video diffusion models, coupled with innovative autoregressive techniques, can achieve high-quality, consistent outputs over large spatial extents. This work not only addresses existing limitations but also opens avenues for future research and practical applications in urban visualization and beyond.