- StarGen's main contribution is the integration of spatiotemporal autoregression with a pre-trained video diffusion model for long-range, coherent scene generation.
- The methodology generates overlapping video clips conditioned on spatially adjacent and temporally overlapping images, with a large 3D reconstruction model supplying structural guidance to preserve scene fidelity.
- The model outperforms state-of-the-art approaches on PSNR, SSIM, and LPIPS for short-range generation and on FID for long-range generation, demonstrating both scalability and versatility.
Overview of StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model
The paper introduces StarGen, a novel framework for scalable and controllable scene generation through spatiotemporal autoregression. It addresses long-range scene generation, a setting in which large reconstruction and generative models are typically constrained by limited computational resources and therefore cannot process an entire scene in a single pass. The proposed framework instead applies a pre-trained video diffusion model autoregressively over bounded windows, enforcing spatiotemporal consistency across the generation process.
Methodology
StarGen achieves long-range scene generation by conditioning each generation step on both spatially adjacent and temporally overlapping images. The method involves:
- Spatiotemporal Autoregression: The scene is generated as a sequence of overlapping windowed video clips, each conditioned on images produced in previous windows so that consistency is maintained across a long trajectory (see the sketch after this list).
- Integration of Large Reconstruction Models: A large reconstruction model extracts 3D structural information from spatially adjacent images; this information conditions the generation of novel views through a video diffusion model combined with ControlNet.
- Versatile Generation Tasks: StarGen is demonstrated on three distinct tasks: sparse view interpolation, perpetual view generation, and layout-conditioned city generation, underscoring its applicability to a broad range of scene generation problems.
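To make the windowed autoregression concrete, here is a minimal Python sketch of the conditioning pattern. Everything in it is hypothetical: the `diffusion_model` and `recon_model` interfaces, the window sizes, and all function names are illustrative assumptions, not StarGen's actual API.

```python
# Minimal sketch of spatiotemporal autoregressive generation over
# overlapping windows. All interfaces and names are hypothetical;
# they illustrate the conditioning pattern, not StarGen's real code.

def generate_long_trajectory(diffusion_model, recon_model, poses,
                             window_size=16, overlap=4):
    """Generate a long camera trajectory as overlapping video clips.

    poses: camera poses along the full trajectory.
    Each window reuses its last `overlap` frames as the temporal
    condition for the next window, while the reconstruction model
    provides spatial (3D) conditioning from previously generated views.
    """
    frames = []            # all frames generated so far
    temporal_cond = None   # overlapping frames from the previous window
    step = window_size - overlap
    for start in range(0, len(poses) - overlap, step):
        window_poses = poses[start:start + window_size]
        # Spatial conditioning: 3D features lifted from spatially
        # adjacent generated images (empty for the very first window).
        spatial_cond = (recon_model.extract_features(frames, window_poses)
                        if frames else None)
        clip = diffusion_model.sample(poses=window_poses,
                                      temporal_cond=temporal_cond,
                                      spatial_cond=spatial_cond)
        # Keep only the new frames; the overlapping ones already exist.
        frames.extend(clip if not frames else clip[overlap:])
        temporal_cond = clip[-overlap:]
    return frames
```

The key property is that per-window computation stays fixed regardless of trajectory length, while the overlap ties every new clip back to already-generated content.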
The methodology is supported by a large-scale reconstruction model and a causal compression network, which together enable the video diffusion model to generate novel views with scene continuity. A sketch of the ControlNet-style conditioning follows.
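Below is a minimal PyTorch sketch of the general ControlNet recipe (a trainable encoder copy whose outputs enter the frozen denoiser through zero-initialized projections). The class and argument names are illustrative assumptions about how reconstruction features could be injected, not the paper's implementation.

```python
import torch.nn as nn

class ControlBranch(nn.Module):
    """Illustrative ControlNet-style side branch (names are hypothetical).

    A trainable copy of the diffusion UNet's encoder ingests features
    produced by the reconstruction model and injects them, via
    zero-initialized projections, into the frozen denoiser's skips.
    """
    def __init__(self, encoder_copy, feature_dims):
        super().__init__()
        self.encoder = encoder_copy  # trainable copy of the UNet encoder
        # Zero-initialized 1x1 convs so training starts from the frozen
        # model's behavior, as in the original ControlNet recipe.
        self.zero_convs = nn.ModuleList(
            nn.Conv2d(d, d, kernel_size=1) for d in feature_dims)
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, recon_features, timestep):
        # Per-scale residuals to be added to the frozen UNet's skip features.
        skips = self.encoder(recon_features, timestep)
        return [conv(s) for conv, s in zip(self.zero_convs, skips)]
```

The zero initialization is the important design choice: at the start of fine-tuning the branch contributes nothing, so the pre-trained diffusion model's behavior is preserved and conditioning is learned gradually.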
Quantitative and Qualitative Evaluation
StarGen is evaluated both qualitatively and quantitatively. The results show significant improvements over existing state-of-the-art methods: on PSNR, SSIM, and LPIPS for short-range generation, and on FID for long-range scenarios. Notably, StarGen exhibits superior scalability and pose accuracy, maintaining scene fidelity over extended trajectories where competing approaches degrade.
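For reference, these metrics are standard and can be reproduced with off-the-shelf implementations. The sketch below uses torchmetrics on dummy tensors; it illustrates how such numbers are typically computed and is not the paper's evaluation code. PSNR, SSIM, and LPIPS compare paired frames, whereas FID compares feature distributions of real versus generated frames, which is why it suits long-range generation without per-frame ground truth.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

# Dummy batches standing in for generated and ground-truth frames,
# shaped (N, 3, H, W) with values in [0, 1].
preds = torch.rand(8, 3, 256, 256)
target = torch.rand(8, 3, 256, 256)

# Short-range, reference-based metrics (paired comparison).
psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type='alex', normalize=True)
print(f"PSNR:  {psnr(preds, target):.2f} dB")
print(f"SSIM:  {ssim(preds, target):.4f}")
print(f"LPIPS: {lpips(preds, target):.4f}")

# Long-range, distribution-level metric. In practice FID is computed
# over many frames; the tiny batch here is for illustration only.
fid = FrechetInceptionDistance(feature=64, normalize=True)
fid.update(target, real=True)
fid.update(preds, real=False)
print(f"FID:   {fid.compute():.2f}")
```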
Implications and Future Directions
StarGen has implications for both practical applications and theoretical advances in scene generation and AI-driven content creation. Practically, its ability to synthesize coherent, high-quality scenes over extended spatial domains can benefit gaming, virtual reality, and autonomous navigation, among other applications. Theoretically, the work opens directions for refining spatiotemporal conditioning and for tighter integration of diffusion models with advanced 3D reconstruction techniques.
The paper notes a limitation in handling large loops, where generated content can drift over extended sequences. The authors propose addressing this by exploring absolute constraints and 3D reconstruction of the generated scenes, a direction that promises greater robustness and practical utility in complex three-dimensional environments.
In conclusion, StarGen advances autoregressive scene generation by combining contemporary diffusion models with established reconstruction techniques, pushing the boundary of what is computationally feasible for generating richly detailed, cohesive virtual scenes.