Stable Virtual Camera: Generative View Synthesis with Diffusion Models (2503.14489v2)

Published 18 Mar 2025 in cs.CV

Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. Project page with code and model: https://stable-virtual-camera.github.io/.

Authors (9)
  1. Hang Gao (61 papers)
  2. Vikram Voleti (25 papers)
  3. Aaryaman Vasishta (3 papers)
  4. Chun-Han Yao (13 papers)
  5. Mark Boss (12 papers)
  6. Philip Torr (172 papers)
  7. Christian Rupprecht (90 papers)
  8. Varun Jampani (125 papers)
  9. Jensen Zhou (1 paper)

Summary

Stable Virtual Camera and Generative View Synthesis

In "Stable Virtual Camera: Generative View Synthesis with Diffusion Models," the authors present a diffusion-based approach to novel view synthesis (NVS). Unlike traditional methods that depend on dense input views, this generative approach produces realistic images from arbitrary camera poses given only sparse inputs.

Stable Virtual Camera (Seva) is a generalist diffusion model that takes any number of input views and target cameras and generates novel views of the scene, excelling in scenarios that require large viewpoint changes and temporally smooth transitions without sacrificing consistency. It marks clear progress over previous models, adapting to varied NVS tasks without requiring additional 3D representation-based distillation.

Model Design and Training Strategy

The diffusion architecture builds on Stable Diffusion 2.1 and adds conditioning mechanisms such as Plücker ray embeddings for camera poses and CLIP image embeddings to capture semantic context. A structured training strategy strengthens this architecture further: a two-stage curriculum exposes the model to view synthesis tasks ranging from interpolating small viewpoint changes to generating long, large-viewpoint sequences, and to differing numbers of input and target views, which improves robustness in real-world applications.
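
As a concrete illustration of the camera conditioning, the sketch below computes a per-pixel Plücker ray embedding (ray direction plus moment) for a single camera. The function name and conventions here (pixel-center sampling, camera-to-world pose) are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker ray embedding (direction, moment) for one camera.

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world pose
    Returns an (H, W, 6) array holding the normalized ray direction d and
    the moment o x d, where o is the camera center.
    """
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project to camera-space ray directions, then rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T                          # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Ray origin is the camera center; the Plücker moment is o x d.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)

    return np.concatenate([dirs_world, moment], axis=-1)         # (H, W, 6)
```

Because the embedding depends only on the ray geometry, it gives the network a dense, pose-aware signal per pixel without committing to any particular 3D scene representation.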

Sampling Novel Views with Procedural Strategies

At test time, sampling follows a two-pass procedure that balances consistency and flexibility: an initial pass generates anchor frames, and a second pass fills in the remaining views via interpolation or nearest-neighbor conditioning, depending on the task (e.g., trajectory NVS versus set NVS). This procedural strategy lets the model produce temporally smooth videos along arbitrary camera paths.
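
A minimal sketch of such a two-pass sampler, under stated assumptions: a hypothetical `model.sample(context=..., cameras=...)` interface, cameras represented as dictionaries with a "position" key, and illustrative choices for the anchor stride, chunk size, and nearest-anchor selection.

```python
import numpy as np

def nearest_anchors(anchor_positions, query_positions, k=2):
    """Indices of the k anchor cameras closest to each query camera."""
    d = np.linalg.norm(query_positions[:, None] - anchor_positions[None], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def two_pass_sampling(model, input_views, target_cams, anchor_stride=8, chunk=8):
    """Sketch of a two-pass procedural sampler.

    Pass 1 generates sparse anchor frames conditioned on the input views;
    pass 2 fills in the remaining target cameras chunk by chunk, conditioning
    each chunk on its nearest anchors so long trajectories stay consistent.
    """
    positions = np.array([c["position"] for c in target_cams])

    # Pass 1: anchor frames at a coarse stride along the camera path.
    anchor_idx = np.arange(0, len(target_cams), anchor_stride)
    anchors = model.sample(context=input_views,
                           cameras=[target_cams[i] for i in anchor_idx])

    # Pass 2: fill in between anchors, one chunk at a time.
    frames = []
    for start in range(0, len(target_cams), chunk):
        cams = target_cams[start:start + chunk]
        near = nearest_anchors(positions[anchor_idx],
                               positions[start:start + chunk])
        context = input_views + [anchors[i] for i in np.unique(near)]
        frames.extend(model.sample(context=context, cameras=cams))
    return frames
```

The key design point is that later chunks never need to attend to every previously generated frame; conditioning on a few nearby anchors is enough to keep long or looping trajectories globally consistent.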

Benchmark and Evaluation

A comprehensive benchmark across datasets and experimental configurations shows marked improvements for Stable Virtual Camera on PSNR and other standard metrics. In particular, the model outperforms existing methods on large-viewpoint NVS, surpassing CAT3D by +1.5 dB, underscoring its strong generation capacity. Results in semi-dense and sparse-view setups show similar advantages in photorealism and flexibility, and the model handles up to 32 input views, demonstrating robustness across data regimes.
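
For context on the headline metric: PSNR is a log-scaled inverse of mean squared error, so a +1.5 dB gain reflects a substantial reduction in per-pixel reconstruction error. A minimal implementation, assuming images scaled to [0, max_val]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```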

Implications and Future Directions

The implications of this research are multi-faceted, with practical applications spanning virtual cinematography, gaming, and digital preservation. The ability to generate high-quality, consistent views substantially enriches immersive environments. The streamlined approach, free of complex distillation and auxiliary 3D representations, sets a new standard for generative NVS and eases integration into broader AI ecosystems. Future work may extend such diffusion-based models to dynamic scene synthesis, leveraging their generative power in more complex real-world scenarios, and addressing limitations around domain-specific training would further broaden their impact. The continued evolution of these models points toward real-time adaptive learning in dynamic environments.
