- The paper presents a feed-forward method that combines 3D Gaussian Splatting with a video diffusion model to synthesize 360° scenes from sparse views.
- It integrates geometric cues with latent diffusion to produce photorealistic novel views, outperforming prior approaches that depend on dense inputs.
- Evaluation on the DL3DV-10K and RealEstate10K datasets shows notable improvements in FID and other perceptual metrics, supporting robust real-world applications.
Analysis of MVSplat360: Feed-Forward 360° Scene Synthesis from Sparse Views
The paper presents MVSplat360, a novel approach to high-quality 360° scene synthesis from a minimal set of sparse views. Traditional novel view synthesis (NVS) faces substantial challenges in this sparse setting: input views overlap only slightly and large portions of the scene are occluded, leaving little visual information to work with. Conventional methods, such as NeRF and other per-scene optimization techniques, typically demand dense input data, which renders them impractical for casual capture. MVSplat360 offers a solution by merging geometry-aware 3D reconstruction with a temporally consistent video generation model.
The method uses a geometry-aware representation, 3D Gaussian Splatting (3DGS), to reconstruct coarse geometric structure from the sparse observations within a single feed-forward network. MVSplat360 then capitalizes on the prior knowledge embedded in a large-scale pre-trained Stable Video Diffusion (SVD) model: features rendered from the coarse 3DGS reconstruction are projected into the SVD latent space and used to guide the denoising process, so that the video model's temporal coherence translates into multi-view consistency and photorealistic outputs.
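The following is a minimal sketch of how such a two-stage pipeline could be wired together, based on the description above rather than on the authors' code. `FeedForwardGaussianBackbone`, `SVDDenoiser`, and the simplified denoising loop are illustrative placeholders for the actual 3DGS backbone and the pre-trained SVD denoiser; only the shapes and the conditioning pattern are meant to be instructive.

```python
# Hedged sketch of a 3DGS-conditioned video-diffusion pipeline (not the authors' implementation).
import torch
import torch.nn as nn


class FeedForwardGaussianBackbone(nn.Module):
    """Placeholder for the geometry-aware feed-forward 3DGS reconstruction network."""

    def __init__(self, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=3, padding=1)

    def forward(self, sparse_views: torch.Tensor, target_poses: torch.Tensor) -> torch.Tensor:
        # Real model: predict per-pixel Gaussians from the sparse views, then splat them
        # along every target pose. Here we only produce a coarse latent feature map
        # of the right shape for each target view.
        b, n_tgt = target_poses.shape[:2]
        feats = self.encoder(sparse_views.flatten(0, 1)).mean(dim=0)  # (C, H, W), pooled over views
        return feats.unsqueeze(0).unsqueeze(0).expand(b, n_tgt, -1, -1, -1)


class SVDDenoiser(nn.Module):
    """Placeholder for the pre-trained Stable Video Diffusion latent denoiser."""

    def __init__(self, latent_dim: int = 4):
        super().__init__()
        self.net = nn.Conv3d(2 * latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latents: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # Condition the denoising of the target-view "video" on the coarse 3DGS renderings
        # so that geometry stays consistent across frames/views.
        x = torch.cat([noisy_latents, condition], dim=2)     # (B, T, 2C, H, W)
        return self.net(x.transpose(1, 2)).transpose(1, 2)   # back to (B, T, C, H, W)


def synthesize_360(sparse_views: torch.Tensor, target_poses: torch.Tensor, steps: int = 4) -> torch.Tensor:
    backbone, denoiser = FeedForwardGaussianBackbone(), SVDDenoiser()
    coarse = backbone(sparse_views, target_poses)   # coarse latent per target view
    latents = torch.randn_like(coarse)              # start the target "video" from noise
    for _ in range(steps):                          # heavily simplified denoising loop
        latents = latents - 0.5 * denoiser(latents, coarse)
    return latents                                  # in practice, decode with the SVD VAE


views = torch.randn(1, 5, 3, 64, 64)       # a handful of sparse input views
poses = torch.randn(1, 14, 4, 4)           # target camera poses, e.g. one short SVD clip
print(synthesize_360(views, poses).shape)  # torch.Size([1, 14, 4, 64, 64])
```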
Key contributions of MVSplat360 include:
- Problem Formulation: MVSplat360 addresses the problem of generating novel 360° views from sparse, widely separated observations, a scenario common in real-world applications where capturing dense coverage of a scene is not feasible.
- Integration of 3DGS and SVD: The integration of 3DGS and the SVD model is designed to leverage geometric cues while refining the visual appearance of generated views. This combination lets the model plausibly fill in unobserved regions, significantly outperforming prior scene-level synthesis methods that rely solely on dense inputs.
- Benchmarking and Evaluation: The paper provides an extensive evaluation on the DL3DV-10K and RealEstate10K datasets, where MVSplat360 demonstrates superior performance across both pixel-aligned and perceptual metrics. Notable improvements in Fréchet Inception Distance (FID) underscore its ability to generate images that not only look plausible but also align closely with real-world image distributions (a sketch of such an FID evaluation follows this list).
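To make the evaluation protocol concrete, here is a hedged sketch of computing FID between rendered and ground-truth target views with torchmetrics (which relies on the torch-fidelity backend). The random tensors are stand-ins for real image batches; nothing here reproduces the paper's actual data or numbers.

```python
# Hedged sketch: FID between ground-truth and rendered views via torchmetrics.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim InceptionV3 features

# Stand-in batches: (N, 3, H, W) uint8 images in [0, 255].
real_views = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
rendered_views = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid.update(real_views, real=True)          # accumulate ground-truth statistics
fid.update(rendered_views, real=False)     # accumulate rendered-view statistics
print(f"FID: {fid.compute().item():.2f}")  # lower is better
```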
Theoretical and practical implications arising from MVSplat360's contributions are substantial. The framework demonstrates the potential to generalize NVS approaches beyond object-level synthesis to more complex scene-level tasks where sparse input data is the norm. The ability to produce viable results with as few as three viewpoints suggests a robustness and flexibility not present in earlier models.
The implications for the future development of AI-based scene synthesis are significant. The paper lays a foundation for further exploration of sparse-input-driven models, pointing toward applications in AR, gaming, and educational tools that require realistic scene reconstruction from limited data. As more advanced diffusion models emerge, approaches like MVSplat360 can be expected to gain even more realistic appearance and faster inference.
In conclusion, MVSplat360 stands out by bridging a technical gap in scene synthesis under sparse input conditions. While limitations remain in inference speed and in handling hallucination-induced artifacts, MVSplat360 points toward efficient and practical scene generation that fits real-world application constraints. The work contributes to ongoing research by suggesting new ways to approach complex scene synthesis with limited observational data.