Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
The paper "Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs" addresses persistent issues in 3D modeling from sparse views, notably extrapolation and occlusion. The authors propose a novel framework that exploits the latent capabilities of video diffusion models to enhance the integrity and consistency of 3D reconstructions using 3D Gaussian Splatting (3DGS).
Context and Contributions
Recent advances in Novel View Synthesis (NVS) have leveraged Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) to render high-fidelity images from arbitrary viewpoints. When the inputs are sparse, however, two problems persist: regions outside the inputs' fields of view (extrapolation) and regions occluded by other structures are difficult to model accurately. The paper tackles these challenges through several strategies:
- Scene-Grounding Guidance: An already-optimized 3DGS model renders sequences that serve as consistent references during the video diffusion denoising process. Grounding the diffusion in rendered scene content yields more consistent and plausible outputs without retraining the diffusion model. A minimal sketch of this kind of training-free guidance appears after this list.
- Trajectory Initialization: A robust trajectory-planning procedure identifies critical regions that the sparse inputs fail to cover. Initializing camera trajectories from points of interest determined by the optimized 3DGS model ensures that the generated sequences visit otherwise overlooked parts of the scene (see the second sketch below).
- 3DGS Optimization with Generated Sequences: Carefully designed loss functions and sampling strategies let the generated video frames refine the 3DGS model, filling in missing content while improving the overall quality of rendered images (see the third sketch below).
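To make the scene-grounding idea concrete, the sketch below shows one way a rendered reference sequence could steer a denoising loop: at each step, the predicted clean frames are pulled toward the 3DGS renders via the gradient of a masked reconstruction loss. The denoiser, noise schedule, mask, and guidance weight are all toy placeholders, not the paper's actual implementation.

```python
# Minimal sketch of training-free scene-grounding guidance in a DDIM-style
# denoising loop. Everything here is an illustrative assumption.
import torch

def toy_denoiser(x_t, t):
    """Stand-in for a pretrained video diffusion model's noise predictor."""
    return 0.1 * x_t  # placeholder epsilon prediction

def scene_grounded_denoise(x_T, reference, mask, alphas_cumprod, guidance_scale=1.0):
    """Denoise x_T while pulling the predicted clean frames toward 3DGS renders.

    reference: frames rendered from the already-optimized 3DGS model.
    mask:      per-pixel confidence (1 where the render is considered reliable).
    """
    x_t = x_T
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        x_t = x_t.detach().requires_grad_(True)
        eps = toy_denoiser(x_t, t)
        # Standard estimate of the clean sample from the current noisy one.
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Grounding loss: penalize deviation from the renders where they are trusted.
        loss = ((mask * (x0_hat - reference)) ** 2).mean()
        grad = torch.autograd.grad(loss, x_t)[0]
        # Deterministic DDIM-style step, nudged against the grounding gradient.
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x_t = (a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
               - guidance_scale * grad).detach()
    return x_t

# Toy usage: an 8-frame "video" of 3-channel 32x32 images.
noise = torch.randn(8, 3, 32, 32)       # x_T
renders = torch.rand(8, 3, 32, 32)      # sequence rendered from the 3DGS model
conf = torch.ones_like(renders)         # trust every pixel in this toy case
acp = torch.linspace(0.99, 0.01, 50)    # toy cumulative-alpha schedule
print(scene_grounded_denoise(noise, renders, conf, acp).shape)
```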
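The trajectory initialization can likewise be pictured as a small geometric routine: score how often each scene point is observed by the input cameras, take the least-covered point as the point of interest, and plan a look-at path from the nearest input camera toward it. The cone-based visibility test, coverage score, and look-at construction below are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of coverage-aware trajectory initialization.
import numpy as np

def coverage_counts(points, cam_centers, cam_dirs, fov_cos=0.8):
    """Count, for each 3D point, how many input cameras plausibly see it
    (here: the point lies inside a cone around the camera's viewing direction)."""
    counts = np.zeros(len(points), dtype=int)
    for c, d in zip(cam_centers, cam_dirs):
        v = points - c
        v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-8
        counts += (v @ d > fov_cos).astype(int)
    return counts

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world rotation whose forward axis points at `target`."""
    f = target - eye; f /= np.linalg.norm(f)
    r = np.cross(f, up); r /= np.linalg.norm(r)
    u = np.cross(r, f)
    return np.stack([r, u, -f], axis=1)  # columns: right, up, backward

def init_trajectory(points, cam_centers, cam_dirs, n_frames=16, standoff=1.0):
    """Plan a camera path from the closest input view toward the least-seen region."""
    counts = coverage_counts(points, cam_centers, cam_dirs)
    target = points[np.argmin(counts)]                               # least-observed point
    start = cam_centers[np.argmin(np.linalg.norm(cam_centers - target, axis=1))]
    # End the path a fixed distance short of the target, along the start->target line.
    direction = (target - start) / (np.linalg.norm(target - start) + 1e-8)
    end = target - standoff * direction
    poses = []
    for s in np.linspace(0.0, 1.0, n_frames):
        eye = (1 - s) * start + s * end
        poses.append((look_at(eye, target), eye))                    # (R, t) per frame
    return poses

# Toy usage: 2000 random scene points observed by 3 input cameras.
pts = np.random.randn(2000, 3)
cams = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [-2.0, 0.0, 0.0]])
dirs = np.array([[-1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
traj = init_trajectory(pts, cams, dirs)
print(len(traj), traj[0][0].shape)  # 16 poses, each with a 3x3 rotation
```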
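Finally, a toy version of the optimization stage: the splatting model is fit against a mix of the original sparse input views and the diffusion-generated frames, with the generated frames downweighted because their content may be hallucinated or slightly inconsistent. The renderer, loss weights, and 50/50 sampling ratio below are illustrative stand-ins, not the paper's actual loss design.

```python
# Minimal sketch of mixing real input views and generated frames when
# optimizing a splatting model.
import random
import torch

# Toy "scene": a learnable image standing in for the rasterized 3DGS output.
params = torch.nn.Parameter(torch.rand(3, 32, 32))

def render(view_id):
    """Stand-in for rasterizing the Gaussians from a given camera pose."""
    return params  # a real renderer would depend on view_id

real_views = {0: torch.rand(3, 32, 32), 1: torch.rand(3, 32, 32)}   # sparse inputs
gen_views = {10 + k: torch.rand(3, 32, 32) for k in range(8)}       # generated frames

opt = torch.optim.Adam([params], lr=1e-2)
for step in range(200):
    # Sampling strategy: alternate between trusted inputs and generated frames.
    use_real = random.random() < 0.5
    pool, weight = (real_views, 1.0) if use_real else (gen_views, 0.3)
    view_id, target = random.choice(list(pool.items()))
    pred = render(view_id)
    # L1 photometric loss; generated frames carry a smaller weight.
    loss = weight * (pred - target).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```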
Empirical Results
The proposed method is evaluated on the challenging indoor datasets Replica and ScanNet++, where it shows substantial improvements over baselines: PSNR gains exceeding 3.5 dB on Replica and 2.5 dB on ScanNet++, yielding state-of-the-art performance. Qualitative comparisons further show that the approach suppresses common artifacts and recovers finer detail, especially in complex scenes where previous models struggle.
Implications and Future Directions
The implications of this paper are significant both practically and theoretically. Practically, better modeling from sparse-view inputs could advance applications in virtual reality, robotics, and autonomous navigation, where accurate understanding of under-sampled environments is crucial. Theoretically, the integration of diffusion models with 3D rendering techniques suggests a promising avenue for further exploration in AI-driven image synthesis, pushing the boundaries of what can be achieved with minimal input data.
Future research could delve into optimizing the diffusion model itself for real-time applications, tackling computational constraints without sacrificing the enhancements in visual fidelity. Additionally, exploration into hybrid models that combine 3DGS with other representation formats may yield even greater accuracy and realism in sparse-input scenarios. As diffusion models advance, so too will the potential for more dynamic and interactive 3D environments.
This paper lays a foundational framework, demonstrating the untapped potential of video diffusion models within the domain of 3D modeling. Through careful design and insightful integration of scene-grounding practices, it paves the way for more robust and versatile applications in computer vision and AI-driven graphics.