- The paper introduces Generative Gaussian Splatting (GGS), combining 3D Gaussian splatting with video diffusion priors to synthesize consistent 3D scenes.
- GGS significantly improves the 3D consistency and quality of generated scenes, with an approximately 20% FID improvement on RealEstate10K and ScanNet++.
- This method offers practical implications for areas like cinematography and virtual reality by enhancing video diffusion models with crucial 3D consistency.
The paper "Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors" (2503.13272) introduces a novel method, Generative Gaussian Splatting (GGS), for synthesizing consistent and photorealistic 3D scenes by integrating a 3D representation with a pre-trained latent video diffusion model. This approach addresses the limitations of video diffusion models, which typically lack 3D consistency in generated sequences, and the challenges associated with directly training generative 3D models due to the scarcity of large-scale 3D training data.
Key Components and Methodology of GGS
GGS synthesizes a feature field parameterized by 3D Gaussian primitives, which can be rendered into multi-view feature maps and subsequently decoded into multi-view images or upsampled into a 3D radiance field. The method builds on the U-Net of a pre-trained video diffusion model to transform reference images into coherent 3D scenes. Camera poses, encoded as Plücker embeddings, condition the diffusion model, which generates a feature field realized as Gaussian splats; the splats are then rendered into images and feature maps and refined into low-noise reconstructions.
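As a rough illustration of the Plücker-embedding conditioning described above, the sketch below (PyTorch; all function and variable names are assumptions, not the authors' code) computes a per-pixel 6-channel embedding (ray direction and ray moment) from camera intrinsics and extrinsics, which could then be concatenated to the diffusion U-Net's input channels.

```python
# Hypothetical sketch of per-pixel Plücker ray embeddings for camera-pose
# conditioning. Names and shapes are illustrative assumptions.
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Return a (6, H, W) Plücker embedding (direction, moment) for one camera.

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    """
    # Pixel grid sampled at pixel centers.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject pixels to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)],
        dim=-1,
    )  # (H, W, 3)
    # Rotate into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    # The ray origin is the camera center; the Plücker moment is o x d.
    origin = c2w[:3, 3].expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    # Concatenate direction and moment -> 6 channels per pixel.
    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)
```

Because the embedding is per pixel, it can be stacked with the latent frames along the channel dimension, letting the U-Net condition on camera geometry without architectural changes.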
Experimental Results and Evaluation Metrics
The authors evaluated GGS on the RealEstate10K and ScanNet++ datasets, using FID and FVD for image fidelity and temporal consistency, PSNR and LPIPS for reconstruction quality, and TSED for 3D consistency. The results show that GGS substantially improves the 3D consistency and quality of generated scenes compared to relevant baselines. In particular, GGS improves FID on generated 3D scenes by approximately 20% on both RealEstate10K and ScanNet++ relative to comparable models lacking a 3D representation.
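For concreteness, the following sketch shows how two of these quantities could be computed in practice; the FID values used in the example are placeholders, not results from the paper.

```python
# Illustrative-only evaluation helpers: PSNR for image quality and the
# relative FID reduction corresponding to the ~20% figure reported above.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def relative_fid_improvement(fid_baseline: float, fid_ggs: float) -> float:
    """Relative FID reduction; 0.20 corresponds to a 20% improvement."""
    return (fid_baseline - fid_ggs) / fid_baseline

# Example with placeholder FID values (not taken from the paper):
# relative_fid_improvement(40.0, 32.0) -> 0.2
```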
Comparison with Baselines
GGS was compared against baselines such as ViewCrafter and CameraCtrl, demonstrating marked improvements in image consistency and quality due to the integration of explicit 3D representations. The method's ability to generalize across unseen datasets while maintaining high fidelity highlights its robustness.
Implications and Potential Future Research
The GGS approach offers both theoretical and practical implications, opening avenues for further research into the integration of 3D reconstructions within generative models. Enhancing video diffusion models with 3D consistency has significant applications in areas such as cinematography and virtual reality, where coherent and dynamic scene generation is crucial.
Future research directions include:
- Reducing the computational cost associated with Gaussian splats.
- Exploring alternative parameterization schemes for improved scalability and precision.
- Combining GGS with other neural rendering techniques to further advance 3D generative models.