- The paper introduces Generative Gaussian Splatting (GGS), combining 3D Gaussian splatting with video diffusion priors to synthesize consistent 3D scenes.
- GGS significantly improves the 3D consistency and quality of generated scenes, with an approximately 20% FID improvement on RealEstate10K and ScanNet++.
- This method offers practical implications for areas like cinematography and virtual reality by enhancing video diffusion models with crucial 3D consistency.
The paper "Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors" (2503.13272) introduces a novel method, Generative Gaussian Splatting (GGS), for synthesizing consistent and photorealistic 3D scenes by integrating a 3D representation with a pre-trained latent video diffusion model. This approach addresses the limitations of video diffusion models, which typically lack 3D consistency in generated sequences, and the challenges associated with directly training generative 3D models due to the scarcity of large-scale 3D training data.
Key Components and Methodology of GGS
GGS synthesizes a feature field parameterized by 3D Gaussian primitives, which can be rendered into multi-view feature maps and subsequently decoded into multi-view images or upsampled into a 3D radiance field. The method builds on the U-Net of a pre-trained video diffusion model to transform reference images into coherent 3D scenes. Camera poses, encoded as Plücker embeddings, condition the diffusion model, which generates a feature field realized as Gaussian splats; the splats are then rendered into images and feature maps and refined into low-noise reconstructions.
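As a rough illustration of the Plücker-embedding conditioning described above, the sketch below (PyTorch; all function and variable names are assumptions, not the authors' code) computes a per-pixel 6-channel embedding (ray direction and ray moment) from camera intrinsics and extrinsics, which could then be concatenated to the diffusion U-Net's input channels.

```python
# Hypothetical sketch of per-pixel Plücker ray embeddings for camera-pose
# conditioning. Names and shapes are illustrative assumptions.
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Return a (6, H, W) Plücker embedding (direction, moment) for one camera.

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    """
    # Pixel grid sampled at pixel centers.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject pixels to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)],
        dim=-1,
    )  # (H, W, 3)
    # Rotate into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    # The ray origin is the camera center; the Plücker moment is o x d.
    origin = c2w[:3, 3].expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    # Concatenate direction and moment -> 6 channels per pixel.
    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)
```

Because the embedding is per pixel, it can be stacked with the latent frames along the channel dimension, letting the U-Net condition on camera geometry without architectural changes.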
Experimental Results and Evaluation Metrics
The authors evaluated GGS on the RealEstate10K and ScanNet++ datasets, using FID and FVD for image fidelity and temporal consistency, PSNR and LPIPS for reconstruction quality, and TSED for 3D consistency. The results show that GGS substantially improves the 3D consistency and quality of generated scenes compared to relevant baselines. In particular, GGS improves FID on generated 3D scenes by approximately 20% on both RealEstate10K and ScanNet++ relative to comparable models lacking a 3D representation.
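For concreteness, the following sketch shows how two of these quantities could be computed in practice; the FID values used in the example are placeholders, not results from the paper.

```python
# Illustrative-only evaluation helpers: PSNR for image quality and the
# relative FID reduction corresponding to the ~20% figure reported above.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def relative_fid_improvement(fid_baseline: float, fid_ggs: float) -> float:
    """Relative FID reduction; 0.20 corresponds to a 20% improvement."""
    return (fid_baseline - fid_ggs) / fid_baseline

# Example with placeholder FID values (not taken from the paper):
# relative_fid_improvement(40.0, 32.0) -> 0.2
```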
Comparison with Baselines
GGS was compared against baselines such as ViewCrafter and CameraCtrl, demonstrating marked improvements in image consistency and quality due to the integration of explicit 3D representations. The method's ability to generalize across unseen datasets while maintaining high fidelity highlights its robustness.
Implications and Potential Future Research
The GGS approach offers both theoretical and practical implications, opening avenues for further research into the integration of 3D reconstructions within generative models. Enhancing video diffusion models with 3D consistency has significant applications in areas such as cinematography and virtual reality, where coherent and dynamic scene generation is crucial.
Future research directions include:
- Reducing the computational cost associated with Gaussian splats.
- Exploring alternative parameterization schemes for improved scalability and precision.
- Combining GGS with other neural rendering techniques to further advance 3D generative models.