- The paper introduces a novel framework that adapts pretrained 2D diffusion models to efficiently generate high-fidelity 3D Gaussian splats.
- It pairs a lightweight reconstruction model for rapid dataset curation with a dual diffusion-plus-rendering loss that enforces geometric fidelity and multi-view consistency in 3D outputs.
- Experimental results show superior performance on text- and image-conditioned tasks, outperforming prior 3D methods on metrics such as PSNR, SSIM, and LPIPS.
The paper "DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation" presents a novel framework for generating 3D Gaussian splats, termed DiffSplat, by leveraging large-scale image diffusion models. The work addresses significant challenges in converting 2D images or text inputs into high-fidelity 3D content, which is a longstanding issue due to limitations in available high-quality 3D datasets and the inconsistencies that arise in multi-view 2D generation methods.
Key Contributions
- Integration of 2D Priors in 3D Generative Models: DiffSplat departs from previous models by building directly on large pretrained 2D diffusion models, improving 3D content generation while preserving consistency across multiple views, a common failure point in 3D modeling.
- Efficient Dataset Creation via Structured Gaussian Splats: The authors introduce a lightweight reconstruction model that produces multi-view Gaussian splat grids from multi-view renderings, reconstructing the grid for a single object in under 0.1 seconds. This enables scalable, high-quality dataset curation and bootstraps the framework's training with synthetic 3D data (see the representation sketch after this list).
- Novel Framework Architecture: DiffSplat reformulates 3D content generation as denoising grids of Gaussian splat properties with minimally adapted image diffusion models. Because splat grids are organized like multi-view images, the framework can operate in the latent space of pretrained image diffusion models.
- Dual Loss for Improved Consistency: The standard diffusion loss is combined with a novel 3D rendering loss that compares rendered views of the predicted splats against ground-truth images, maintaining coherent appearance across arbitrary views. This combination enforces 3D consistency and avoids the consistency collapse seen in earlier 3D generation models that rely on image-based priors (the objective is sketched after this list).
- Compatibility with Pretrained Models: The framework's minimal modifications allow it to remain compatible with a variety of pretrained text-to-image diffusion models. This flexibility enables the application of advanced image generation techniques to the field of 3D creation.
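To make the splat-grid representation concrete, here is a minimal sketch in PyTorch. The 14-channel per-pixel layout (color, opacity, scale, rotation quaternion, position), channel ordering, and helper names are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-pixel Gaussian attribute layout (14 channels, assumed):
# 3 RGB + 1 opacity + 3 scale + 4 rotation quaternion + 3 position.
NUM_CHANNELS = 14

def make_splat_grid(num_views: int, height: int, width: int) -> torch.Tensor:
    """Allocate a multi-view splat grid shaped like a batch of images (V, C, H, W).

    Each "pixel" stores the attributes of one 3D Gaussian, so a standard
    image diffusion architecture can denoise the grid without modification.
    """
    return torch.zeros(num_views, NUM_CHANNELS, height, width)

def unpack_gaussians(grid: torch.Tensor) -> dict:
    """Unpack a splat grid into per-Gaussian attribute tensors with valid ranges."""
    v, c, h, w = grid.shape
    flat = grid.permute(0, 2, 3, 1).reshape(v * h * w, c)  # (N, C)
    return {
        "rgb":      torch.sigmoid(flat[:, 0:3]),         # colors in (0, 1)
        "opacity":  torch.sigmoid(flat[:, 3:4]),         # opacity in (0, 1)
        "scale":    torch.exp(flat[:, 4:7]),             # strictly positive scales
        "rotation": F.normalize(flat[:, 7:11], dim=-1),  # unit quaternions
        "position": flat[:, 11:14],                      # 3D centers
    }

grid = make_splat_grid(num_views=4, height=128, width=128)
gaussians = unpack_gaussians(grid)
print(gaussians["position"].shape)  # torch.Size([65536, 3])
```

The dual training objective can likewise be sketched, shown here in x0-prediction form for clarity; `render_views` stands in for a differentiable Gaussian splatting renderer, and the loss weight is a placeholder rather than the paper's reported configuration:

```python
import torch.nn.functional as F

def dual_loss(pred_grid, target_grid, target_images, render_views, cameras,
              lambda_render: float = 1.0):
    """Combined diffusion + rendering objective (illustrative sketch)."""
    # Diffusion loss: the denoised splat grid should match the target grid.
    loss_diffusion = F.mse_loss(pred_grid, target_grid)

    # Rendering loss: splat the predicted Gaussians at held-out camera views
    # and compare with ground-truth renderings, enforcing cross-view coherence.
    rendered = render_views(pred_grid, cameras)  # (K, 3, H, W)
    loss_render = F.mse_loss(rendered, target_images)

    return loss_diffusion + lambda_render * loss_render
```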
Experimental Validation
The authors validate their approach through extensive experiments on text- and image-conditioned generation tasks. DiffSplat demonstrates superior performance in both qualitative and quantitative metrics when compared with state-of-the-art native 3D generative models and two-stage reconstruction-based methods. Notable improvements are shown in:
- Prompt Alignment and Visual Quality:
  - In text-conditioned tasks, DiffSplat significantly outperforms competitors, particularly on complex prompts involving multiple objects or scenes.
  - In image-conditioned tasks, the model delivers accurate 3D outputs that align closely with input images, surpassing existing methods on PSNR, SSIM, and LPIPS (computed as in the sketch below).
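For reference, these image-similarity metrics can be computed with off-the-shelf tooling. The sketch below uses torchmetrics (an assumption about tooling, not the paper's evaluation code), with random tensors standing in for rendered and ground-truth views:

```python
import torch
from torchmetrics.image import (
    PeakSignalNoiseRatio,
    StructuralSimilarityIndexMeasure,
)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Stand-ins for rendered and ground-truth views, values in [0, 1].
pred = torch.rand(4, 3, 256, 256)
target = torch.rand(4, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

# Higher PSNR/SSIM and lower LPIPS indicate closer agreement with the target.
print(f"PSNR:  {psnr(pred, target):.2f} dB")
print(f"SSIM:  {ssim(pred, target):.4f}")
print(f"LPIPS: {lpips(pred, target):.4f}")
```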
Applications and Future Directions
The paper also explores adapting control techniques such as ControlNet to enable controlled 3D generation. This ability to incorporate conditioning signals from varied input formats illustrates DiffSplat's potential impact on diverse fields, including virtual reality and game design; the general 2D ControlNet pattern is sketched below.
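The sketch shows only the stock 2D ControlNet setup from the Hugging Face diffusers library (standard model IDs and pipeline, not DiffSplat's actual integration), to illustrate how a spatial conditioning signal plugs into a pretrained diffusion backbone:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Stock 2D ControlNet setup; DiffSplat applies the same residual-conditioning
# pattern to splat-grid denoising rather than ordinary images.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A spatial conditioning image (an edge map here) steers the output layout.
edge_map = load_image("edges.png")  # placeholder path to a Canny edge image
result = pipe("a ceramic teapot, studio lighting", image=edge_map).images[0]
result.save("controlled_output.png")
```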
While the framework advances the field significantly, the authors identify future work in improving mesh conversion quality and integrating physically based material properties. Incorporating real-world video datasets could also expand DiffSplat's capabilities, building on its scalability and its close integration with the image generation community.
Overall, the paper presents a compelling argument for the integration of 2D image priors in 3D content generation, offering a scalable and efficient pathway to high-quality 3D modeling. The proposed DiffSplat framework marks a significant step towards bridging the gap between 2D image generation and 3D content creation.