- The paper introduces Score Distillation Sampling to optimize a NeRF using a pretrained 2D diffusion model for text-guided 3D synthesis without 3D data.
- It achieves higher CLIP R-Precision and improved visual coherence over methods like Dream Fields and CLIP-Mesh, aligning generated models closely with text prompts.
- The approach democratizes 3D content creation, enabling applications in gaming, film, and VR while prompting future research on resolution and diversity enhancements.
DreamFusion: Text-to-3D using 2D Diffusion
"DreamFusion: Text-to-3D using 2D Diffusion" introduces a significant advancement in the application of diffusion models for text-to-3D synthesis without the need for 3D training data. The paper proposes a novel approach that leverages a pretrained 2D text-to-image diffusion model to generate 3D models, circumventing the traditional constraints associated with 3D datasets and architectures.
Methodology
The core of DreamFusion utilizes a technique termed Score Distillation Sampling (SDS). This approach allows the use of a pretrained 2D diffusion model as a loss function to optimize a Neural Radiance Field (NeRF) representation of a 3D scene. Here is a breakdown of the methodology:
- Diffusion Models: The foundational 2D diffusion model is trained on a large dataset of image-text pairs. It learns to generate high-quality images from text descriptions by iteratively denoising a sample, moving it from a noise distribution toward the data distribution.
- Score Distillation Sampling (SDS): SDS minimizes the KL divergence between the Gaussian distributions obtained by noising a rendered image (via the diffusion forward process) and the text-conditioned densities implied by the pretrained model's learned score functions. In practice, each step noises the rendering, asks the frozen diffusion model to predict that noise, and uses the difference between predicted and true noise as a gradient on the 3D parameters, iteratively pulling renderings toward the text-conditioned prior.
- NeRF Optimization: The paper adapts the NeRF, a model originally designed for 3D reconstruction from multiple 2D images, to this new generative task. The NeRF is optimized so that its 2D renderings from random angles achieve a low SDS loss. Importantly, no modifications to the diffusion model are required, and the optimization process does not necessitate 3D supervision.
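The SDS update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `denoiser_eps` is a hypothetical stand-in for a pretrained text-conditioned diffusion model's noise predictor, the weighting w(t) is set to 1, and the update is applied directly to the rendered pixels rather than backpropagated through NeRF parameters as in the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser_eps(x_t, t, text_embedding):
    # Hypothetical placeholder for a frozen pretrained diffusion model's
    # noise prediction eps_hat(x_t; y, t). A real system would call a
    # text-to-image model (e.g. Imagen) here.
    return 0.1 * x_t

def sds_gradient(render, t, text_embedding, alpha_bar):
    """One Score Distillation Sampling gradient on a rendered image.

    Computes eps_hat(x_t; y, t) - eps, i.e. the SDS gradient with the
    timestep weighting w(t) fixed to 1 for simplicity. Note the gradient
    does not flow through the diffusion model itself.
    """
    eps = rng.standard_normal(render.shape)            # sampled noise
    x_t = np.sqrt(alpha_bar) * render + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser_eps(x_t, t, text_embedding)     # frozen model
    return eps_hat - eps                               # w(t) = 1

# Toy usage: a 64x64 RGB "rendering" standing in for a NeRF view.
render = rng.standard_normal((64, 64, 3))
g = sds_gradient(render, t=500, text_embedding=None, alpha_bar=0.5)
render -= 1e-2 * g  # in the paper, this update flows into NeRF weights
```

In the full method this gradient is propagated through the differentiable renderer into the NeRF's parameters, so that many random views jointly satisfy the 2D prior.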
Results
The empirical evaluation of DreamFusion demonstrates its superiority over previous methods like Dream Fields and CLIP-Mesh. Key results include:
- CLIP R-Precision: DreamFusion achieves higher CLIP R-Precision scores compared to baselines, indicating a stronger alignment between the generated 3D models and the text prompts. For example, in the object-centric subset of COCO, DreamFusion scored 77.5 using the CLIP B/16 model, while Dream Fields and CLIP-Mesh had scores of 74.2 and 75.8, respectively.
- Qualitative Evaluation: The visual output of DreamFusion is shown to be more detailed and coherent compared to baselines. Moreover, the generated 3D models exhibit improved geometric consistency when viewed from different angles.
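For concreteness, CLIP R-Precision can be computed from the cosine similarities between CLIP embeddings of renderings and their prompts. The sketch below assumes pre-extracted, L2-normalized embedding matrices; it is an illustration of the metric, not the paper's evaluation code.

```python
import numpy as np

def clip_r_precision(image_feats, text_feats):
    """Fraction of renderings whose top-1 retrieved caption (by cosine
    similarity) is the prompt actually used to generate them.

    image_feats[i] and text_feats[i] are assumed to be L2-normalized CLIP
    embeddings of the i-th rendering and its source prompt.
    """
    sims = image_feats @ text_feats.T        # pairwise cosine similarities
    retrieved = sims.argmax(axis=1)          # best-matching caption per image
    return (retrieved == np.arange(len(sims))).mean()

# Toy example: with identical image/text embeddings, retrieval is perfect.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
score = clip_r_precision(feats, feats)  # perfect alignment -> 1.0
```

A score of 77.5 (as reported for DreamFusion with CLIP B/16) means the correct prompt is the top retrieval for 77.5% of evaluated scenes.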
Implications
The implications of this work are manifold:
- Practical Applications: By lowering the barrier to 3D content creation, DreamFusion can significantly impact industries such as gaming, film, and virtual reality, where the demand for detailed 3D assets is high.
- Accessibility: DreamFusion has the potential to democratize 3D modeling, making it more accessible to non-experts and enhancing creativity and productivity among professional artists.
- Future Research: The robustness of SDS and the technique of leveraging 2D models for 3D synthesis without explicit 3D training data opens new avenues in generative modeling and inverse rendering.
Challenges and Future Directions
While DreamFusion marks considerable progress, several challenges and areas for future research remain:
- Loss Function Limitations: SDS often produces oversmoothed results compared to traditional sampling methods. Future work could develop refined loss functions or optimization techniques to mitigate this issue.
- Model Resolution: The current implementation using a 64×64 Imagen model limits the detail of the synthesized 3D models. Scaling this approach to higher resolutions will be crucial for producing finer details essential for practical applications.
- Diversity in Outputs: DreamFusion’s optimization approach tends towards mode-seeking, which can reduce the diversity in generated 3D models. Exploring strategies to enhance the diversity of outputs while maintaining coherence with the input text is an important direction for further research.
Conclusion
"DreamFusion: Text-to-3D using 2D Diffusion" provides a sophisticated yet elegant solution to the problem of text-to-3D synthesis. By innovatively employing pretrained 2D diffusion models and optimizing 3D NeRF representations through SDS, this paper sets a new benchmark in the field, demonstrating the profound potential of integrating diffusion models with neural rendering techniques for generative tasks in computer graphics.