
SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D (2310.02596v2)

Published 4 Oct 2023 in cs.CV

Abstract: It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This "coarse" alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality objects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with an 85+% consistency rate by human evaluation, while many previous methods are around 30%. Our project page is https://sweetdreamer3d.github.io/

Authors (4)
  1. Weiyu Li (33 papers)
  2. Rui Chen (310 papers)
  3. Xuelin Chen (17 papers)
  4. Ping Tan (101 papers)
Citations (91)

Summary

Overview of "SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D"

The paper "SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D" addresses the challenge of lifting 2D outputs from diffusion models into coherent 3D representations. 2D diffusion models lack inherent 3D awareness, which often leads to inconsistencies across multiple views. This research identifies geometric inconsistency as the root cause of the problem and proposes a solution: aligning the 2D geometric priors in diffusion models with well-defined 3D geometry during the lifting process.

The authors fine-tune 2D diffusion models to be viewpoint-aware and to produce view-specific coordinate maps. The process incorporates only coarse 3D information, which is enough to resolve geometric inconsistencies while preserving the rich, detailed generation capabilities of 2D models. The key contribution is the resulting Aligned Geometric Priors (AGP), which mitigate multi-view inconsistencies while retaining high-quality, diversified output. The method achieves over 85% consistency in human evaluation, well above the roughly 30% reported for many preceding methods.
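The viewpoint-awareness described above amounts to conditioning the diffusion model on the camera pose during fine-tuning. As a rough sketch (the sinusoidal encoding below is an illustrative assumption, not the paper's documented design), a camera embedding that could be added to the model's timestep embedding might look like:

```python
import numpy as np

def camera_embedding(azimuth, elevation, dim=16):
    """Sinusoidal embedding of the camera viewpoint (radians), intended to
    be added to a diffusion model's timestep embedding during fine-tuning.
    Hypothetical layout: the paper conditions on the camera, but this exact
    encoding is an assumption for illustration."""
    angles = np.array([azimuth, elevation], dtype=np.float64)
    freqs = 2.0 ** np.arange(dim // 4)          # geometric frequency ladder
    arg = angles[:, None] * freqs[None, :]      # (2, dim//4) phase matrix
    # flatten sin and cos parts into one vector of length `dim`
    return np.concatenate([np.sin(arg), np.cos(arg)], axis=None)
```

In practice such an embedding would be injected into every residual block alongside the timestep embedding, so the denoiser's prediction depends on which canonical view it is asked to produce.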

Technical Approach and Methodology

  1. Identifying Inconsistencies: The paper delineates two main types of inconsistencies in text-to-3D synthesis: geometric and appearance inconsistencies. Geometric inconsistencies are the major focus, as they more frequently result in perceptual errors when transitioning from 2D to 3D.
  2. Geometric Priors in 2D Diffusion: The authors leverage the inherent geometric priors within 2D diffusion models, aligning these with 3D structures. By fine-tuning the 2D models to produce canonical coordinate maps that translate into 3D viewpoints, the method circumvents the traditional data-hungry requirements of 3D model training.
  3. Utilizing Canonical Coordinates and Camera Conditioning: The process involves rendering depth maps from canonical 3D models, producing coordinate maps that serve as inputs during model fine-tuning. This integration of coarse yet consistent geometric information is computationally efficient and enhances the viewpoint-awareness needed for accurate 3D modeling.
  4. Fine-tuning Procedures: Fine-tuning must not compromise the generative capabilities of the original 2D model. The approach strikes a sweet spot: coarse geometric alignment is integrated while the original model's capacity for high-fidelity, high-diversity output is retained.
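The coordinate-map supervision in step 3 can be sketched as back-projecting each rendered depth pixel into the canonical object frame and normalizing by the object's bounding box. Function names and the exact normalization below are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def depth_to_ccm(depth, K, cam_to_canonical, bbox_min, bbox_max):
    """Back-project a depth map into canonical object space and normalize
    coordinates to [0, 1] so they can be stored as an RGB canonical
    coordinate map (CCM).

    depth:            (H, W) depth in camera space; <= 0 marks background
    K:                (3, 3) camera intrinsics
    cam_to_canonical: (4, 4) camera-to-canonical-object transform
    bbox_min/max:     (3,) canonical bounding box used for normalization
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # rays in camera space, scaled by per-pixel depth
    cam_pts = (np.linalg.inv(K) @ pix.reshape(-1, 3).T).T * depth.reshape(-1, 1)
    cam_h = np.concatenate([cam_pts, np.ones((cam_pts.shape[0], 1))], axis=1)
    canon = (cam_to_canonical @ cam_h.T).T[:, :3]
    # map canonical coordinates into [0, 1] via the bounding box
    ccm = (canon - bbox_min) / (bbox_max - bbox_min)
    ccm[depth.reshape(-1) <= 0] = 0.0  # zero out background pixels
    return np.clip(ccm, 0.0, 1.0).reshape(H, W, 3)
```

Because the maps encode object-space position rather than appearance, they carry exactly the coarse, view-consistent geometric signal the fine-tuning needs, and nothing more.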

Integration and Results

The paper demonstrates the integration of AGP into existing state-of-the-art text-to-3D pipelines built on both DMTet-based and NeRF-based representations. The results highlight the versatility and compatibility of AGP, which enhances geometric accuracy without interfering with appearance rendering. Quantitatively, this yields a substantial improvement in consistency rate over other contemporary approaches.
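For the lifting itself, AGP plugs into standard score-distillation pipelines: at each optimization step a rendered coordinate map is noised and the viewpoint-aware model's noise prediction drives the geometry update. A minimal sketch, with a placeholder denoiser standing in for the fine-tuned network and the weighting choice being an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def agp_denoiser(x_noisy, t, camera):
    """Stand-in for the viewpoint-aware diffusion model that predicts the
    noise added to a rendered canonical coordinate map. Hypothetical: the
    real model is a fine-tuned diffusion U-Net conditioned on the camera."""
    return x_noisy * 0.1  # placeholder prediction

def sds_grad(render, t, camera, alpha_bar):
    """One score-distillation step on a rendered coordinate map `render`:
    grad = w(t) * (eps_hat - eps), to be back-propagated to the 3D shape
    parameters through the differentiable renderer (omitted here)."""
    eps = rng.standard_normal(render.shape)
    x_noisy = np.sqrt(alpha_bar) * render + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = agp_denoiser(x_noisy, t, camera)
    w = 1.0 - alpha_bar  # a common SDS weighting; an assumption here
    return w * (eps_hat - eps)
```

Because the gradient acts on rendered geometry maps rather than RGB images, the appearance branch of the host pipeline (DMTet- or NeRF-based) is left untouched, which is what makes the integration drop-in.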

Implications and Future Directions

The implications of integrating AGP into text-to-3D systems extend to both theoretical and practical domains. Theoretically, the method capitalizes on the latent geometric knowledge embedded within 2D diffusion models, unlocking a more coherent path for translating 2D generative capabilities into 3D. Practically, the approach removes the need for vast, high-quality 3D datasets, cutting resource expenditure while maintaining output quality.

Looking forward, research could explore extending this alignment strategy towards addressing the rarer appearance inconsistencies, potentially through selective utilization of complementary appearance priors. Additionally, further refinement in handling complex geometric structures could open new avenues for more intricate and realistic 3D synthesis in real-time applications.

In conclusion, "SweetDreamer" delivers a critical advancement in overcoming the multi-view consistency challenges inherent in text-to-3D synthesis. By effectively realigning geometric priors within pre-trained 2D diffusion models, the paper elevates the potential and practical applicability of generative models across dimensional boundaries.
