RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D (2311.16918v2)

Published 28 Nov 2023 in cs.CV and cs.AI

Abstract: Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describe scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://aigc3d.github.io/richdreamer/.

Introduction to 3D Generation from Text

The field of AI-powered image generation has grown rapidly, driven by advances in generative models and large-scale training datasets. Transforming text descriptions into 3D models, however, remains a challenge. Recent text-to-3D systems have made progress, demonstrating impressive zero-shot generation by optimizing neural radiance fields. Despite this, producing detailed, rich 3D models that are both geometrically and materially accurate remains difficult.

Overcoming the Challenges of 3D Generation

Traditional methods approach 3D content generation by creating the geometry first and then the texture. However, directly applying 2D diffusion models, which excel at generating natural images, is less effective for optimizing 3D geometry because of the distribution gap between natural images and rendered normal maps. To address this, the paper proposes a Normal-Depth diffusion model for 3D generation, which yields significant improvements in detail richness. A sketch of the score-distillation mechanism this pipeline builds on follows below.
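
To make the optimization loop concrete, here is a minimal sketch of score-distillation sampling (SDS), the mechanism the abstract refers to for lifting a 2D diffusion prior into 3D. The callable `diffusion_model`, the timestep range, and the weighting are illustrative assumptions, not the paper's actual code.

```python
import torch

def sds_loss(diffusion_model, rendered, text_emb, alphas_cumprod):
    """One score-distillation step: nudge a rendered view toward the
    2D diffusion prior. `diffusion_model(noisy, t, text_emb)` is a
    placeholder for a frozen noise-prediction U-Net.
    """
    b = rendered.shape[0]
    device = rendered.device

    # Sample a random diffusion timestep and noise the rendering.
    t = torch.randint(20, 980, (b,), device=device)
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise

    # Predict the injected noise with the frozen 2D model.
    with torch.no_grad():
        eps_pred = diffusion_model(noisy, t, text_emb)

    # SDS gradient w(t) * (eps_pred - noise); it flows back through the
    # differentiable renderer into the 3D representation, but never
    # through the diffusion U-Net itself.
    grad = (1.0 - a_t) * (eps_pred - noise)

    # Surrogate loss whose gradient w.r.t. `rendered` equals `grad`.
    return (grad.detach() * rendered).sum()
```

Calling `backward()` on this loss updates whatever produced `rendered` (e.g. a NeRF or mesh renderer); the paper's insight is that such supervision is more stable when the diffusion model is trained on normal-depth renderings rather than RGB images.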

Details of the Normal-Depth Diffusion Model

The Normal-Depth diffusion model is innovative in that it captures the joint distribution of normal maps and depth, both of which are crucial for describing the shape and structure of a scene. By training on a large dataset of image-caption pairs, with depth and normals estimated by generalizable prior models, and then fine-tuning on synthetic datasets, the model covers a wide variety of real-world scenes while maintaining generalization. Coupled with an albedo diffusion model, this approach helps separate material reflectance from illumination effects, leading to more accurate appearance modeling for generated 3D objects.
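
As a rough illustration of the data side, the sketch below assembles the kind of 4-channel normal-depth target such a model could be trained to generate, using off-the-shelf monocular depth and normal estimators as priors. The estimator callables and the exact normalization are hypothetical placeholders; the paper's actual preprocessing may differ.

```python
import torch
import torch.nn.functional as F

def make_normal_depth_target(image, depth_estimator, normal_estimator):
    """Build a joint normal-depth training target from an RGB image.

    `depth_estimator` / `normal_estimator` stand in for pretrained
    monocular priors (relative depth and per-pixel surface normals).
    """
    with torch.no_grad():
        depth = depth_estimator(image)      # (B, 1, H, W), relative depth
        normals = normal_estimator(image)   # (B, 3, H, W), approx. unit vectors

    # Normalize relative depth per image to [0, 1] so scales are
    # comparable across in-the-wild training images.
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    depth = (depth - d_min) / (d_max - d_min + 1e-6)

    # Re-normalize normals to unit length and map from [-1, 1] to [0, 1].
    normals = F.normalize(normals, dim=1) * 0.5 + 0.5

    # Concatenate into the joint normal-depth target.
    return torch.cat([normals, depth], dim=1)   # (B, 4, H, W)
```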

Experimental Results and Contributions

When integrated into existing text-to-3D pipelines, the new models significantly enhance the fidelity of generated 3D content. The experimental evaluation against other state-of-the-art methods shows superior results in terms of geometry and texture details. Additional user studies further confirm that the approach yields visually appealing models that align closely with the text prompts. The key contributions of the paper include the development of the Normal-Depth diffusion model and the albedo diffusion model, which bring marked advancements in the text-to-3D domain.

In conclusion, this research represents a substantial step forward in generative 3D modeling from textual descriptions, offering a well-rounded solution to a previously constrained problem area. The approach facilitates the creation of more detailed, accurate 3D models, unlocking new potential applications and improvements for fields like virtual reality, game development, and beyond. Future work, as outlined by the paper, may focus on expanding these techniques to more complex scenarios, such as text-to-scene generation and improved regularization for material properties.

Authors (10)
  1. Lingteng Qiu
  2. Guanying Chen
  3. Xiaodong Gu
  4. Qi Zuo
  5. Mutian Xu
  6. Yushuang Wu
  7. Weihao Yuan
  8. Zilong Dong
  9. Liefeng Bo
  10. Xiaoguang Han