Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors (2312.04963v1)

Published 7 Dec 2023 in cs.CV and cs.AI

Abstract: Most 3D generation research focuses on up-projecting 2D foundation models into the 3D space, either by minimizing 2D Score Distillation Sampling (SDS) loss or fine-tuning on multi-view datasets. Without explicit 3D priors, these methods often lead to geometric anomalies and multi-view inconsistency. Recently, researchers have attempted to improve the genuineness of 3D objects by directly training on 3D datasets, albeit at the cost of low-quality texture generation due to the limited texture diversity in 3D datasets. To harness the advantages of both approaches, we propose Bidirectional Diffusion (BiDiff), a unified framework that incorporates both a 3D and a 2D diffusion process, to preserve both 3D fidelity and 2D texture richness, respectively. Moreover, as a simple combination may yield inconsistent generation results, we further bridge them with novel bidirectional guidance. In addition, our method can be used as an initialization of optimization-based models to further improve the quality of 3D model and efficiency of optimization, reducing the generation process from 3.4 hours to 20 minutes. Experimental results have shown that our model achieves high-quality, diverse, and scalable 3D generation. Project website: https://bidiff.github.io/.

Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D Priors

The paper "Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors" proposes a novel framework named Bidirectional Diffusion (BiDiff) that advances the field of generative models by addressing challenges in text-to-3D object generation. This work effectively integrates both 2D and 3D generative processes to enhance the fidelity and consistency of generated 3D models through a dual diffusion approach.

Methodology Overview

The research builds on existing approaches by focusing on a bidirectional integration of both 2D and 3D priors. Bidirectional Diffusion represents a 3D object using a hybrid of 3D Signed Distance Fields (SDF) and multi-view 2D images, leveraging the strengths of both representations. Specifically, the proposed method employs a 2D diffusion model to enrich textural variety, while a 3D diffusion model ensures geometric accuracy and consistency.
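
To make the hybrid representation concrete, the sketch below shows one plausible way to hold the two modalities together in code. The field names, resolutions, and layout are illustrative assumptions, not the paper's actual data structures.

```python
# Minimal sketch of a hybrid SDF + multi-view representation (hypothetical
# field names; the paper's actual data layout may differ).
from dataclasses import dataclass
import torch


@dataclass
class HybridShape:
    """A 3D object represented jointly as a signed distance field and multi-view images."""
    sdf_grid: torch.Tensor       # (D, D, D) voxelized signed distances: the geometry channel
    mv_images: torch.Tensor      # (V, 3, H, W) RGB renderings from V fixed viewpoints: the texture channel
    camera_poses: torch.Tensor   # (V, 4, 4) world-to-camera matrices, one per view


def make_empty_shape(resolution: int = 64, views: int = 8, image_size: int = 256) -> HybridShape:
    """Allocate an empty hybrid representation at the given resolutions."""
    return HybridShape(
        sdf_grid=torch.zeros(resolution, resolution, resolution),
        mv_images=torch.zeros(views, 3, image_size, image_size),
        camera_poses=torch.eye(4).repeat(views, 1, 1),
    )
```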

The core innovation lies in the bidirectional guidance that aligns the two diffusion processes: the 3D generation is guided by the denoised 2D outputs to ensure textural consistency, while the 2D generation is in turn guided by intermediate 3D states to maintain geometric coherence. Intermediate results from each process thus steer the other's denoising trajectory at every step.
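
The following sketch illustrates this coupling as a single joint denoising step in which each branch nudges the other. The `denoise_3d`/`denoise_2d` steps, the `render`/`lift` operators that map between the SDF and multi-view image domains, and the simple linear blending are all hypothetical simplifications of the paper's conditioning mechanism, offered only to show the information flow.

```python
# A minimal sketch of bidirectional guidance between a 3D and a 2D diffusion
# branch. All callables are assumed to be provided by the caller.
import torch


def bidirectional_step(x3d, x2d, t, denoise_3d, denoise_2d, render, lift,
                       w3d: float = 0.5, w2d: float = 0.5):
    """One joint denoising step in which each branch guides the other."""
    # Each branch first predicts a denoised estimate of its own modality at step t.
    x3d_hat = denoise_3d(x3d, t)   # denoised SDF estimate (geometry)
    x2d_hat = denoise_2d(x2d, t)   # denoised multi-view image estimate (texture)

    # The 3D branch is pulled toward the 2D branch's textures lifted into 3D,
    # and the 2D branch toward renderings of the 3D branch's geometry.
    x3d_next = x3d_hat + w2d * (lift(x2d_hat) - x3d_hat)
    x2d_next = x2d_hat + w3d * (render(x3d_hat) - x2d_hat)
    return x3d_next, x2d_next
```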

Practical Implications and Results

The framework demonstrates markedly reduced generation time compared with traditional optimization-based approaches, producing high-quality 3D models in roughly 40 seconds in its feed-forward setting, whereas conventional SDS-based methods can take several hours. This speed-up is achieved without sacrificing the diversity or quality of the generated textures and geometries.

Quantitative evaluation with metrics such as CLIP R-Precision shows that the BiDiff framework achieves competitive, and in some cases superior, performance relative to other state-of-the-art methods while being substantially more computationally efficient. Furthermore, BiDiff serves as an effective initialization for optimization-based methods, cutting the reported refinement time from 3.4 hours to about 20 minutes while improving final model quality.
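
For context, CLIP R-Precision measures how often the prompt that generated an object is the top-1 retrieval when CLIP compares a rendering of that object against the full pool of evaluation prompts. The sketch below shows one common way to compute it with the open-source openai/CLIP package; the exact rendering and evaluation protocol used in the paper may differ.

```python
# Rough sketch of CLIP R-Precision: embed each rendering and the full prompt
# pool with CLIP, and count how often the ground-truth prompt is ranked first.
import torch
import clip
from PIL import Image


def clip_r_precision(render_paths, prompts, device="cuda"):
    """render_paths[i] is a rendering of the object generated from prompts[i]."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    text = clip.tokenize(prompts).to(device)

    correct = 0
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        for i, path in enumerate(render_paths):
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            img_feat = model.encode_image(image)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            # Retrieval: does the matching prompt score highest for this render?
            sims = (img_feat @ text_feat.T).squeeze(0)
            correct += int(sims.argmax().item() == i)
    return correct / len(render_paths)
```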

Theoretical Contributions

From a theoretical standpoint, this research introduces a cohesive mechanism to combine and synchronize 2D and 3D generative processes, which were previously applied in isolation. The fusion and mutual guidance between the 2D and 3D diffusion models address inherent challenges such as multi-view inconsistency and geometric anomalies typical of unidirectional methods. The use of 3D priors from Shap-E further enhances the geometric robustness of the generated structures, while the pretrained 2D text-to-image diffusion prior ensures high-quality, texturally rich outcomes.

Future Directions

The demonstrated scalability and effectiveness of the BiDiff framework suggest promising future directions. Further exploration could involve applying the framework to more complex and diverse datasets or even expanding it to hyper-realistic generation tasks. Additionally, integrating advanced neural representations or enhancing the current models with larger-scale priors might lead to even more sophisticated generative capabilities.

Conclusion

The paper presents a comprehensive and effective solution for text-based 3D model generation by integrating 2D and 3D priors within a unified bidirectional diffusion framework. This method not only advances the state-of-the-art in generative modeling but also offers practical and theoretical insights into optimizing generation processes through efficient multi-domain collaboration.

References (43)
  1. Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
  2. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  3. Learning implicit fields for generative shape modeling. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  4. SDFusion: Multimodal 3d shape completion, reconstruction, and generation. arXiv, 2022.
  5. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022.
  6. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20637–20647, 2023.
  7. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion, 2023.
  8. Deep Floyd. If project. https://github.com/deep-floyd/IF, 2023.
  9. Sdm-net: Deep generative network for structured deformable mesh. ACM Transactions on Graphics (TOG), 38:1–15, 2019.
  10. Escaping plato’s cave: 3d shape from adversarial rendering. In The IEEE International Conference on Computer Vision (ICCV), 2019.
  11. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  12. Octree transformer: Autoregressive 3d shape generation on hierarchically structured sequences. arXiv preprint arXiv:2111.12480, 2021.
  13. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.
  14. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  15. Clip-mesh: Generating textured meshes from text using pretrained image-text models. SIGGRAPH Asia 2022 Conference Papers, 2022.
  16. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  17. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
  18. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023a.
  19. Meshdiffusion: Score-based generative 3d mesh modeling. In International Conference on Learning Representations, 2023b.
  20. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In European Conference on Computer Vision, pages 210–227. Springer, 2022.
  21. Realfusion: 360 reconstruction of any object from a single image. In CVPR, 2023.
  22. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022.
  23. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663–12673, 2023.
  24. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.
  25. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  26. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 165–174, 2019.
  27. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  28. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  29. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  30. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  31. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  32. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
  33. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023.
  34. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  35. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. arXiv preprint arXiv:2212.00774, 2022.
  36. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  37. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
  38. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pages 82–90, 2016.
  39. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  40. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4541–4550, 2019.
  41. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  42. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978, 2022.
  43. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
Authors (8)
  1. Lihe Ding
  2. Shaocong Dong
  3. Zhanpeng Huang
  4. Zibin Wang
  5. Yiyuan Zhang
  6. Kaixiong Gong
  7. Dan Xu
  8. Tianfan Xue