Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors (2312.04963v1)
Abstract: Most 3D generation research focuses on up-projecting 2D foundation models into 3D space, either by minimizing a 2D Score Distillation Sampling (SDS) loss or by fine-tuning on multi-view datasets. Lacking explicit 3D priors, these methods often produce geometric anomalies and multi-view inconsistency. Recently, researchers have attempted to improve the geometric fidelity of 3D objects by training directly on 3D datasets, albeit at the cost of low-quality texture generation due to the limited texture diversity of 3D datasets. To harness the advantages of both approaches, we propose Bidirectional Diffusion (BiDiff), a unified framework that incorporates both a 3D and a 2D diffusion process, preserving 3D fidelity and 2D texture richness, respectively. Because a naive combination can yield inconsistent results, we further bridge the two processes with novel bidirectional guidance. In addition, our method can serve as an initialization for optimization-based models, improving both the quality of the resulting 3D model and the efficiency of optimization, reducing the generation process from 3.4 hours to 20 minutes. Experimental results show that our model achieves high-quality, diverse, and scalable 3D generation. Project website: https://bidiff.github.io/.
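The abstract describes coupling a 3D diffusion process with a 2D multi-view diffusion process through bidirectional guidance. The sketch below illustrates, purely for intuition, one plausible way such a coupled sampler could be structured: at each denoising step the 3D branch conditions the 2D branch via rendered views, and the 2D branch conditions the 3D branch via lifted features. It is not the paper's implementation; every callable and parameter name here (denoiser_3d, denoiser_2d, render_views, lift_to_3d, add_noise, the guidance weights) is an assumed placeholder.

```python
# Minimal sketch (not the authors' released code) of a bidirectional sampler
# coupling a 3D and a 2D diffusion process. All callables passed to the
# constructor are hypothetical placeholders.

import torch


class BidirectionalSampler:
    def __init__(self, denoiser_3d, denoiser_2d, render_views, lift_to_3d,
                 add_noise, num_steps=50, w_3d_to_2d=1.0, w_2d_to_3d=1.0):
        self.denoiser_3d = denoiser_3d    # 3D branch, e.g. a U-Net over a feature volume
        self.denoiser_2d = denoiser_2d    # 2D branch, e.g. a multi-view image diffusion model
        self.render_views = render_views  # renders an intermediate 3D estimate into each camera view
        self.lift_to_3d = lift_to_3d      # unprojects multi-view images back into the 3D representation
        self.add_noise = add_noise        # re-noises a clean estimate to a given timestep
        self.num_steps = num_steps
        self.w_3d_to_2d = w_3d_to_2d      # strength of 3D -> 2D guidance
        self.w_2d_to_3d = w_2d_to_3d      # strength of 2D -> 3D guidance

    @torch.no_grad()
    def sample(self, prompt_emb, x3d, x2d, cameras):
        """x3d: noisy 3D representation; x2d: noisy multi-view images."""
        for t in reversed(range(self.num_steps)):
            # 3D branch predicts a cleaner 3D representation from the current noisy state.
            x3d_hat = self.denoiser_3d(x3d, t, prompt_emb)

            # 3D -> 2D guidance: render the intermediate geometry and condition
            # the 2D branch on those renderings so the views stay consistent.
            renders = self.render_views(x3d_hat, cameras)
            x2d_hat = self.denoiser_2d(x2d, t, prompt_emb,
                                       cond=self.w_3d_to_2d * renders)

            # 2D -> 3D guidance: lift the denoised views back into 3D space so the
            # 3D branch inherits the 2D model's texture detail at the next step.
            lifted = self.lift_to_3d(x2d_hat, cameras)
            x3d_hat = x3d_hat + self.w_2d_to_3d * lifted

            if t > 0:
                # Re-noise both estimates down to the next noise level.
                x3d = self.add_noise(x3d_hat, t - 1)
                x2d = self.add_noise(x2d_hat, t - 1)
            else:
                x3d, x2d = x3d_hat, x2d_hat

        return x3d, x2d
```

The point of the sketch is that guidance flows in both directions inside every denoising step, rather than running a 3D model and a 2D model sequentially and merging their outputs afterward.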
Authors:
- Lihe Ding
- Shaocong Dong
- Zhanpeng Huang
- Zibin Wang
- Yiyuan Zhang
- Kaixiong Gong
- Dan Xu
- Tianfan Xue