SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D (2310.02596v2)
Abstract: Lifting 2D results from pre-trained diffusion models into a 3D world for text-to-3D generation is inherently ambiguous. 2D diffusion models learn only view-agnostic priors and thus lack 3D knowledge during the lifting, which leads to the multi-view inconsistency problem. We find that this problem stems primarily from geometric inconsistency, and that avoiding misplaced geometric structures substantially mitigates it in the final outputs. We therefore improve consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, which addresses the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. Only coarse 3D information is used for the alignment. This "coarse" alignment not only resolves the multi-view inconsistency in geometry but also retains the ability of 2D diffusion models to generate detailed, diverse, high-quality objects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, generalizing well to unseen shapes and visual appearances while greatly alleviating the multi-view inconsistency problem. Our method sets a new state of the art, with a consistency rate of over 85% in human evaluation, whereas many previous methods achieve around 30%. Our project page is https://sweetdreamer3d.github.io/
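To make the notion of a "view-specific coordinate map of a canonically oriented 3D object" more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' rendering pipeline, which the abstract does not describe: the orthographic camera, the point-splatting renderer, the [-0.5, 0.5]^3 canonical range, and the function names `look_at_rotation` and `render_ccm` are all hypothetical choices made for this example.

```python
import numpy as np


def look_at_rotation(azimuth_deg, elevation_deg):
    """World-to-camera rotation for a camera orbiting the origin (y is up).

    Assumes |elevation| < 90 degrees so the up vector is well defined.
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Unit vector pointing from the origin towards the camera position.
    to_cam = np.array([np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)])
    forward = -to_cam  # the camera looks at the origin
    right = np.cross(forward, [0.0, 1.0, 0.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    return np.stack([right, up, forward])  # rows: camera x, y, depth axes


def render_ccm(points, azimuth_deg, elevation_deg, res=64):
    """Splat canonical points into a (res, res, 3) coordinate map.

    `points` is an (N, 3) array assumed to lie in [-0.5, 0.5]^3 in the
    object's canonical frame. Covered pixels store the canonical xyz of the
    nearest point, shifted to [0, 1]; empty pixels stay at 0.
    """
    R = look_at_rotation(azimuth_deg, elevation_deg)
    cam = points @ R.T  # (N, 3) camera-space coordinates
    # Orthographic projection scaled so the cube's diagonal fits the image.
    scale = 0.5 * np.sqrt(3.0)
    u = ((cam[:, 0] / scale + 1) * 0.5 * (res - 1)).round().astype(int)
    v = ((1 - (cam[:, 1] / scale + 1) * 0.5) * (res - 1)).round().astype(int)
    ccm = np.zeros((res, res, 3))
    # Write far points first so the nearest point at each pixel wins.
    for i in np.argsort(-cam[:, 2]):
        ccm[v[i], u[i]] = points[i] + 0.5
    return ccm


# Two points on the canonical z axis, seen from the front and from the back.
pts = np.array([[0.0, 0.0, 0.4], [0.0, 0.0, -0.4]])
front = render_ccm(pts, azimuth_deg=0, elevation_deg=0, res=65)
back = render_ccm(pts, azimuth_deg=180, elevation_deg=0, res=65)
```

In this sketch a given surface point receives the same value no matter which camera observes it, while the set of visible points changes with the viewpoint. A map of this kind therefore encodes coarse, view-dependent geometry in a frame shared across views, which is the sort of target the abstract describes the fine-tuned diffusion model producing.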
- Weiyu Li
- Rui Chen
- Xuelin Chen
- Ping Tan