HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance (2305.18766v4)
Abstract: The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.
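Two of the ideas in the abstract lend themselves to a short sketch: a timestep-annealing schedule that lowers the sampled diffusion timestep as optimization proceeds, and a regularizer on the variance of z-coordinates along each NeRF ray. The sketch below is illustrative only: the square-root decay, the jitter width, and both function names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sample_annealed_timestep(step, total_steps, t_max=980, t_min=20, rng=None):
    """Sample a diffusion timestep that shrinks over the course of optimization.

    Hypothetical schedule: the center of the sampling interval decays from
    t_max to t_min following a square-root curve, with a small uniform
    jitter retained for stochasticity.
    """
    rng = rng or np.random.default_rng()
    frac = step / max(total_steps - 1, 1)
    t_mid = t_max + (t_min - t_max) * np.sqrt(frac)  # annealed center
    jitter = 0.1 * (t_max - t_min)                   # assumed jitter width
    t = t_mid + rng.uniform(-jitter, jitter)
    return int(np.clip(t, t_min, t_max))

def z_variance_loss(z_vals, weights, eps=1e-8):
    """Weighted variance of sample depths along a ray.

    z_vals:  (..., n_samples) depths of samples along each ray
    weights: (..., n_samples) NeRF compositing weights for those samples
    Penalizing this variance encourages density to concentrate on a thin
    surface, which is the stated goal of the z-variance regularizer.
    """
    w = weights / (weights.sum(axis=-1, keepdims=True) + eps)
    mean_z = (w * z_vals).sum(axis=-1, keepdims=True)
    return (w * (z_vals - mean_z) ** 2).sum(axis=-1)
```

In a training loop, the annealed timestep would replace the usual uniform timestep draw of score distillation, and the z-variance loss would be added to the rendering objective with a small weight.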
Authors: Junzhe Zhu, Peiye Zhuang, Sanmi Koyejo