
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior (2401.09050v2)

Published 17 Jan 2024 in cs.CV and cs.LG

Abstract: Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but remain vulnerable to geometry collapse and poor textures. To address this issue, we first analyze SDS in depth and find that its distillation sampling process corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample, which then serves as guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to diverse and unpredictable samples that are not always less noisy, and thus do not provide consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling deterministically and consistently converges to the same target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given an image rendered by a 3D model, we first estimate its desired 3D score function with a pre-trained 2D diffusion model and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss that samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide the noisier one, distilling the deterministic prior into the 3D model. Experimental results show the efficacy of Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes. The code is available at https://github.com/sail-sg/Consistent3D.


Summary

  • The paper introduces Consistent3D, a novel method that uses an ODE-based framework and Consistency Distillation Sampling (CDS) to provide deterministic, consistent guidance for text-to-3D generation.
  • Consistent3D generates highly consistent, high-fidelity 3D objects and scenes, outperforming baseline methods like DreamFusion and Magic3D in qualitative and quantitative evaluations.
  • The research demonstrates the potential of deterministic ODE frameworks for robust generative tasks, paving the way for more reliable and efficient text-to-3D systems.

Consistent3D: Advancements in High-Fidelity Text-to-3D Generation

This essay provides an expert overview of the research paper "Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior". The paper introduces a novel approach to text-to-3D generation that addresses key limitations of current state-of-the-art methods such as Score Distillation Sampling (SDS). The authors propose a methodology that leverages deterministic sampling priors to enhance the consistency, fidelity, and diversity of 3D models generated from textual descriptions.

Background and Motivation

The paper notes that significant advancements have been achieved in text-to-3D generation due to large-scale datasets and pre-trained 2D diffusion models. However, the prevalent method, SDS, exhibits instability, often struggling with geometry collapse and producing textures that lack fidelity. The root cause identified is the stochastic nature of SDS, inherited from its alignment with the Stochastic Differential Equation (SDE) framework, which can introduce unpredictable variability into the model optimization process.

To remedy these shortcomings, the paper explores an alternative framework by aligning the sampling process with the corresponding Ordinary Differential Equation (ODE), theoretically capable of providing deterministic and consistent guidance for 3D model generation.
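
Concretely, the ODE in question is the probability-flow ODE from score-based generative modeling (Song et al., 2021): for any forward diffusion SDE there is an ODE that shares the same marginal densities p_t, so integrating it deterministically reaches a sample from the same target distribution:

```latex
% Forward diffusion SDE
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Probability-flow ODE with the same marginals p_t(x)
\mathrm{d}\mathbf{x} = \Big[ \mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \Big]\, \mathrm{d}t
```

In practice the score term is approximated by the pre-trained 2D diffusion model's noise prediction, which is what makes deterministic trajectory sampling usable as a guidance signal.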

Methodology

The core contribution of the paper is the introduction of the "Consistent3D" method. This approach involves the transition from an SDE-based framework to an ODE-based one, which is expected to improve the predictability and reliability of the model's optimization trajectory.

To operationalize the ODE framework, the authors propose a Consistency Distillation Sampling (CDS) loss. CDS distills the deterministic sampling prior into the 3D model by perturbing the rendered image with a fixed Gaussian noise, so that the guidance signal stays consistent across training iterations. The theoretical analysis suggests that such consistent guidance should alleviate the unreliable geometry and low-resolution textures that plague SDS. A time-step scheduling strategy is further introduced so that optimization progressively engages the lower-noise, higher-fidelity regime of the diffusion model, allowing for better convergence of the 3D representation.
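
The exact loss and weighting live in the paper and its released code; what follows is only a minimal sketch of one CDS-style iteration under simplifying assumptions (a variance-exploding noise parameterization and a single Euler step of the probability-flow ODE; `render_fn`, `eps_model`, and the sigma values are hypothetical stand-ins, not the paper's API):

```python
import torch
import torch.nn.functional as F

def cds_step(rendered, eps_model, text_emb, fixed_noise, sigma_t, sigma_s):
    """Sketch of one Consistency Distillation Sampling iteration.

    rendered:    image from the differentiable 3D renderer (requires grad).
    eps_model:   frozen pretrained 2D diffusion model predicting noise.
    fixed_noise: Gaussian noise sampled once and reused every iteration,
                 per the paper's fixed-perturbation design.
    sigma_t > sigma_s: two adjacent noise levels on the ODE trajectory.
    """
    # Perturb the rendering up to the higher noise level sigma_t.
    x_t = rendered + sigma_t * fixed_noise

    with torch.no_grad():
        # Score estimate at (x_t, sigma_t), then one deterministic Euler
        # step along the probability-flow ODE toward sigma_s.
        eps = eps_model(x_t, sigma_t, text_emb)
        x_s = x_t + (sigma_s - sigma_t) * eps  # the less-noisy sample

    # The less-noisy ODE sample guides the noisier one; the gradient
    # flows back through x_t into the 3D model's parameters.
    return F.mse_loss(x_t, x_s)
```

Because `x_s` is computed without gradients, the gradient of this loss with respect to the rendering is proportional to `(sigma_t - sigma_s) * eps`: the 3D model is nudged along the deterministic ODE direction rather than toward a freshly resampled stochastic target.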

Results and Implications

The experimental results are compelling, showcasing Consistent3D's ability to generate highly consistent, high-fidelity 3D objects and large-scale scenes from textual prompts. Notably, it improves over baseline methods such as DreamFusion and Magic3D, as reflected in both qualitative comparisons and quantitative metrics such as CLIP R-Precision.
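
For reference, CLIP R-Precision asks whether a rendered view retrieves its own prompt among all prompts in the evaluation set via CLIP similarity. A minimal sketch of the metric at R = 1, assuming pre-computed, L2-normalized CLIP embeddings:

```python
import torch

def clip_r_precision(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Fraction of renders whose own prompt is the top-1 CLIP match.

    image_embs: (N, D) embeddings of one rendered view per prompt.
    text_embs:  (N, D) embeddings of the N prompts, in the same order.
    Both are assumed L2-normalized, so dot products are cosine similarities.
    """
    sims = image_embs @ text_embs.T            # (N, N) similarity matrix
    top1 = sims.argmax(dim=1)                  # best-matching prompt per render
    hits = top1 == torch.arange(sims.shape[0])
    return hits.float().mean().item()
```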

In theoretical terms, the methodology highlights how deterministic ODE frameworks can serve as more reliable mechanisms for complex generative tasks compared to their stochastic counterparts. Practically, this has implications for improving the robustness and efficiency of future text-to-3D systems. The deterministic nature also simplifies implementations in time-sensitive or resource-constrained settings like real-time rendering and interactive applications.

Future Directions

While the Consistent3D framework marks a significant step forward, the authors acknowledge limitations regarding biases inherent in pre-trained models and challenges in modeling intricate 3D scenarios. Addressing these areas could involve developing generative models that integrate robust 3D-centric training and devising techniques to mitigate undesired biases.

Continued exploration into deterministic frameworks and their applicability to other generative tasks may foster advancements in methodologies beyond text-to-3D, influencing practices in related fields such as virtual reality content creation, robotics, and beyond.

In summary, the Consistent3D framework offers a promising path for high-fidelity and consistent text-to-3D generation, setting a new standard in the integration of diffusion models with advanced sampling strategies.
