
SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity (2401.00604v2)

Published 31 Dec 2023 in cs.CV

Abstract: Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation inherently suffers from high variance. Through the lens of variance reduction, the effectiveness of SDS and VSD can be interpreted as applications of various control variates to the Monte Carlo estimator of the distilled score. Motivated by this rethinking and based on Stein's identity, we propose a more general solution to reduce variance for score distillation, termed Stein Score Distillation (SSD). SSD incorporates control variates constructed via Stein's identity, allowing for arbitrary baseline functions. This enables us to include flexible guidance priors and network architectures to explicitly optimize for variance reduction. In our experiments, the overall pipeline, dubbed SteinDreamer, is implemented by instantiating the control variate with a monocular depth estimator. The results suggest that SSD can effectively reduce the distillation variance and consistently improve visual quality for both object- and scene-level generation. Moreover, we demonstrate that SteinDreamer achieves faster convergence than existing methods due to more stable gradient updates.

Authors (11)
  1. Peihao Wang (43 papers)
  2. Zhiwen Fan (52 papers)
  3. Dejia Xu (37 papers)
  4. Dilin Wang (37 papers)
  5. Sreyas Mohan (20 papers)
  6. Forrest Iandola (23 papers)
  7. Rakesh Ranjan (44 papers)
  8. Yilei Li (21 papers)
  9. Qiang Liu (405 papers)
  10. Zhangyang Wang (375 papers)
  11. Vikas Chandra (75 papers)
Citations (13)

Summary

Overview of Text-to-3D Asset Synthesis

The synthesis of 3D assets from textual descriptions is an increasingly important area in computer graphics and vision, with applications in gaming, virtual reality, and filmmaking. Traditionally, developing 3D content from text prompts requires substantial human effort and resources. A recently advanced method for automating this process is score distillation, in which a pretrained 2D text-to-image diffusion model guides the optimization of a 3D representation: the model's predicted scores on rendered 2D views are lifted and back-propagated to the 3D parameters. This approach harnesses the power of diffusion models, which have shown great success in generating detailed 2D imagery from textual descriptions.
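To make the mechanics concrete, below is a minimal 1-D sketch of an SDS-style update, not the paper's actual pipeline: the "3D parameter" is a scalar, the "rendering" is the identity map, and the diffusion model is mocked by the analytic denoiser of a Gaussian prior N(MU, 1). The names `MU`, `denoiser`, the noise level, and the step size are all illustrative assumptions.

```python
import random

def sds_gradient(theta, denoiser, n_samples=64, sigma=1.0):
    """Monte Carlo estimate of an SDS-style gradient.

    For a toy 1-D 'rendering' x = theta (so dx/dtheta = 1), the update
    direction is E_eps[eps_hat(theta + sigma*eps) - eps].
    """
    total = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)       # injected diffusion noise
        x_t = theta + sigma * eps          # noised rendering
        eps_hat = denoiser(x_t, sigma)     # model's predicted noise
        total += eps_hat - eps
    return total / n_samples

# Mock denoiser for a Gaussian "image prior" N(MU, 1): the optimal
# noise prediction is eps_hat = (x_t - MU) * sigma / (sigma**2 + 1).
MU = 2.0
def denoiser(x_t, sigma):
    return (x_t - MU) * sigma / (sigma ** 2 + 1.0)

random.seed(0)
theta = -3.0
for _ in range(500):
    theta -= 0.1 * sds_gradient(theta, denoiser)
print(f"theta = {theta:.2f} (prior mean {MU})")  # theta is pulled toward MU
```

Even in this toy setting the gradient is a Monte Carlo average over noise draws, which is exactly where the variance issues discussed next come from.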

Challenges in Score Distillation

Despite recent progress, generating 3D models from textual prompts faces significant technical challenges. A fundamental issue with score distillation is the high variance inherent in its gradient estimation. This variance can lead to inefficient optimization and less accurate 3D representations. It arises from the stochastic sampling of camera views and diffusion noise when rendering 2D projections from the 3D model, and the problem is compounded because computational constraints force these samples to be drawn in small batches. To address this, researchers have introduced control variates into the estimator: auxiliary terms with known zero mean that, when designed effectively, can significantly reduce variance without introducing bias.
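The control-variate idea can be illustrated on a toy Monte Carlo problem, unrelated to 3D and purely for intuition: estimating E[exp(X)] for X ~ N(0, 1), using g(X) = X, whose mean E[g] = 0 is known exactly, as the control. The coefficient c = e^{1/2} used below is the analytically optimal Cov(f, g)/Var(g) for this particular toy problem.

```python
import math
import random
import statistics

random.seed(1)
c = math.exp(0.5)  # optimal coefficient Cov(f, g) / Var(g) for this toy problem

plain, controlled = [], []
for _ in range(20000):
    x = random.gauss(0.0, 1.0)
    f = math.exp(x)
    plain.append(f)                        # vanilla estimator sample of E[exp(X)]
    controlled.append(f - c * (x - 0.0))   # subtract c * (g(x) - E[g]); E[g] = 0

# Both estimators are unbiased for E[exp(X)], but subtracting the
# zero-mean control term shrinks the sample variance.
print("plain variance:     ", statistics.variance(plain))
print("controlled variance:", statistics.variance(controlled))
print(statistics.variance(controlled) < statistics.variance(plain))  # True
```

In the paper's framing, SDS and VSD correspond to particular fixed choices of such a control term for the distilled score, which is what motivates searching over a broader family.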

Introducing Stein Score Distillation

Building upon the concept of control variates, the paper proposes Stein Score Distillation (SSD). Stein's identity serves as the basis for constructing the control variates: it guarantees zero mean for a whole family of terms parameterized by an arbitrary baseline function, yielding a much broader class of control variates than SDS or VSD. The overall pipeline, SteinDreamer, instantiates SSD with a monocular depth estimator as the baseline to refine the gradient computation. The paper's results show that SSD decreases distillation variance, improves visual quality at both object and scene levels, and speeds up convergence compared to predecessor methods.
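The key property SSD exploits can be checked numerically. Stein's identity states that E_p[∇ log p(x) · φ(x) + ∇φ(x)] = 0 for any sufficiently smooth, suitably decaying baseline function φ, so such a term can be added to an estimator without biasing it. Below is a toy check with p = N(0, 1) and an arbitrarily chosen φ(x) = sin(x); the paper instead instantiates φ with guidance priors such as a depth estimator, and here the score and φ are hand-picked assumptions.

```python
import math
import random
import statistics

random.seed(0)

def score(x):        # closed-form score of p = N(0, 1): d/dx log p(x) = -x
    return -x

def phi(x):          # arbitrary smooth baseline function (illustrative choice)
    return math.sin(x)

def phi_prime(x):    # its derivative
    return math.cos(x)

# Stein's identity: E_p[score(x) * phi(x) + phi'(x)] = 0, so this term is
# a valid zero-mean control variate for any admissible phi.
stein_terms = [
    score(x) * phi(x) + phi_prime(x)
    for x in (random.gauss(0.0, 1.0) for _ in range(100000))
]
print("sample mean:", statistics.mean(stein_terms))  # approx. 0, up to MC error
```

Because the identity holds for any admissible φ, φ itself becomes a free design variable, which is what lets SSD explicitly optimize the baseline for variance reduction rather than fixing it a priori.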

Experimental Validation

Extensive experiments were conducted to validate SteinDreamer across various scenarios. For object-level generation, SteinDreamer produced 3D models with more detailed textures and smoother geometry while avoiding common artifacts such as multi-face distortions. Scene-level tests showed that SSD-enabled generation yields sharper, more detailed imagery. Additionally, tracking the gradient variance over the course of optimization shows that SteinDreamer maintains consistently lower variance than existing methods throughout training.

Conclusion

In summary, the SteinDreamer pipeline, powered by SSD, represents a significant step forward in text-to-3D asset creation. It not only improves the visual fidelity of the generated 3D models but also accelerates the convergence of the generation process. By providing a more stable and reliable update mechanism for the 3D parameters, SteinDreamer offers an effective solution that could streamline the creation of complex 3D content across multiple applications.
