
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior (2312.06655v1)

Published 11 Dec 2023 in cs.CV, cs.GR, and cs.LG

Abstract: Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure strong multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data available. In contrast, distilling 2D diffusion models (2D lifting) achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
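
To make the idea of combining a 2D lifting objective with a structural guidance and a semantic guidance more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how either guidance is computed, so the edge-map comparison (Sobel filter), the cosine-similarity term, the weights `w_struct` and `w_sem`, and all function names are hypothetical choices made for this example. The `lifting_loss` argument stands in for whatever 2D distillation loss is used.

```python
import math

# Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Gradient magnitude of a 2D grid (list of rows), interior pixels only."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def structural_loss(rendered, coarse):
    """Assumed structural term: mean squared difference between the edge
    maps of a rendered view and the same view of the coarse 3D prior."""
    a, b = sobel_magnitude(rendered), sobel_magnitude(coarse)
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def semantic_loss(feat_rendered, feat_coarse):
    """Assumed semantic term: 1 - cosine similarity of two nonzero
    feature vectors (e.g. image embeddings of the two views)."""
    dot = sum(p * q for p, q in zip(feat_rendered, feat_coarse))
    norm = (math.sqrt(sum(p * p for p in feat_rendered))
            * math.sqrt(sum(q * q for q in feat_coarse)))
    return 1.0 - dot / norm


def total_loss(lifting_loss, rendered, coarse, feat_rendered, feat_coarse,
               w_struct=1.0, w_sem=1.0):
    """Weighted sum of the 2D lifting loss and the two guidance terms."""
    return (lifting_loss
            + w_struct * structural_loss(rendered, coarse)
            + w_sem * semantic_loss(feat_rendered, feat_coarse))
```

In this sketch, both guidance terms vanish when the rendered view agrees with the coarse prior, so the total reduces to the lifting loss alone; any disagreement in edges or features adds a penalty scaled by the corresponding weight.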

Authors (5)
  1. Fangfu Liu (17 papers)
  2. Diankun Wu (3 papers)
  3. Yi Wei (60 papers)
  4. Yongming Rao (50 papers)
  5. Yueqi Duan (47 papers)