Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model (2311.06214v2)

Published 10 Nov 2023 in cs.CV

Abstract: Text-to-3D with diffusion models has achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization which suffers from slow inference, low diversity and Janus problems, or are feed-forward methods that generate low-quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate diverse 3D assets of high visual quality within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: https://jiahao.ai/instant3d/.


Summary

  • The paper demonstrates a two-stage pipeline combining sparse-view generation with a transformer-based 3D reconstruction model to rapidly produce 3D assets from text prompts.
  • It generates an asset in roughly 20 seconds, a two-orders-of-magnitude speedup over traditional optimization-based methods.
  • Quantitative CLIP-score assessments and diverse qualitative outputs underscore its competitive performance and potential for enhancing 3D content creation in VR and design.

Instant3D: Efficient Text-to-3D Asset Generation

The paper "Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model" introduces an innovative method for generating high-quality 3D assets from text prompts, termed Instant3D. The paper addresses the efficiency and quality issues often associated with existing text-to-3D methodologies—particularly those relying on score distillation-based optimization or feed-forward approaches plagued by data scarcity.

Methodological Framework

Instant3D operates via a two-stage pipeline:

  1. Sparse-view Generation: A fine-tuned 2D text-to-image diffusion model produces a sparse set of four structured, mutually consistent views in a single denoising pass; the four views are tiled into one 2x2 grid image, so cross-view consistency is enforced within a single generation rather than across separate samples. Pretrained 2D diffusion models excel at producing high-quality images from text but struggle to carry that quality over to 3D, since 3D training data is scarce. Fine-tuning them for this sparse-view task preserves their strong 2D priors while avoiding per-prompt optimization, which is the source of the method's large inference speed-up.
  2. 3D Reconstruction: The generated views are passed to a novel transformer-based reconstructor that regresses a NeRF directly from the four images in a single forward pass. The model infers reliable geometry from this minimal visual input: its large vision-transformer backbone provides the parameter capacity needed to render high-quality results even from such restricted image evidence (a structural sketch of the full pipeline follows this list).
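
To make the flow concrete, below is a minimal structural sketch of the two-stage pipeline. It is an illustration under stated assumptions, not the authors' released code: `four_view_diffusion`, `split_grid`, and `reconstructor` are hypothetical stand-ins whose bodies are placeholders for the fine-tuned diffusion model and transformer reconstructor described above, and the resolutions and triplane shape are arbitrary.

```python
import torch

H = W = 512  # assumed per-view resolution

def four_view_diffusion(prompt: str, gen: torch.Generator) -> torch.Tensor:
    """Stage-1 stand-in: a fine-tuned text-to-image diffusion model that emits
    one 2x2 grid image holding four consistent views in a single pass."""
    # Placeholder: a real model would denoise from seeded Gaussian noise.
    return torch.rand(3, 2 * H, 2 * W, generator=gen)

def split_grid(grid: torch.Tensor) -> torch.Tensor:
    """Cut the 2x2 grid image back into four separate (3, H, W) views."""
    top, bottom = torch.chunk(grid, 2, dim=1)
    views = [*torch.chunk(top, 2, dim=2), *torch.chunk(bottom, 2, dim=2)]
    return torch.stack(views)  # (4, 3, H, W)

def reconstructor(views: torch.Tensor) -> torch.Tensor:
    """Stage-2 stand-in: a large transformer that maps the four views to a
    triplane NeRF in one forward pass -- no per-prompt optimization loop."""
    return torch.zeros(3, 32, 64, 64)  # placeholder triplane feature volume

def text_to_3d(prompt: str, seed: int = 0) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)  # varying the seed varies the asset
    views = split_grid(four_view_diffusion(prompt, gen))
    return reconstructor(views)

triplane = text_to_3d("a blue jay standing on a basket of macarons")
```

Because stage 1 is a single diffusion sample rather than a mode-seeking optimization, re-running the pipeline with a different seed yields a genuinely different asset, which is where the method's output diversity comes from.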

Quantitative and Qualitative Achievements

Instant3D's chief strength is efficiency: it generates a 3D asset in approximately 20 seconds, two orders of magnitude faster than prior optimization-based methods that take 1 to 10 hours. Quantitatively, it is validated against baselines including DreamFusion and ProlificDreamer, achieving competitive or superior CLIP scores for adherence to the text prompt.
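
CLIP score here refers to the similarity between CLIP embeddings of rendered views and the prompt. The sketch below, using the Hugging Face transformers CLIP API, shows one common way to compute it; the specific checkpoint and the averaging over rendered views are assumptions rather than the paper's exact evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(renders: list, prompt: str) -> float:
    """Mean cosine similarity between rendered views and the text prompt."""
    inputs = processor(text=[prompt], images=renders,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# e.g. score four renders of a generated asset against its prompt:
# score = clip_score([Image.open(f"view_{i}.png") for i in range(4)], "a pumpkin mug")
```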

Qualitative results further show that Instant3D produces diverse outputs from a single prompt, in contrast to score distillation-based methods, whose mode-seeking optimization tends to collapse different runs to similar results. Instant3D also maintains visual realism while mitigating known failure modes of distillation-based generation such as oversaturation and multi-faced geometry (the Janus problem).

Implications and Future Directions

The development of Instant3D has significant implications for 3D asset creation, virtual reality environments, and rapid design iteration. By cutting generation time while improving output diversity and detail fidelity, it lays the groundwork for more user-friendly and responsive generative systems. The paper also demonstrates the value of data priors from state-of-the-art 2D models, pointing future research toward further harnessing and extending these priors for 3D production.

Furthermore, the introduction of sparse-view generation opens pathways for hybrid models that combine 2D and 3D supervision to support more complex and nuanced creative outputs. A promising direction is extending these models to interactive applications that respond dynamically to user input across both visual and textual domains.

The success of Instant3D marks clear progress in the text-to-3D pipeline, positioning it as a practical tool for both academic and industrial settings that demand efficient 3D asset synthesis.