Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation (2411.02293v5)

Published 4 Nov 2024 in cs.CV and cs.AI

Abstract: While 3D generative models have greatly improved artists' workflows, the existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D 1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset given the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Our Hunyuan3D 1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.


Summary

  • The paper introduces a novel two-stage framework that enhances 3D asset quality and speed by combining multi-view diffusion and feed-forward reconstruction.
  • It leverages a fine-tuned 2D diffusion model with a zero-elevation camera orbit to capture detailed multi-view images, achieving fast image generation and reconstruction.
  • The research demonstrates superior performance on metrics such as Chamfer Distance and F-score on the GSO and OmniObject3D benchmarks, for both text-to-3D and image-to-3D generation.

An Analytical Overview of Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

The paper "Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation" addresses significant challenges in the field of 3D generative modeling, particularly those associated with the inefficiencies in current diffusion models. The core proposition revolves around a two-stage procedural framework that enhances both the speed and quality of 3D asset generation, offering an integrated approach for both text-conditioned and image-conditioned 3D generation tasks.

Framework and Methodology

The proposed Hunyuan3D-1.0 framework implements a novel two-stage approach:

  1. Multi-view Diffusion Model: The first stage uses a multi-view diffusion model to generate multi-view RGB images that capture details of the 3D asset from different viewpoints, relaxing the problem from single-view to multi-view reconstruction and completing in approximately 4 seconds. It fine-tunes a large-scale 2D diffusion model so that it produces consistent multi-view images, effectively injecting 3D spatial awareness, and adopts a zero-elevation camera orbit to maximize the visible area shared across the generated views (a minimal sketch of such an orbit appears after this list).
  2. Feed-forward Reconstruction Model: The second stage reconstructs the 3D asset from the generated views with a feed-forward model in roughly 7 seconds. The reconstruction network is trained to tolerate the noise and inconsistencies introduced by the multi-view diffusion stage while leveraging information from the condition image to recover the 3D structure accurately.
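
The following is a minimal sketch, not the authors' code, of how a zero-elevation orbit like the one used in stage one might be parameterized: cameras sit on a horizontal circle around the object and all look at its centre. The view count, orbit radius, and OpenGL-style pose convention are assumptions chosen for illustration.

```python
# Minimal sketch of a zero-elevation camera orbit (illustrative parameters).
import numpy as np

def orbit_camera_poses(num_views: int = 6, radius: float = 1.5) -> np.ndarray:
    """Return (num_views, 4, 4) camera-to-world matrices on a zero-elevation orbit."""
    poses = []
    for azimuth in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        # Camera position on a horizontal circle around the origin (elevation = 0).
        eye = np.array([radius * np.cos(azimuth), radius * np.sin(azimuth), 0.0])
        forward = -eye / np.linalg.norm(eye)   # look at the origin
        up = np.array([0.0, 0.0, 1.0])         # world z is up
        right = np.cross(forward, up)
        right /= np.linalg.norm(right)
        true_up = np.cross(right, forward)
        c2w = np.eye(4)
        # OpenGL-style convention: columns are [right, up, -forward], last column is position.
        c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, true_up, -forward, eye
        poses.append(c2w)
    return np.stack(poses)

poses = orbit_camera_poses()
print(poses.shape)  # (6, 4, 4)
```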

The framework incorporates recent advances in text-to-image diffusion, notably Hunyuan-DiT, which maps a text prompt to a condition image so that the same image-to-3D pipeline serves both text and image inputs; a hypothetical sketch of this unified entry point follows. In addition, the standard version scales to roughly three times the parameters of the lite version and other existing models, increasing capacity and output quality without sacrificing computational efficiency.
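
Purely for illustration, the sketch below shows how both conditioning modes can share one entry point: a text prompt is first turned into a condition image (the role the paper assigns to Hunyuan-DiT), after which the image-to-3D path is identical. Every function name here is a hypothetical placeholder, not the released API, and the bodies are stubs that only convey the control flow.

```python
# Hypothetical control flow of the unified text/image-to-3D pipeline (placeholder names).
from typing import Any, Optional

def text_to_image(prompt: str) -> Any:
    raise NotImplementedError("stand-in for the text-to-image model (Hunyuan-DiT)")

def multiview_diffusion(image: Any) -> Any:
    raise NotImplementedError("stand-in for stage 1: multi-view RGB generation (~4 s)")

def feedforward_reconstruction(views: Any, condition: Any) -> Any:
    raise NotImplementedError("stand-in for stage 2: sparse-view reconstruction (~7 s)")

def generate_3d(text: Optional[str] = None, image: Any = None) -> Any:
    """Unified entry point: accepts either a text prompt or a condition image."""
    if image is None:
        if text is None:
            raise ValueError("provide a text prompt or a condition image")
        image = text_to_image(text)  # text branch joins the image branch here
    views = multiview_diffusion(image)                          # stage 1
    return feedforward_reconstruction(views, condition=image)   # stage 2
```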

Quantitative and Qualitative Evaluations

The paper reports strong quantitative results, comparing favorably with existing state-of-the-art methods on benchmarks such as GSO and OmniObject3D. The framework achieves superior scores on metrics such as Chamfer Distance (CD) and F-score at several thresholds, supporting the efficacy of the proposed design (a minimal sketch of these two metrics follows). Qualitatively, Hunyuan3D-1.0 produces more accurate textures and higher geometric fidelity on complex structures.
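
As a reference for these metrics, the sketch below computes Chamfer Distance and F-score between two point clouds; the sampling density and the 0.1 distance threshold are illustrative and not the paper's exact evaluation protocol.

```python
# Minimal sketch of Chamfer Distance and F-score between sampled point clouds.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.1):
    """pred: (N, 3), gt: (M, 3). Returns (Chamfer Distance, F-score at threshold tau)."""
    d_pred_to_gt = cKDTree(gt).query(pred)[0]   # nearest-GT distance for each predicted point
    d_gt_to_pred = cKDTree(pred).query(gt)[0]   # nearest-prediction distance for each GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()     # predicted points close to the GT surface
    recall = (d_gt_to_pred < tau).mean()        # GT points covered by the prediction
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, fscore

pred = np.random.rand(2048, 3)
gt = np.random.rand(2048, 3)
print(chamfer_and_fscore(pred, gt))
```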

Implications and Future Developments

This research signifies a substantial contribution to both the fields of computer vision and graphics. From a practical standpoint, it streamlines the creation of high-quality 3D assets, which is invaluable for gaming, virtual reality, and e-commerce. Theoretically, it challenges existing paradigms by demonstrating the feasibility of fast and generalized 3D generation within a unified framework.

Future avenues could explore extending the model to support even larger datasets or enhancing its integration with real-time applications. The interplay between hybrid inputs (pose-known and pose-unknown) and reconstructed outputs presents opportunities for further optimization and fine-tuning.

In conclusion, Hunyuan3D-1.0 provides a cogent framework that significantly advances the efficiency and quality of 3D generation, all while maintaining flexibility across different input modalities. The proposed methodologies present exciting pathways for continuing advancements in the automatic 3D asset generation landscape.
