Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

Published 2 Jul 2024 in cs.CV, cs.AI, and cs.GR | (2407.02445v1)

Abstract: We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading in the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen generates first several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a sign-distance function to represent 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io

Summary

  • The paper introduces a novel two-stage pipeline that uses SDF-driven geometry and dual-channel inputs to achieve explicit PBR material decomposition.
  • It leverages fine-tuned diffusion for multi-view image synthesis alongside a sparse-view neural reconstructor to generate high-fidelity 3D meshes.
  • Empirical results highlight a 17% reduction in Chamfer Distance, a 40% LPIPS improvement, and a 72% human preference rate over competing methods.

Meta 3D AssetGen: High-Fidelity Text-to-Mesh Generation with Physical Material Control

Introduction

Meta 3D AssetGen introduces an architecture for text- or image-driven 3D mesh generation with explicit physically-based rendering (PBR) material control, combining fast generation with high-fidelity geometry and texture and, critically, with decomposed material properties for downstream relighting and graphics workflows. Unlike prior generators that bake appearance into per-object textures, AssetGen enforces a disentangled albedo/metalness/roughness model, yielding relightable, reusable assets. Figure 1

Figure 1: Meta 3D AssetGen: text/image-conditioned 3D mesh generator with PBR outputs (albedo, metalness, roughness), enabling high-fidelity relighting across environments.

Two-Stage Generation Architecture

AssetGen adopts a two-stage pipeline in the spirit of fast view-conditioned generators:

  1. Text-to-Multi-View Image Synthesis: A large-scale text-to-image diffusion model, fine-tuned on captioned 3D data, predicts a grid of four views per prompt. Each view comprises a 3-channel shaded image and a 3-channel albedo image; the diffusion prior promotes cross-view consistency, while the factored shaded/albedo channels help resolve material ambiguity downstream.
  2. Conditional Image-to-3D Reconstruction: A sparse-view neural reconstructor processes the multiview, dual-channel input to yield a continuous SDF field and dense 3D fields for albedo, roughness, and metalness, enabling direct mesh extraction and material mapping. Texture refinement is performed in UV-space via a transformer conditioned on view projections for spatial detail recovery. Figure 2

    Figure 2: Overview of AssetGen’s pipeline: (blue) text-to-multiview with shaded/albedo, (orange) image-to-3D with SDF/UV PBR extraction, (green) UV-space texture refinement.
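For orientation, the two-stage flow above can be summarized as a stubbed skeleton. This is a hypothetical outline under assumed shapes (four 256×256 views with 6 channels, 128-resolution triplanes); the real stages are a fine-tuned multi-view diffusion model and a learned sparse-view reconstructor, neither of which is reproduced here.

```python
import torch

# Stubbed skeleton of the two-stage flow (assumed shapes; not the released implementation).

def stage1_text_to_multiview(prompt: str) -> torch.Tensor:
    # Stage 1 stand-in: 4 views, 6 channels each (3 shaded RGB + 3 albedo RGB).
    return torch.rand(4, 6, 256, 256)

def stage2_image_to_3d(views: torch.Tensor) -> dict:
    # Stage 2 stand-in: triplane features representing the SDF and the PBR fields.
    return {"sdf_triplane": torch.randn(3, 32, 128, 128),
            "pbr_triplane": torch.randn(3, 32, 128, 128)}

def generate_asset(prompt: str) -> dict:
    views = stage1_text_to_multiview(prompt)   # (4, 6, H, W): shaded + albedo per view
    fields = stage2_image_to_3d(views)         # SDF + albedo/metalness/roughness fields
    # Mesh extraction and UV texture refinement would follow (see the later sketches).
    return fields

fields = generate_asset("a brass goblet")
```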

Key Implementation Details

  • SDF-Driven Geometry: A triplane signed-distance-function (SDF) representation improves isosurface extraction and geometric regularity and supports direct supervision of signed-distance values, avoiding the artifacts of earlier opacity-based volumetric fields (a minimal triplane-query sketch follows after this list).
  • Physically-Based Material Decomposition: Given the multichannel input (shaded + albedo), the network is supervised with a per-pixel deferred shading loss that efficiently approximates ground-truth renderings during backpropagation with low memory overhead (a simplified sketch of this loss also appears below).
  • Texture Refinement: A UNet-based transformer fuses extracted mesh UVs with reprojected source views via cross-view attention, enhancing microtexture and repairing information loss during volumetric rendering.
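To make the triplane-SDF idea concrete, the sketch below queries a signed distance by projecting a 3D point onto three axis-aligned feature planes and decoding the concatenated features with a small MLP, then supervises the output directly with an L1 loss on ground-truth distances. Feature dimensions, the decoder, and the placeholder ground truth are illustrative assumptions, not the paper's architecture (which additionally relies on fused kernels for memory efficiency).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneSDF(nn.Module):
    """Toy triplane SDF: three learnable feature planes plus a small MLP decoder."""
    def __init__(self, channels: int = 32, resolution: int = 128):
        super().__init__()
        # One feature plane per axis-aligned projection (XY, XZ, YZ).
        self.planes = nn.Parameter(torch.randn(3, channels, resolution, resolution) * 0.01)
        self.decoder = nn.Sequential(
            nn.Linear(3 * channels, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points in [-1, 1]^3; returns (N,) signed distances.
        coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # plane projections
        feats = []
        for plane, uv in zip(self.planes, coords):
            # grid_sample takes (1, C, H, W) features and a (1, N, 1, 2) sampling grid.
            sampled = F.grid_sample(plane[None], uv[None, :, None, :], align_corners=True)
            feats.append(sampled[0, :, :, 0].T)  # (N, C)
        return self.decoder(torch.cat(feats, dim=-1)).squeeze(-1)

# Direct SDF supervision on sampled points (placeholder ground-truth distances).
model = TriplaneSDF()
points = torch.rand(1024, 3) * 2 - 1
gt_sdf = torch.rand(1024) * 0.2 - 0.1
loss = F.l1_loss(model(points), gt_sdf)
loss.backward()
```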
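The deferred shading loss can be read as: rasterize the predicted albedo, metalness, roughness, and normals into image-space G-buffers, shade them analytically under known lighting, and penalize the difference to a ground-truth shaded render. The sketch below uses a single directional light and a Blinn-Phong-style specular stand-in rather than the paper's full PBR shading, so treat it as an assumed simplification of the supervision signal.

```python
import torch
import torch.nn.functional as F

def deferred_shading_loss(albedo, metalness, roughness, normals, gt_shaded,
                          light_dir=(0.0, 0.0, 1.0), view_dir=(0.0, 0.0, 1.0)):
    """Illustrative deferred-shading loss on image-space G-buffers.

    albedo: (B, 3, H, W), metalness/roughness: (B, 1, H, W),
    normals: (B, 3, H, W) unit normals, gt_shaded: (B, 3, H, W) reference render.
    """
    dtype = albedo.dtype
    l = F.normalize(torch.tensor(light_dir, dtype=dtype), dim=0).view(1, 3, 1, 1)
    v = F.normalize(torch.tensor(view_dir, dtype=dtype), dim=0).view(1, 3, 1, 1)
    h = F.normalize(l + v, dim=1)                                   # half vector
    n_dot_l = (normals * l).sum(1, keepdim=True).clamp(min=0.0)
    n_dot_h = (normals * h).sum(1, keepdim=True).clamp(min=0.0)

    f0 = 0.04 * (1.0 - metalness) + albedo * metalness              # reflectance at normal incidence
    shininess = 2.0 / roughness.clamp(min=1e-3).pow(4) - 2.0        # crude roughness-to-exponent map
    diffuse = albedo * (1.0 - metalness) / torch.pi                 # metals get no diffuse term
    specular = f0 * n_dot_h.pow(shininess.clamp(min=1.0))           # Blinn-Phong stand-in lobe

    pred_shaded = (diffuse + specular) * n_dot_l
    return F.l1_loss(pred_shaded, gt_shaded)

# Example call with random G-buffers.
B, H, W = 2, 64, 64
loss = deferred_shading_loss(
    albedo=torch.rand(B, 3, H, W), metalness=torch.rand(B, 1, H, W),
    roughness=torch.rand(B, 1, H, W),
    normals=F.normalize(torch.randn(B, 3, H, W), dim=1),
    gt_shaded=torch.rand(B, 3, H, W),
)
```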

Empirical Evaluation

Sparse-View 3D Reconstruction

AssetGen exhibits clear improvements over established image-to-3D baselines (Instant3D-LRM, InstantMesh, GRM, MeshLRM, LightplaneLRM) in both geometric and textural metrics. Specifically, a 17% reduction in Chamfer Distance and a 40% improvement in LPIPS are reported relative to state-of-the-art few-view mesh reconstructor baselines. Notably, geometry fidelity is substantially improved by direct SDF loss and SDF-based differentiable rendering. Figure 3

Figure 3: AssetGen qualitative performance on sparse view mesh reconstruction, showing superior geometry (orange overlays) and texture compared to baselines; texture refinement yields further improvements.
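For reference, Chamfer Distance between point clouds sampled from the predicted and ground-truth surfaces can be computed as below; this is the generic symmetric definition, not necessarily the paper's exact evaluation protocol (sampling density, normalization, and squared-vs-unsquared conventions vary across papers).

```python
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds pred (N, 3) and gt (M, 3)."""
    d = torch.cdist(pred, gt)                    # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Example with random point clouds standing in for sampled mesh surfaces.
pred_pts = torch.rand(2048, 3)
gt_pts = torch.rand(2048, 3)
print(chamfer_distance(pred_pts, gt_pts))
```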

Text-to-3D Generation with PBR Decomposition

Compared to industry and academic solutions supporting PBR, AssetGen achieves a 72% preference rate in human studies over the best fast baselines (Meshy v3, Luma Genie) in both visual quality and text-prompt alignment, while operating at significantly reduced time-to-asset (30s asset synthesis vs. minutes for "refinement" competitors). Figure 4

Figure 4: Text-to-3D: AssetGen produces well-decoupled, high-fidelity materials; better metalness/roughness separation than Luma Genie and superior visual fidelity vs. baselines.

Figure 5

Figure 5: Demonstration of AssetGen’s PBR decomposition: precise albedo, metalness, and roughness separation, visually highlighting nuanced material control and relighting capability.

Evaluating Material Disentanglement

An explicit ablation of input strategies demonstrates that dual-channel input (shaded and albedo) outperforms single-channel or direct PBR prediction from the 2D generator—resolving material ambiguity and improving PBR map accuracy. Figure 6

Figure 6: Prompt-driven material control (“a cat made of <MATERIAL>”): AssetGen yields plausible, physically meaningful material maps under environmental lighting.

Texture Refiner and Loss Formulation

AssetGen’s UV texture refiner is built on a dual-stream UNet with cross-view attention, where the main branch receives the coarse PBR/UV and geometric attributes, and side branches accept backprojected source views. Cross-view attention enables per-texel aggregation of the sharpest view-aligned evidence, further regularized by a deferred shading loss computed in rendered image-space. Figure 7

Figure 7: (a) Architecture of the cross-view attention in the texture refiner; (b) deferred shading loss enforces physically correct material channel recovery by penalizing discrepancies in relit images.
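The cross-view attention step can be pictured as UV texel features (queries) attending over tokens from the backprojected source views (keys/values), so each texel pulls detail from whichever views observe it most sharply. The module below is a simplified stand-in built on standard multi-head attention with assumed token layouts; it is not the paper's exact dual-stream UNet.

```python
import torch
import torch.nn as nn

class CrossViewTexelAttention(nn.Module):
    """Simplified cross-view attention: UV texel tokens attend over per-view tokens."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, uv_feats: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        # uv_feats:   (B, T, C)   one token per UV texel (coarse PBR + geometric attributes)
        # view_feats: (B, V*S, C) tokens from the backprojected source views
        q = self.norm_q(uv_feats)
        kv = self.norm_kv(view_feats)
        out, _ = self.attn(q, kv, kv)
        return uv_feats + out                     # residual update of the texel features

# Toy shapes: 1024 texel tokens attending over 4 views x 256 tokens each.
layer = CrossViewTexelAttention()
refined = layer(torch.randn(2, 1024, 128), torch.randn(2, 4 * 256, 128))
```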

Material Model and PBR Details

Meta 3D AssetGen’s material model implements a composite BRDF as:

$$ f(\omega_i, \omega_o \mid k(x), n) = \frac{R}{\pi} + \frac{F(h \mid n)\, D(h \mid n)\, G_1(n, \omega_i)\, G_1(n, \omega_o)}{4\, (n \cdot \omega_i)\, (n \cdot \omega_o)} $$

where $R = \rho_0 (1-\gamma)$, $F_0 = \mathbf{1}(1-\gamma) + \rho_0 \gamma$, and $k(x) = (\rho_0, \gamma, \alpha)$. The model is parameterized for pipeline compatibility with PBR-based shading engines and enables real-time relighting under dynamic environment maps.
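Reading $k(x) = (\rho_0, \gamma, \alpha)$ as base color (albedo), metalness, and roughness, the sketch below evaluates a microfacet BRDF of this general form using the standard GGX normal distribution, Smith $G_1$ masking-shadowing, and a Schlick Fresnel term with $F_0$ blended by metalness. The specific term definitions (notably the Fresnel parameterization) are assumptions chosen for a conventional implementation and may differ in detail from the paper's.

```python
import torch

def eval_brdf(albedo, metalness, roughness, n, l, v, eps: float = 1e-6):
    """Microfacet BRDF sketch: diffuse term R/pi plus F*D*G1*G1 / (4 (n.l)(n.v)).

    Vectors n, l, v are unit vectors with shape (..., 3); albedo is (..., 3);
    metalness and roughness are (..., 1). All tensors broadcast.
    """
    h = torch.nn.functional.normalize(l + v, dim=-1)                 # half vector
    n_dot_l = (n * l).sum(-1, keepdim=True).clamp(min=eps)
    n_dot_v = (n * v).sum(-1, keepdim=True).clamp(min=eps)
    n_dot_h = (n * h).sum(-1, keepdim=True).clamp(min=eps)
    h_dot_v = (h * v).sum(-1, keepdim=True).clamp(min=eps)

    a2 = (roughness ** 2) ** 2                                       # alpha^2 with alpha = roughness^2
    d = a2 / (torch.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)   # GGX normal distribution

    def g1(n_dot_x):                                                 # Smith masking-shadowing (GGX)
        return 2.0 * n_dot_x / (n_dot_x + torch.sqrt(a2 + (1.0 - a2) * n_dot_x ** 2))

    f0 = 0.04 * (1.0 - metalness) + albedo * metalness               # assumed dielectric F0 = 0.04
    fresnel = f0 + (1.0 - f0) * (1.0 - h_dot_v) ** 5                 # Schlick approximation

    diffuse = albedo * (1.0 - metalness) / torch.pi
    specular = fresnel * d * g1(n_dot_l) * g1(n_dot_v) / (4.0 * n_dot_l * n_dot_v)
    return diffuse + specular

# Example: evaluate one shading point.
n = torch.tensor([[0.0, 0.0, 1.0]])
l = torch.nn.functional.normalize(torch.tensor([[0.3, 0.2, 1.0]]), dim=-1)
v = torch.nn.functional.normalize(torch.tensor([[-0.2, 0.1, 1.0]]), dim=-1)
print(eval_brdf(torch.tensor([[0.8, 0.6, 0.4]]), torch.tensor([[0.5]]),
                torch.tensor([[0.3]]), n, l, v))
```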

Limitations and Future Directions

  • View Consistency: The 2D diffusion model, even when fine-tuned for view-consistency, is not always perfect, causing artefacts or Janus issues in extreme scenarios.
  • Sparse SDF Representation: The triplane SDF is memory-efficient, but it allocates capacity uniformly across the planes, which is wasteful for sparse, object-centric content; future work could transition to sparse voxel or octree encodings.
  • Limited Channel Scope: AssetGen focuses on base color (albedo), metalness, and roughness, omitting emissive and ambient-occlusion (AO) channels; further work is needed to support universal asset compatibility.
  • Object-Scale Only: Current evaluation is limited to object-scale 3D asset synthesis; the methodology provides a foundation for scene-scale and multi-object procedural modeling.

Implications and Prospects

AssetGen establishes a scalable paradigm for text-to-mesh 3D synthesis with full PBR material support at inference speeds amenable to interactive or iterative design loops. The explicit decomposition into relightable components advances the ecosystem for asset creation in AR/VR, digital content creation, and gaming, lowering the barrier for non-experts while maintaining integrability with physically-plausible render pipelines.

On the theoretical front, the combination of SDF-based geometric priors, dual-stream multi-view supervision, and UV-positioned refinement integrates the advantages of neural implicit and explicit mesh pipelines, challenging the established dichotomy in 3D asset generation.

Further research should investigate Octree/SparseHash-based volumetric fields for better scalability, integrate differentiable simulators for dynamic or physical property transfer, and extend material models to full spectral, emission, and subsurface domains.

Conclusion

Meta 3D AssetGen demonstrates that decoupling the stochastic material assignment from geometry prediction, leveraging an SDF-centric reconstruction pipeline, and executing UV-domain texture refinement enables high-quality, scalable, PBR-compatible text-to-mesh synthesis. Its results challenge the trade-off between synthesis speed and visual/material fidelity, and provide substantial empirical and methodological advances for future physically-plausible 3D asset generation.
