
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models (2404.07191v2)

Published 10 Apr 2024 in cs.CV

Abstract: We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.


Summary

  • The paper introduces a feed-forward framework that leverages multiview diffusion and sparse-view reconstruction models to generate high-quality 3D meshes from a single image.
  • The approach integrates a differentiable iso-surface extraction module to optimize mesh quality and training efficiency, achieving generation in just 10 seconds.
  • The results demonstrate significant improvements in 3D consistency and scalability, advancing practical applications in virtual reality, gaming, and industrial design.

InstantMesh: An Open-Source Framework for Instant 3D Mesh Generation from Single Images

Introduction to 3D Mesh Generation

The rapid advancement of 3D generative artificial intelligence has opened avenues for transforming single-view images into elaborate 3D models. Such transformations have extensive applications across virtual reality, industrial design, gaming, and animation. However, generating high-quality 3D assets from single images is far from trivial, primarily due to the inherent difficulty of inferring 3D information from 2D data and the limitations in quality and scale of existing 3D datasets.

Previous Works and Their Limitations

Prior techniques have explored various approaches, including distilling 2D diffusion models into 3D representations and employing large reconstruction models (LRMs) to directly map image tokens to 3D outputs. While these methods have shown promising directions, their practical utility is hampered by issues such as lengthy generation times, multi-view inconsistencies (the "Janus" problem), and constraints in training efficiency and scalability.

InstantMesh: High-Quality and Efficient 3D Mesh Generation

InstantMesh addresses these challenges with a feed-forward framework for generating high-quality 3D meshes from a single image. The framework combines the strengths of a multiview diffusion model, which produces 3D-consistent multi-view images, with a sparse-view LRM that predicts the mesh directly, all within a roughly 10-second turnaround. A differentiable iso-surface extraction module integrated into the framework allows optimization directly on the mesh representation, significantly improving training efficiency and output quality.
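The two-stage pipeline described above can be sketched as follows. This is a minimal, stubbed illustration: the function names, view count, and returned structures are assumptions for exposition, not the released InstantMesh API.

```python
# Hypothetical sketch of the two-stage InstantMesh pipeline; both stages
# are stubbed, since the real models are large neural networks.

def generate_multiviews(image, num_views=6):
    """Stage 1: a multiview diffusion model turns a single input image
    into num_views 3D-consistent views (stubbed as labeled strings)."""
    return [f"view_{i}({image})" for i in range(num_views)]

def reconstruct_mesh(views):
    """Stage 2: a sparse-view LRM maps the generated views to a mesh
    via differentiable iso-surface extraction (stubbed output)."""
    return {"vertices": len(views) * 100, "faces": len(views) * 200}

def instant_mesh(image):
    """End-to-end: image -> multi-view images -> 3D mesh."""
    return reconstruct_mesh(generate_multiviews(image))

mesh = instant_mesh("chair.png")
```

Because both stages are feed-forward, the whole pipeline is a single pass with no per-shape optimization loop, which is what makes the 10-second turnaround possible.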

Technical Framework

InstantMesh's architecture consists of two primary components:

  • Multi-View Diffusion Model: This model synthesizes consistent multi-view images from an input image, enhancing 3D consistency.
  • Sparse-View Large Reconstruction Model: Adapted from Instant3D's approach, this model predicts 3D meshes from the generated multi-view images. Integrating FlexiCubes for iso-surface extraction allows geometric supervision (e.g., depths and normals) to be applied directly on the mesh, enhancing its quality.
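To make the iso-surface idea concrete, here is a toy 2D analogue: finding where a signed distance grid crosses an iso level along grid edges, using linear interpolation. FlexiCubes performs the differentiable 3D version of this; the code below is only an illustrative sketch, not the paper's method.

```python
def isocontour_points(sdf, iso=0.0):
    """Toy 2D zero-crossing extraction: locate where a signed distance
    grid crosses the iso level along grid edges, placing a point by
    linear interpolation (the 2D analogue of iso-surface extraction)."""
    pts = []
    rows, cols = len(sdf), len(sdf[0])
    # Horizontal edges: neighbors in the same row.
    for r in range(rows):
        for c in range(cols - 1):
            a, b = sdf[r][c] - iso, sdf[r][c + 1] - iso
            if a * b < 0:          # sign change => crossing on this edge
                t = a / (a - b)    # interpolation weight along the edge
                pts.append((r, c + t))
    # Vertical edges: neighbors in the same column.
    for r in range(rows - 1):
        for c in range(cols):
            a, b = sdf[r][c] - iso, sdf[r + 1][c] - iso
            if a * b < 0:
                t = a / (a - b)
                pts.append((r + t, c))
    return pts

# Negative inside, positive outside: a single interior point on a 3x3 grid.
grid = [[1.0,  1.0, 1.0],
        [1.0, -1.0, 1.0],
        [1.0,  1.0, 1.0]]
points = isocontour_points(grid)  # 4 crossings around the center cell
```

Because the crossing positions depend smoothly on the grid values through the interpolation weight `t`, gradients can flow from mesh-level losses back to the field, which is what makes direct mesh supervision possible.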

Key Improvements and Results

  • Enhanced 3D Consistency and Quality: By synergizing multiview generation with direct mesh prediction, InstantMesh significantly outperforms current image-to-3D models, achieving state-of-the-art results in both qualitative and quantitative evaluations.
  • Efficient and Scalable Training: The differentiable iso-surface extraction module enables the efficient use of high-resolution images and geometric data for supervising the model, resulting in smoother meshes and improved training scalability.
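A minimal sketch of the kind of direct geometric supervision this enables, assuming per-view rendered RGB, depth, and normal buffers and illustrative L1 weights (the actual loss terms and weights used in the paper may differ):

```python
def geometric_loss(pred, gt, w_rgb=1.0, w_depth=1.0, w_normal=1.0):
    """Weighted L1 losses over rendered RGB, depth, and normal buffers,
    computed directly on renders of the extracted mesh."""
    def l1(a, b):
        # Mean absolute difference between two flat buffers.
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return (w_rgb * l1(pred["rgb"], gt["rgb"])
            + w_depth * l1(pred["depth"], gt["depth"])
            + w_normal * l1(pred["normal"], gt["normal"]))

pred = {"rgb": [0.5], "depth": [2.0], "normal": [0.0, 1.0]}
gt = {"rgb": [0.0], "depth": [1.0], "normal": [0.0, 0.0]}
loss = geometric_loss(pred, gt)  # 0.5 + 1.0 + 0.5 = 2.0
```

Supervising depth and normals on the mesh itself, rather than only on volumetric renders, is what the differentiable iso-surface module makes tractable at high resolution.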

Practical Implications and Future Directions

InstantMesh presents a substantial leap towards realizing the potential of 3D generative AI in practical applications. It showcases the possibility of generating detailed, high-quality 3D assets from single images rapidly, opening new frontiers for content creators, researchers, and industries reliant on 3D modeling. Looking ahead, future iterations of this framework could explore enhancing resolution capabilities, improving multi-view consistency with advanced diffusion models, and increasing the model's efficacy in capturing fine details and complex structures.

Conclusion

The introduction of InstantMesh marks a significant advancement in image-to-3D asset generation, addressing major bottlenecks such as generation speed, multi-view consistency, and training efficiency. Its open-source release underscores the commitment to furthering research and application development in the 3D generative AI domain. As the field continues to evolve, InstantMesh offers a foundational model for new explorations and innovations, setting a new benchmark for rapid, high-quality 3D mesh generation from single images.
