CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner (2405.14979v1)

Published 23 May 2024 in cs.GR and cs.CV

Abstract: We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficacy in producing superior-quality 3D assets compared to existing methods. HomePage: https://craftsman3d.github.io/, Code: https://github.com/wyysf-98/CraftsMan


Summary

  • The paper introduces CraftsMan, which generates high-fidelity 3D models by integrating a 3D native diffusion model with a normal-based geometry refiner.
  • It employs a two-stage approach that first creates coarse geometries and then allows interactive user edits to refine intricate surface details.
  • Experimental results on the Google Scanned Objects dataset show that CraftsMan matches or outperforms existing methods on Chamfer Distance and Volume IoU.

Overview of CraftsMan: A Generative 3D Modeling System

The paper presents a novel generative 3D modeling system, termed CraftsMan, which generates high-fidelity 3D geometries featuring varied shapes, regular mesh topologies, and detailed surfaces. Its distinguishing feature is interactive refinement of the generated geometry. Traditional 3D generation methods often struggle with time-consuming optimization, irregular mesh topologies, noisy surfaces, and limited support for user edits. CraftsMan addresses these issues by drawing inspiration from the workflow of a craftsman, who first roughs out the general shape and subsequently refines the intricate details.

System Architecture

The CraftsMan system comprises two key stages: a 3D native diffusion model for coarse geometry generation and a normal-based geometry refiner for enhancing surface details. This separation enables efficient and robust 3D asset creation from a single reference image or a text prompt.

  1. 3D Native Diffusion Model: This model operates in a latent space learned from latent set-based 3D representations. Given a text prompt or a reference image, a multi-view (MV) diffusion model first generates multiple consistent views, which are then fed into the MV-conditioned 3D diffusion model to produce the coarse geometry in seconds. Conditioning on multiple views markedly improves the robustness and generalizability of the generated 3D assets.
  2. Geometry Refinement: The refinement stage uses a normal-based geometry refiner to enhance surface details. Refinement can run automatically or interactively with user-supplied edits. The process is underpinned by ControlNet-tile and surface normal map diffusion, enabling efficient mesh optimization while preserving the original topology (a simplified sketch of this step follows the list).
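
To make the refinement stage concrete, below is a minimal, self-contained sketch of normal-guided vertex optimization in PyTorch. It displaces mesh vertices so that face normals match a set of target normals (for example, normals predicted by a normal map diffusion model or edited by a user) while the connectivity stays fixed. The function names and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Toy normal-guided mesh refinement: optimize per-vertex offsets so that
# face normals match target normals, keeping the topology (faces) fixed.
# Illustrative sketch only; not the CraftsMan implementation.
import torch

def face_normals(verts: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    """Unit normal of each triangle, differentiable w.r.t. verts."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = torch.cross(v1 - v0, v2 - v0, dim=1)
    return n / (n.norm(dim=1, keepdim=True) + 1e-8)

def refine(verts, faces, target_normals, iters=200, lr=1e-2, lam=0.1):
    """Displace vertices so face normals align with target_normals (F, 3)."""
    offsets = torch.zeros_like(verts, requires_grad=True)
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        n = face_normals(verts + offsets, faces)
        align = (1.0 - (n * target_normals).sum(dim=1)).mean()  # cosine loss
        reg = lam * offsets.pow(2).mean()  # keep displacements small
        (align + reg).backward()
        opt.step()
    return (verts + offsets).detach()
```

In the full system, the enhanced normal maps produced by the diffusion model would drive an optimization of this kind across multiple rendered views; the sketch collapses that loop into a single fixed target for clarity.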

Numerical Evaluation and Results

Extensive experiments were conducted on the Google Scanned Objects (GSO) dataset, using Chamfer Distance (CD) and Volume Intersection over Union (IoU) as metrics. Quantitative results show that CraftsMan achieves comparable or superior performance to current generative models in producing high-quality 3D assets.

For instance, CraftsMan recorded a Chamfer Distance of 0.0355 and a Volume IoU of 0.5092, outperforming methods such as Point-E and Shap-E, which had higher Chamfer Distances and lower IoUs. Compared with InstantMesh, which produces accurate but less detailed geometries, CraftsMan generated intricate and faithful representations of the input prompts with significantly lower inference time.
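
For reference, the following is a hedged sketch of how these two metrics are commonly computed; exact conventions (point sampling density, normalization, voxel resolution) vary between papers, so this follows standard definitions rather than the authors' exact evaluation protocol.

```python
# Common definitions of Chamfer Distance and Volume IoU.
# Conventions (normalization, sample counts) are assumptions, not
# necessarily the paper's exact protocol.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    d_pq = cKDTree(q).query(p)[0]  # nearest-neighbor distance, p -> q
    d_qp = cKDTree(p).query(q)[0]  # nearest-neighbor distance, q -> p
    return d_pq.mean() + d_qp.mean()

def volume_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """Intersection over union of two boolean occupancy grids."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return float(inter) / float(union)
```

Lower Chamfer Distance and higher Volume IoU indicate a closer match to the ground-truth shape.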

Implications and Future Directions

The implications of CraftsMan are multifaceted. Practically, it offers an efficient and user-friendly tool for industries such as video gaming, augmented reality, and film production, where rapid and detailed 3D asset creation is in high demand. Theoretically, the system sets a precedent for integrating multi-view conditions and interactive refinement within generative 3D modeling, thus addressing long-standing challenges in the field.

Future developments in this area might focus on enhancing the controllability of the Latent Set Diffusion Model and exploring methods for generating textures along with geometries. Additionally, further research into expanding and diversifying the 3D datasets used for training could significantly improve the generalizability and robustness of such models.

Conclusion

CraftsMan represents a significant advancement in generative 3D modeling, effectively bridging the gap between coarse geometry generation and detailed refinement. Its ability to combine multi-view diffusion conditions with interactive refinement opens new avenues for producing high-fidelity 3D assets efficiently. While there remain challenges and opportunities for further enhancement, CraftsMan demonstrates the potential for next-generation 3D modeling systems in both research and practical applications.