Make-A-Shape: a Ten-Million-scale 3D Shape Model (2401.11067v2)

Published 20 Jan 2024 in cs.CV and cs.GR

Abstract: Significant progress has been made in training large generative models for natural language and images. Yet, the advancement of 3D generative models is hindered by their substantial resource demands for training, along with inefficient, non-compact, and less expressive representations. This paper introduces Make-A-Shape, a new 3D generative model designed for efficient training on a vast scale, capable of utilizing 10 million publicly-available shapes. On the technical side, we first innovate a wavelet-tree representation to compactly encode shapes by formulating the subband coefficient filtering scheme to efficiently exploit coefficient relations. We then make the representation generatable by a diffusion model by devising the subband coefficient packing scheme to lay out the representation in a low-resolution grid. Further, we derive the subband adaptive training strategy to train our model to effectively learn to generate coarse and detail wavelet coefficients. Last, we extend our framework to be controlled by additional input conditions, enabling it to generate shapes from assorted modalities, e.g., single/multi-view images, point clouds, and low-resolution voxels. In our extensive set of experiments, we demonstrate various applications, such as unconditional generation, shape completion, and conditional generation on a wide range of modalities. Our approach not only surpasses the state of the art in delivering high-quality results but also efficiently generates shapes within a few seconds, often in just 2 seconds for most conditions. Our source code is available at https://github.com/AutodeskAILab/Make-a-Shape.

Summary

  • The paper introduces a wavelet-tree representation that nearly losslessly encodes high-resolution 3D shapes at a ten-million scale.
  • It employs a diffusion model with subband adaptive training to capture both coarse structures and fine geometric details efficiently.
  • The framework supports conditional generation from diverse inputs such as images, point clouds, and voxels, enabling tasks like zero-shot shape completion.

Introduction

In the pursuit of more advanced 3D generative models, a gap remains in representation efficacy and training efficiency on large datasets. Make-A-Shape bridges this gap: the framework supports efficient large-scale training and handles over ten million shapes, a substantial step toward resolving the prevailing bottlenecks in 3D generative modeling.

The Wavelet-Tree Representation

Make-A-Shape introduces the wavelet-tree representation, obtained by applying a wavelet decomposition to a high-resolution SDF grid. The result retains both coarse and detail subband coefficients, combining expressiveness with compactness, a vital property for streaming and training on extensive 3D shape datasets. By keeping these coefficients rather than discarding high-frequency details for learning efficiency, the representation encodes 3D shapes nearly losslessly, in contrast to prior models that trade detail for efficiency.
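For intuition, here is a minimal sketch of a wavelet-tree-style encoding using PyWavelets. The paper's subband coefficient filtering scheme exploits relations between coefficients across subbands; this sketch makes the simpler assumption of keeping the coarse subband plus the largest-magnitude detail coefficients, so the function names and the filtering rule are illustrative, not the paper's exact method.

```python
# Minimal sketch, assuming a simple magnitude-based filter in place of the
# paper's subband coefficient filtering scheme.
import numpy as np
import pywt

def wavelet_tree_encode(sdf, wavelet="bior6.8", level=3, keep_ratio=0.05):
    """Decompose a 3D SDF grid and sparsify the detail subbands."""
    coeffs = pywt.wavedecn(sdf, wavelet, level=level)   # [coarse, {details}, ...]
    coarse = coeffs[0]                                   # low-resolution approximation
    details = []
    for lvl in coeffs[1:]:
        filtered = {}
        for key, band in lvl.items():                    # keys like 'aad', 'dda', 'ddd'
            flat = np.abs(band).ravel()
            k = max(1, int(keep_ratio * flat.size))
            thresh = np.partition(flat, -k)[-k]          # k-th largest magnitude
            filtered[key] = np.where(np.abs(band) >= thresh, band, 0.0)
        details.append(filtered)
    return [coarse] + details

def wavelet_tree_decode(coeffs, wavelet="bior6.8"):
    """Invert the decomposition; near-lossless when discarded coefficients are small."""
    return pywt.waverecn(coeffs, wavelet)

# Toy usage: a 64^3 random grid stands in for a real truncated SDF.
sdf = np.random.randn(64, 64, 64).astype(np.float32)
recon = wavelet_tree_decode(wavelet_tree_encode(sdf))
```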

Efficient Training with the Diffusion Model

To make the representation learnable, the wavelet-tree coefficients are packed into a low-resolution grid layout amenable to a diffusion-based generative model. A subband adaptive training strategy then ensures the model captures the full spectrum of shape detail, from coarse structure to fine geometric detail, avoiding the collapse or ineffective learning that a naive mean-squared-error objective over all coefficients would invite.
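Below is a hedged sketch of what such a subband-adaptive objective could look like, assuming the coarse subband occupies the first channel of the packed grid and that "significant" detail coefficients are those above a magnitude threshold; the channel layout, threshold, and weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative subband-adaptive loss; the channel split, threshold, and
# 0.5 weighting are assumptions, not the paper's exact scheme.
import torch
import torch.nn.functional as F

def subband_adaptive_loss(pred, target, coarse_channels=1, sig_thresh=1e-2):
    # pred, target: (B, C, D, H, W) packed wavelet-coefficient grids.
    # Channels [0, coarse_channels) hold the coarse subband; the rest hold details.
    coarse_loss = F.mse_loss(pred[:, :coarse_channels], target[:, :coarse_channels])

    detail_pred = pred[:, coarse_channels:]
    detail_tgt = target[:, coarse_channels:]
    significant = detail_tgt.abs() > sig_thresh      # mask of informative coefficients
    insig = ~significant

    # Rebalance the sparse significant coefficients against the abundant
    # near-zero ones -- the failure mode of a naive MSE over everything.
    sig_loss = (F.mse_loss(detail_pred[significant], detail_tgt[significant])
                if significant.any() else detail_pred.sum() * 0.0)
    insig_loss = (F.mse_loss(detail_pred[insig], detail_tgt[insig])
                  if insig.any() else detail_pred.sum() * 0.0)

    return coarse_loss + 0.5 * (sig_loss + insig_loss)
```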

Conditional Generation Capability

Make-A-Shape also extends to conditional generation across a variety of inputs. Different modalities, including single/multi-view images, point clouds, and low-resolution voxels, are accommodated by encoding each condition into a set of latent vectors, which are then injected into the generative network. This modular design lets the framework adapt to diverse inputs with little effort, positioning it for practical applications where conditions can differ significantly.
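The following is a minimal sketch of one standard way to realize this kind of injection, using cross-attention from generator features to condition latents; the module, its placement, and the encoder names are hypothetical stand-ins rather than the paper's exact architecture.

```python
# Sketch of condition injection via cross-attention; layer placement and
# encoder names (image_encoder, pointnet_encoder) are hypothetical.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, cond):
        # x: (B, N, dim) flattened generator features; cond: (B, M, cond_dim) latents.
        out, _ = self.attn(self.norm(x), cond, cond)
        return x + out   # residual path preserves unconditional behavior

# Each modality is first reduced to a set of latent vectors, e.g.:
#   cond = image_encoder(images)      # single/multi-view images -> (B, M, cond_dim)
#   cond = pointnet_encoder(points)   # point clouds             -> (B, M, cond_dim)
block = CrossAttentionBlock(dim=256, cond_dim=512)
x = torch.randn(2, 1024, 256)         # toy flattened feature grid
cond = torch.randn(2, 77, 512)        # toy condition latents
y = block(x, cond)                    # (2, 1024, 256)
```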

Experiments and Results

The model's proficiency is supported by extensive experimental validation. It generates condition-aware 3D shapes that outperform the state of the art, particularly with image inputs, where it faithfully reconstructs the visible parts of objects while producing plausible variations for the unseen parts. The framework also adapts readily to varying point cloud densities and voxel resolutions without sacrificing quality.

Importantly, the framework supports tasks beyond generation, such as zero-shot shape completion, where it plausibly fills in the missing parts of partial inputs. This versatility extends the utility of Make-A-Shape to domains where object restoration or extrapolation is essential.
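To illustrate, completion can be cast as diffusion inpainting over the coefficient grid, in the spirit of RePaint: observed coefficients are re-imposed at every denoising step while the rest are sampled. The sketch below assumes a diffusers-style scheduler interface and a noise-predicting denoiser model(x, t); it shows the generic procedure, not necessarily the paper's exact sampler.

```python
# RePaint-style completion sketch; assumes a diffusers-style scheduler
# (timesteps, add_noise, step) and a noise-prediction model.
import torch

@torch.no_grad()
def complete_shape(model, scheduler, known, mask):
    # known: packed coefficient grid from the partial shape; mask: 1 where observed.
    x = torch.randn_like(known)
    for t in scheduler.timesteps:
        noisy_known = scheduler.add_noise(known, torch.randn_like(known), t)
        x = mask * noisy_known + (1 - mask) * x       # clamp the observed region
        eps = model(x, t)                             # predict the noise
        x = scheduler.step(eps, t, x).prev_sample     # one reverse-diffusion step
    return mask * known + (1 - mask) * x
```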

Conclusions and Future Directions

Make-A-Shape marks a significant step in large-scale 3D shape modeling, providing a route to generative models that synthesize high-quality outputs rapidly. One limitation is the model's bias toward certain object categories, a consequence of training-data imbalance. In addition, the current model handles geometry only, without texture. Future work could mitigate these limitations by exploring category annotations and introducing texture into the generative process. The promise Make-A-Shape holds for 3D content creation, simulation, and potentially virtual reality and gaming is substantial.