Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning (2405.08054v1)

Published 13 May 2024 in cs.GR and cs.CV

Abstract: As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.


Summary

  • The paper introduces a novel proxy-guided diffusion process that converts basic geometric shapes into detailed 3D assets.
  • The methodology integrates voxelized proxies with a multi-view diffusion model to enable interactive, fine-grained 3D model editing.
  • Performance evaluations and user studies demonstrate Coin3D's superior fidelity and accessibility compared to existing 3D generation methods.

Coin3D: Interactive and Controllable 3D Asset Generation

Creating 3D assets has often been a specialized task requiring significant expertise in modeling software. But what if we could simplify this process, making it accessible even to those without advanced skills? That's where Coin3D steps in with a fresh perspective on 3D asset generation.

What is Coin3D?

Coin3D is a framework that allows users to create 3D objects interactively and with ease. It takes the principles of 2D diffusion models used for generating images and adapts them for 3D. Instead of demanding extensive modeling knowledge, users can start with basic shapes—think cubes, spheres, and cylinders—and assemble these into a coarse proxy of the desired object. Then, Coin3D uses this proxy to generate detailed 3D assets.

Key Features of Coin3D

3D-Aware Control with Proxies

At its core, Coin3D uses simple geometric proxies as guides for generating 3D models. These proxies can be anything from a basic stack of shapes to more complex assemblies. Users can create these proxies using familiar tools like Tinkercad or Blender. By voxelizing these shapes and integrating them into a multi-view diffusion process, Coin3D can generate detailed 3D objects that closely follow the intended design.

Here's how the process works (a minimal code sketch of the voxelization step follows the list):

  1. Input Creation: Users create a proxy using basic shapes and add corresponding text prompts.
  2. Feature Extraction: The proxy is voxelized into a feature volume that guides the 3D generation.
  3. Diffusion Process: This volume integrates with a multi-view diffusion model to produce consistent images from different angles, ensuring that all views of the 3D object are coherent.
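
To make steps 1 and 2 concrete, here is a minimal, hedged sketch of how a proxy assembled from basic shapes could be voxelized into a coarse occupancy volume. It uses trimesh and NumPy with a 32-cube grid purely for illustration; the paper's actual feature volume, resolution, and adapter architecture are not reproduced here, so treat the names below as assumptions.

```python
# Sketch only: assemble basic shapes into a proxy and voxelize it.
# Assumptions (not from the paper): trimesh for geometry, a 32^3 grid,
# and binary occupancy instead of learned volumetric features.
import numpy as np
import trimesh

def voxelize_proxy(meshes, resolution=32):
    """Merge basic shapes into one proxy and rasterize it to an occupancy grid."""
    proxy = trimesh.util.concatenate(meshes)       # assembled cubes/spheres/cylinders
    pitch = proxy.extents.max() / resolution       # voxel edge length
    grid = proxy.voxelized(pitch=pitch).fill()     # solid (filled) voxelization
    vol = np.zeros((resolution,) * 3, dtype=np.float32)
    idx = np.clip(grid.sparse_indices, 0, resolution - 1)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0     # mark occupied cells
    return vol                                     # (R, R, R) coarse shape volume

# Example: a two-sphere "snowman" proxy
parts = [trimesh.creation.icosphere(radius=0.5),
         trimesh.creation.icosphere(radius=0.3).apply_translation([0, 0, 0.7])]
occupancy = voxelize_proxy(parts)
print(occupancy.shape, int(occupancy.sum()))
```

In the full system, a volume of this kind would be lifted into features by the 3D adapter and injected into the multi-view diffusion model as the conditioning signal described in step 3.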

Interactive and Responsive Generation

One of the standout features of Coin3D is its interactive workflow. Users can not only generate entire models but also make fine-grained edits to specific parts. For example, you could start with a basic model of a car and then interactively add or modify parts like wheels or mirrors without redoing the whole model. This involves two mechanisms (sketched in code after the list):

  • Proxy-Bounded Part Editing: Users can designate and edit specific parts of the proxy. The system ensures that only the selected parts are updated, while the rest remains consistent.
  • Progressive Volume Caching: To enable quick previews from any angle, Coin3D caches the volumetric information. This means users can see the results of their edits almost instantly, making the modeling process much more intuitive.
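
The following is a minimal sketch, under assumptions of my own, of the bookkeeping behind these two features: a boolean mask marks the proxy part selected for editing, freshly generated features replace the cached ones only inside that mask, and everything outside the mask is served from the cache. Coin3D's actual feature shapes, masking granularity, and cache policy are not reproduced here.

```python
# Hedged sketch: proxy-bounded editing plus a simple volume cache.
import numpy as np

def part_mask(resolution, lo, hi):
    """Boolean mask covering the axis-aligned region of the edited proxy part."""
    mask = np.zeros((resolution,) * 3, dtype=bool)
    mask[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = True
    return mask

def apply_part_edit(cached_vol, regenerated_vol, mask):
    """Keep cached features outside the edit region, new features inside it."""
    return np.where(mask, regenerated_vol, cached_vol)

# Usage: regenerate only a small region of a 32^3 feature volume
R = 32
cached = np.random.rand(R, R, R).astype(np.float32)    # previously cached features
fresh = np.random.rand(R, R, R).astype(np.float32)     # newly generated features
mask = part_mask(R, lo=(20, 10, 10), hi=(28, 16, 16))  # hypothetical part bounds
updated = apply_part_edit(cached, fresh, mask)          # responsive partial update
```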

Consistent 3D Reconstruction

Generating images from multiple angles is one thing, but ensuring these images translate into a consistent 3D model is another challenge. Coin3D addresses this with a volume-conditioned reconstruction strategy (volume-SDS in the paper): by leveraging the 3D control volume again during the reconstruction phase, Coin3D produces high-fidelity meshes suitable for further use in computer graphics applications.
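
As a rough illustration of what a volume-conditioned score-distillation step could look like, here is a hedged PyTorch sketch. The denoiser interface, the weighting, and the way the volume features are passed in are all assumptions made for illustration, not Coin3D's actual implementation.

```python
# Sketch of an SDS-style gradient with an extra volume condition (assumed API).
import torch

def volume_sds_grad(denoiser, rendered, text_emb, volume_feat, t, alphas_cumprod):
    """Return a score-distillation gradient for one rendered view."""
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t]
    noisy = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise   # forward-diffuse the render
    with torch.no_grad():
        eps_pred = denoiser(noisy, t, text_emb, volume_feat)     # volume-conditioned denoiser (placeholder)
    weight = 1.0 - a_t                                           # common SDS weighting choice
    return weight * (eps_pred - noise)                           # pushed back into the 3D representation
```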

Performance and Comparisons

In terms of results, Coin3D demonstrates robust performance. The authors evaluated Coin3D against existing image-to-3D methods such as Wonder3D and SyncDreamer. The key evaluation metrics included (a sketch of the CLIP-score computation follows the list):

  • CLIP Score: Measuring how well the generated object matches the text description.
  • ImageReward and GPTEvals3D: Assessing the perceptual quality of the generated views.
  • User Studies: Collecting feedback on user satisfaction with the generated models.
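
As an example of the first metric, here is a small sketch of a CLIP-score computation over rendered views, using the Hugging Face transformers CLIP model. The model name, the averaging over views, and the placeholder images are assumptions; the paper may compute the metric differently.

```python
# Hedged sketch: CLIP score as cosine similarity between prompt and rendered views.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt, views):
    """Average cosine similarity between one text prompt and a list of PIL images."""
    inputs = processor(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Stand-ins for rendered views of the generated object
views = [Image.new("RGB", (224, 224), color=(128, 128, 128)) for _ in range(4)]
print(clip_score("a red sports car", views))
```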

Across these metrics, Coin3D consistently showed better performance, particularly in how closely the generated objects matched the provided proxies and descriptions. This suggests that the 3D-aware control significantly enhances the quality and usability of generated models.

Implications

The implications of Coin3D extend beyond simply making 3D modeling easier. By providing an accessible way to create and edit 3D models interactively, Coin3D could democratize 3D content creation. This means more artists, designers, and even hobbyists could start creating high-quality 3D assets without needing deep expertise in 3D software.

Future Directions

While Coin3D already shows significant promise, there are clear paths for future enhancement:

  • Broader Shape Library: Expanding the basic shapes available for proxies could make the system even more versatile.
  • Real-Time Collaboration: Integrating real-time collaborative features could further enhance its utility for team-based projects.
  • Advanced Editing Tools: Adding more sophisticated editing capabilities, such as texture manipulation or physics-based simulation, could push the boundaries of what users can create.

In essence, Coin3D represents a significant step forward in making 3D modeling more intuitive, interactive, and accessible to a wider audience. By leveraging both 3D-aware control and responsive workflows, it paves the way for a future where anyone can bring their 3D ideas to life with ease.
