
Spice-E: Structural Priors in 3D Diffusion using Cross-Entity Attention (2311.17834v4)

Published 29 Nov 2023 in cs.CV and cs.GR

Abstract: We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that Spice-E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.
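
The abstract describes cross-entity attention only at a high level, so the following is a minimal illustrative sketch rather than the paper's actual layer. It assumes one plausible form of the idea: each entity's token sequence acts as queries over the other entity's tokens, so the input shape and the guidance shape exchange information. The function names (`attend`, `cross_entity_attention`), the identity query/key/value projections, and the toy token values are all hypothetical; a real denoising network would use learned projections and latent shape representations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


def cross_entity_attention(input_tokens, guidance_tokens):
    """Assumed form: each entity queries the other entity's tokens.

    Identity projections are used here for brevity (an assumption);
    a trained model would apply learned Q/K/V projection matrices.
    """
    new_input = attend(input_tokens, guidance_tokens, guidance_tokens)
    new_guidance = attend(guidance_tokens, input_tokens, input_tokens)
    return new_input, new_guidance


# Toy example: two input tokens and one guidance token, each of width 2.
input_tokens = [[1.0, 0.0], [0.0, 1.0]]
guidance_tokens = [[2.0, 2.0]]
new_input, new_guidance = cross_entity_attention(input_tokens, guidance_tokens)
```

In this toy setup every input token can only attend to the single guidance token, so each updated input token is a copy of it, while the guidance token becomes an even mix of the two input tokens. Sequence lengths are preserved for both entities, which is what lets such a layer sit inside an existing denoising network.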
