
Generic 3D Diffusion Adapter Using Controlled Multi-View Editing (2403.12032v2)

Published 18 Mar 2024 in cs.CV and cs.GR

Abstract: Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.

References (71)
  1. Cross-Image Attention for Zero-Shot Appearance Transfer. arXiv:2311.03335 [cs.CV]
  2. RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation. In CVPR.
  3. GAUDI: A Neural Architect for Immersive 3D Scene Generation. In NeurIPS.
  4. InstructPix2Pix: Learning to Follow Image Editing Instructions. In CVPR.
  5. TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models. In ICCV.
  6. Efficient Geometry-aware 3D Generative Adversarial Networks. In CVPR.
  7. GeNVS: Generative Novel View Synthesis with 3D-Aware Diffusion Models. In ICCV.
  8. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR]. Stanford University — Princeton University — Toyota Technological Institute at Chicago.
  9. Text2Tex: Text-driven Texture Synthesis via Diffusion Models. In ICCV.
  10. Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction. In ICCV.
  11. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. In ICCV.
  12. Objaverse: A Universe of Annotated 3D Objects. In CVPR.
  13. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items. In ICRA. 2553–2560.
  14. From data to functa: Your data point is a function and you can treat it like one. In ICML.
  15. Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans. In ICCV. 10786–10796.
  16. Arpad E. Elo. 1967. The Proposed USCF Rating System, Its Development, Theory, and Applications. Chess Life 22, 8 (1967), 242–247.
  17. NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion. In ICML.
  18. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371 [cs.CV]
  19. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In ICCV.
  20. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS.
  21. Denoising Diffusion Probabilistic Models. In NeurIPS.
  22. Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS Workshop.
  23. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR. https://openreview.net/forum?id=nZeVKeeFYf9
  24. Zero-Shot Text-Guided Object Generation with Dream Fields. In CVPR.
  25. InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering. In CVPR.
  26. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  27. TRACER: Extreme Attention Guided Salient Object Tracing Network. In AAAI.
  28. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
  29. Magic3D: High-Resolution Text-to-3D Content Creation. In CVPR.
  30. One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion. In CVPR.
  31. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds Without Per-Shape Optimization. In NeurIPS.
  32. Zero-1-to-3: Zero-shot One Image to 3D Object. In ICCV.
  33. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In ICLR.
  34. Wonder3D: Single Image to 3D using Cross-Domain Diffusion. In CVPR.
  35. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. In NeurIPS.
  36. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In CVPR. 11461–11471.
  37. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In ICLR.
  38. Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. In CVPR.
  39. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
  40. DiffRF: Rendering-Guided 3D Radiance Field Diffusion. In CVPR.
  41. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics 41, 4, Article 102 (July 2022), 15 pages. https://doi.org/10.1145/3528223.3530127
  42. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  43. Compositional 3D Scene Generation using Locally Conditioned Diffusion. In 3DV.
  44. DreamFusion: Text-to-3D using 2D Diffusion. In ICLR.
  45. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. In ICLR.
  46. Learning Transferable Visual Models from Natural Language Supervision. In ICML. 8748–8763.
  47. TEXTure: Text-Guided Texturing of 3D Shapes. In SIGGRAPH.
  48. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR.
  49. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS Workshop.
  50. Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis. In NeurIPS.
  51. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
  52. MVDream: Multi-view Diffusion for 3D Generation. In ICLR.
  53. 3D Neural Field Generation using Triplane Diffusion. In CVPR.
  54. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In NeurIPS.
  55. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR.
  56. Laplacian Surface Editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (Nice, France) (SGP ’04). Association for Computing Machinery, New York, NY, USA, 175–184. https://doi.org/10.1145/1057432.1057456
  57. DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior. In ICLR.
  58. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. In ICLR.
  59. Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision. In NeurIPS.
  60. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In NeurIPS. 27171–27183.
  61. Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion. In CVPR.
  62. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In ICCV Workshop.
  63. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. In NeurIPS.
  64. Novel View Synthesis with Diffusion Models. In ICLR.
  65. GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation. In CVPR.
  66. DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model. In ICLR.
  67. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV]
  68. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV.
  69. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
  70. Locally Attentional SDF Diffusion for Controllable 3D Shape Generation. ACM Transactions on Graphics 42, 4 (2023).
  71. Zhizhuo Zhou and Shubham Tulsiani. 2023. SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction. In CVPR.
Authors (8)
  1. Hansheng Chen (12 papers)
  2. Ruoxi Shi (20 papers)
  3. Yulin Liu (21 papers)
  4. Bokui Shen (16 papers)
  5. Jiayuan Gu (28 papers)
  6. Gordon Wetzstein (144 papers)
  7. Hao Su (218 papers)
  8. Leonidas Guibas (177 papers)
Citations (11)

Summary

Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Introduction

Open-domain 3D object synthesis has been an ongoing challenge in computer graphics and artificial intelligence, held back by sparse training data and high computational complexity. Recent advances use multi-view diffusion models that leverage pre-trained 2D models for 3D generation, but these techniques often struggle to ensure 3D consistency, retain high visual quality, and operate efficiently at the same time. Addressing these issues, this paper introduces MVEdit, a new framework that employs ancestral sampling with a 3D Adapter mechanism to jointly denoise multi-view images and produce high-quality textured meshes.

MVEdit Overview

MVEdit builds on off-the-shelf 2D diffusion models and integrates a novel, training-free 3D Adapter to enforce 3D consistency across multi-view outputs. The key innovation is lifting the denoised 2D views at each timestep into a coherent 3D representation, then conditioning the next timestep's 2D views on renders of that representation, enabling cross-view information exchange without compromising visual fidelity. Inference takes only 2-5 minutes, striking a better balance between quality, speed, and 3D consistency than earlier techniques such as score distillation.
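
Below is a minimal sketch of this sampling loop. The helper callables (denoise_views, fit_3d, render_views) and the toy noise schedule are assumptions for exposition, not the paper's released interface; the sketch only illustrates how the 3D Adapter interleaves per-view denoising with a lift-then-render step.

```python
# A minimal, illustrative sketch of an MVEdit-style sampling loop. The helper
# callables and the toy ancestral update are hypothetical stand-ins, not the
# paper's actual implementation.
import torch

def ancestral_step(x_t: torch.Tensor, x0: torch.Tensor, t: int, T: int = 1000) -> torch.Tensor:
    """Toy stand-in for a DDPM ancestral update: move toward the clean
    estimate x0 and re-inject noise scaled by the remaining time."""
    w = t / T
    return (1 - w) * x0 + w * x_t + 0.1 * w * torch.randn_like(x_t)

def mvedit_sample(x_T, timesteps, cameras, denoise_views, fit_3d, render_views):
    """Jointly denoise V multi-view images while keeping them 3D-consistent.

    x_T:       (V, C, H, W) pure-noise multi-view images
    timesteps: descending diffusion timesteps, e.g. range(1000, 0, -50)
    cameras:   V camera poses for rendering the lifted 3D representation
    """
    x_t, cond, scene = x_T, None, None
    for t in timesteps:
        # 1) Denoise each view with an off-the-shelf 2D diffusion model,
        #    conditioned (e.g. via a ControlNet) on the previous renders.
        x0_views = denoise_views(x_t, t, cond)      # (V, C, H, W)
        # 2) Training-free 3D Adapter: lift the denoised views into a
        #    coherent 3D representation (e.g. a radiance field or a mesh).
        scene = fit_3d(x0_views, cameras)
        # 3) Render that representation back to every camera; the renders
        #    serve as the 3D-consistent clean-image estimate.
        cond = render_views(scene, cameras)         # (V, C, H, W)
        # 4) Ancestral step toward the next (smaller) timestep.
        x_t = ancestral_step(x_t, cond, t)
    return scene  # final textured 3D output
```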

Core Contributions

  • 3D Adapter on Existing Diffusion Models: Unlike prior approaches that require substantial architectural changes or end-to-end training to achieve 3D consistency, MVEdit uses ControlNets to condition the denoising steps of pre-trained 2D diffusion models on 3D-aware renderings (see the sketch after this list).
  • Versatile and Extendable Framework: Demonstrated across various tasks such as text/image-to-3D generation, 3D-to-3D editing, and texture synthesis, MVEdit showcases state-of-the-art performance, particularly in image-to-3D and text-guided texture generation.
  • Fast Text-to-3D Initialization: The paper introduces StableSSDNeRF, which fine-tunes a 2D latent diffusion model on a small 3D dataset to provide rapid low-resolution text-to-3D initialization, circumventing the scarcity of large 3D datasets.
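
To make the ControlNet-based conditioning concrete, here is a hedged sketch using the Hugging Face diffusers API. The checkpoint names and the blank placeholder image are illustrative assumptions; MVEdit's actual ControlNet wiring inside its sampling loop is more involved than a single text-to-image call.

```python
# A hedged sketch of ControlNet-style conditioning with the `diffusers`
# library. The checkpoints and the blank placeholder image are illustrative,
# not MVEdit's exact configuration.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder for a render of the lifted 3D representation from the
# previous timestep; in MVEdit, such renders are what enforce 3D consistency.
rendered_view = Image.new("RGB", (512, 512))

image = pipe(
    prompt="a wooden chair, studio lighting",
    image=rendered_view,                 # ControlNet conditioning image
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,   # strength of the 3D-aware condition
).images[0]
image.save("conditioned_view.png")
```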

Practical and Theoretical Implications

The MVEdit framework marks a significant step toward efficient 3D content generation from 2D data, highlighting the potential of reusing pre-trained models across dimensions without extensive retraining. Theoretically, it demonstrates that cross-dimensional consistency can be achieved through conditioned diffusion processes, providing a blueprint for future research on 3D generative models.

From a practical standpoint, the versatility and extendability of MVEdit unlock new possibilities in digital content creation, enabling intricate 3D model generation and editing with minimal input requirements. This could particularly benefit industries reliant on rapid prototyping and visualization, like gaming, virtual reality, and film production.

Future Directions in AI and 3D Generation

Looking ahead, purpose-built 3D Adapters, trained specifically to augment 2D diffusion models for 3D tasks, could further improve the efficiency, quality, and consistency of generated objects. Deepening the understanding and optimization of the conditioning mechanisms between 2D imagery and 3D models also remains an exciting area for ongoing research, with the potential to bridge the gap between the two domains more seamlessly.

In conclusion, MVEdit represents a notable advancement in the domain of 3D object synthesis, promoting a more effective utilization of existing 2D models for 3D generation tasks. Its methodological advancements and practical applications suggest a promising avenue for further exploration and development within the AI and computer graphics research communities.
