
3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation (2410.18974v1)

Published 24 Oct 2024 in cs.CV and cs.AI

Abstract: Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high-quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.

References (74)
  1. RenderDiffusion: Image diffusion for 3D reconstruction, inpainting and generation. In CVPR, 2023.
  2. GAUDI: A neural architect for immersive 3D scene generation. In NeurIPS, 2022.
  3. Efficient geometry-aware 3D generative adversarial networks. In CVPR, 2022.
  4. Text2Tex: Text-driven texture synthesis via diffusion models. In ICCV, 2023a.
  5. Single-stage diffusion NeRF: A unified approach to 3D generation and reconstruction. In ICCV, 2023b.
  6. V3D: Video diffusion models are effective 3D generators, 2024.
  7. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
  8. 8-bit optimizers via block-wise quantization. In ICLR, 2022.
  9. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In ICRA, pp. 2553–2560, 2022.
  10. From data to functa: Your data point is a function and you can treat it like one. In ICML, 2022.
  11. GenesisTex: Adapting image denoising diffusion to texture space. In CVPR, 2024.
  12. NerfDiff: Single-image view synthesis with NeRF-guided distillation from 3D-aware diffusion. In ICML, 2023.
  13. 3DGen: Triplane latent diffusion for textured mesh generation, 2023.
  14. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  15. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.
  16. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  17. 2D Gaussian splatting for geometrically accurate radiance fields, 2024.
  18. Zero-shot text-guided object generation with Dream Fields. In CVPR, 2022.
  19. Shap-E: Generating conditional 3D implicit functions, 2023.
  20. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
  21. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.
  22. InfoNeRF: Ray entropy minimization for few-shot neural volume rendering. In CVPR, 2022.
  23. Adam: A method for stochastic optimization. In ICLR, 2015.
  24. TRACER: Extreme attention guided salient object tracing network. In AAAI, 2022.
  25. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. In ICLR, 2024. URL https://openreview.net/forum?id=2lDQLiH1W4.
  26. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
  27. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. In NeurIPS, 2023a.
  28. One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. In CVPR, 2024a.
  29. Zero-1-to-3: Zero-shot one image to 3D object. In ICCV, 2023b.
  30. SyncDreamer: Generating multiview-consistent images from a single-view image. In ICLR, 2024b.
  31. Text-guided texturing by synchronized multi-view diffusion, 2023c.
  32. SparseNeuS: Fast generalizable neural surface reconstruction from sparse views. In ECCV, 2022.
  33. Wonder3D: Single image to 3D using cross-domain diffusion. In CVPR, 2024.
  34. Decoupled weight decay regularization. In ICLR, 2019.
  35. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, 2022.
  36. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  37. Latent-NeRF for shape-guided generation of 3D shapes and textures. In CVPR, 2023.
  38. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  39. DiffRF: Rendering-guided 3D radiance field diffusion. In CVPR, 2023.
  40. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127.
  41. Scalable diffusion models with transformers. In ICCV, 2023.
  42. State of the art on diffusion models for visual computing. In Eurographics STAR, 2024.
  43. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
  44. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. In ICLR, 2024.
  45. Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.
  46. TEXTure: Text-guided texturing of 3D shapes. In SIGGRAPH, 2023.
  47. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  48. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS Workshop, 2022.
  49. Deep Marching Tetrahedra: A hybrid representation for high-resolution 3D shape synthesis. In NeurIPS, 2021.
  50. Zero123++: A single image to consistent multi-view diffusion base model, 2023.
  51. MVDream: Multi-view diffusion for 3D generation. In ICLR, 2024.
  52. 3D neural field generation using triplane diffusion. In CVPR, 2023.
  53. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
  54. Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, SGP ’04, pp. 175–184, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 3905673134. doi: 10.1145/1057432.1057456. URL https://doi.org/10.1145/1057432.1057456.
  55. LGM: Large multi-view Gaussian model for high-resolution 3D content creation, 2024a.
  56. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. In ICLR, 2024b.
  57. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. In NeurIPS, 2023.
  58. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv, 2024.
  59. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, pp. 27171–27183, 2021a.
  60. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In ICLR, 2024. URL https://openreview.net/forum?id=noe76eRcPC.
  61. Rodin: A generative model for sculpting 3D digital avatars using diffusion. In CVPR, 2023.
  62. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In ICCV Workshop, 2021b.
  63. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.
  64. Novel view synthesis with diffusion models. In ICLR, 2023.
  65. GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation. In CVPR, 2024.
  66. GRM: Large Gaussian reconstruction model for efficient 3D reconstruction and generation, 2024a.
  67. DMV3D: Denoising multi-view diffusion using 3D large reconstruction model. In ICLR, 2024b.
  68. Gaussian opacity fields: Efficient and compact surface reconstruction in unbounded scenes, 2024.
  69. ARF: Artistic radiance fields. In ECCV, 2022.
  70. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  71. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  72. Locally attentional SDF diffusion for controllable 3D shape generation. ACM Transactions on Graphics, 42(4), 2023.
  73. Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. In CVPR, 2024.
  74. VideoMV: Consistent multi-view generation based on large video generative model, 2024.
Authors (10)
  1. Hansheng Chen (12 papers)
  2. Bokui Shen (16 papers)
  3. Yulin Liu (21 papers)
  4. Ruoxi Shi (20 papers)
  5. Linqi Zhou (20 papers)
  6. Connor Z. Lin (7 papers)
  7. Jiayuan Gu (28 papers)
  8. Hao Su (218 papers)
  9. Gordon Wetzstein (144 papers)
  10. Leonidas Guibas (177 papers)

Summary

Analysis of "3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation"

The paper "3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation" offers a novel approach to enhancing the geometric consistency of 3D object generation using diffusion models. The authors address a critical limitation in existing multi-view image diffusion models—which often lack intrinsic 3D biases—by introducing the 3D-Adapter, a module designed to embed 3D geometry awareness into pretrained image diffusion models.

Core Concepts and Methodologies

The central innovation of the 3D-Adapter is a process called 3D feedback augmentation. At each denoising step of the sampling loop, the adapter decodes the base model's intermediate multi-view features into a coherent 3D representation, renders that representation, and re-encodes the rendered RGBD views, adding them back to the base model's features. This feedback improves 3D consistency without altering the original architecture.
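
To make the control flow concrete, the following is a minimal PyTorch sketch of one feedback-augmented sampling loop. Every function here (decode_to_3d, render_and_encode, base_denoise) is a hypothetical placeholder for the corresponding component, and the tensor shapes are illustrative; this is a sketch of the idea, not the paper's actual implementation.

```python
import torch

def decode_to_3d(view_feats):
    # Stand-in 3D decoder; the paper's fast variant uses feed-forward
    # Gaussian splatting, the training-free variant neural fields/meshes.
    return view_feats.mean(dim=0, keepdim=True)  # placeholder coherent state

def render_and_encode(state_3d, num_views):
    # Stand-in for rendering RGBD views from the 3D state and
    # re-encoding them into the base model's feature space.
    return state_3d.expand(num_views, *state_3d.shape[1:])

def base_denoise(x_t, t, feedback=None, scale=1.0):
    # Stand-in for one step of the pretrained multi-view diffusion model.
    x = 0.98 * x_t  # placeholder denoising update
    if feedback is not None:
        x = x + scale * feedback  # 3D feedback enters via feature addition
    return x

num_views, num_steps = 4, 50
x_t = torch.randn(num_views, 4, 32, 32)  # noisy multi-view latents
for t in reversed(range(num_steps)):
    feats = base_denoise(x_t, t)                        # intermediate features
    state_3d = decode_to_3d(feats)                      # lift views to 3D
    feedback = render_and_encode(state_3d, num_views)   # render + re-encode
    x_t = base_denoise(x_t, t, feedback=feedback, scale=0.5)  # augmented step
```

The key design choice, per the abstract, is that the feedback enters through feature addition, leaving the pretrained weights and architecture untouched.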

Two variants of the 3D-Adapter are explored:

  1. Fast Feed-Forward Version: Reconstructs the 3D representation with feed-forward Gaussian splatting, making it well suited to tasks that demand rapid generation while maintaining quality.
  2. Training-Free Version: Employs neural fields and meshes and requires no additional training, offering flexibility across a wider range of applications.

Experimental Evaluation and Results

The authors conducted extensive experiments across several tasks, including text-to-3D, image-to-3D, text-to-texture, and text-to-avatar generation. Notable findings include:

  • Text-to-3D: The 3D-Adapter improved over existing models, substantially raising image-text alignment and visual quality as measured by CLIP score and aesthetic score (a sketch of the CLIP-score computation follows this list).
  • Image-to-3D: The method maintained superior visual quality, surpassing baselines such as One-2-3-45, with better consistency between the generated asset and the conditioning input.
  • Text-to-Texture and Text-to-Avatar: The model also outperformed competing approaches in both geometric and texture consistency, achieving high CLIP scores and low mean depth distortion.
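
For reference, the CLIP score used in such comparisons is typically the cosine similarity between CLIP embeddings of a rendered view and its text prompt. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint name and the absence of any rescaling are assumptions, as the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    # Embed the image and the prompt with CLIP, then take the cosine
    # similarity of the normalized embeddings.
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()  # cosine similarity in [-1, 1]
```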

Theoretical Contributions

The paper also contributes a theoretical analysis of the limitations of the input/output synchronization techniques common in multi-view diffusion pipelines. In particular, it identifies how averaging scores (or denoised estimates) across views leads to mode collapse, washing out finer details, a drawback that 3D feedback augmentation is designed to avoid.
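
The intuition can be stated compactly. The following is a simplified paraphrase under strong assumptions (views synchronized by averaging their denoised estimates, and a bimodal data distribution), not the paper's exact derivation:

```latex
% Synchronizing N views by averaging their denoised estimates yields
\[
\hat{x}_0 \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\!\left[x_0 \mid x_t^{(i)}\right].
\]
% Each term is a posterior mean, which for a multimodal $p(x_0)$ lies
% between the modes (e.g., for equal-weight modes at $\pm m$, it tends
% toward $0$). Repeatedly replacing per-view estimates with this average
% drives sampling toward a low-density point between the modes, which
% manifests as mode collapse and loss of fine detail.
```

By contrast, 3D feedback augmentation feeds the reconstruction back as added guidance features rather than replacing per-view predictions with their average, which is how the paper argues it sidesteps this failure mode.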

Implications and Future Directions

The implications of this work span both practical applications and theoretical insights in 3D generation models. By refining the geometric consistency of 3D diffusion models, the 3D-Adapter promises enhanced applicability in fields such as virtual reality, gaming, and digital content creation. Future developments could focus on further optimizing computational efficiency and exploring adaptive approaches to dynamic scenes.

In conclusion, the 3D-Adapter represents a significant step toward closing the geometric-consistency gap between 2D and 3D diffusion models, offering robust and flexible methods for a broad spectrum of 3D generation tasks. The insights and methodologies presented stand to influence AI-driven 3D modeling and to open avenues for further work in neural rendering and diffusion-based generation.
