Overview of MVDream: Multi-view Diffusion for 3D Generation
The paper introduces MVDream, a multi-view diffusion model that generates consistent multi-view images from a text prompt. The work addresses a central limitation of existing 2D-lifting methods for 3D generation, which typically produce inconsistent views because the underlying 2D models lack multi-view knowledge. By training on both 2D and 3D data, the proposed model achieves generalizable 3D asset creation while maintaining consistency across generated views.
Key Contributions
- Multi-view Diffusion Model: The core contribution of MVDream is a diffusion model that maintains cross-view consistency by building on a pre-trained 2D diffusion model and extending its architecture with a 3D-aware self-attention mechanism. This adaptation lets the model capture cross-view dependencies, overcoming issues such as content drift and the multi-face Janus problem that are prevalent in 2D-lifting methods.
- Application in 3D Generation: Used as the prior in Score Distillation Sampling (SDS), MVDream provides a robust multi-view 3D prior that significantly improves 3D generation. Multi-view supervision during distillation yields more stable and realistic 3D assets without sacrificing the diversity of generated content (a minimal SDS sketch follows this list).
- Multi-view DreamBooth: Inspired by DreamBooth, MVDream can be fine-tuned on a small set of identity images to learn a new concept and assimilate identity-specific content while preserving multi-view consistency, extending subject-driven image generation to multi-view 3D asset creation.
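To make the SDS usage concrete, below is a minimal sketch of a single SDS optimization step driven by a multi-view diffusion prior. This is an illustrative reconstruction under stated assumptions, not the authors' code: `mv_model`, `render_views`, and `nerf_params` are hypothetical placeholders, and details such as classifier-free guidance and the SDS weighting term are omitted.

```python
# Minimal sketch of one Score Distillation Sampling (SDS) step with a
# multi-view diffusion prior. All names below are illustrative assumptions.
import torch

def sds_step(nerf_params, mv_model, render_views, text_emb, cameras,
             alphas_cumprod, t_range=(20, 980)):
    # 1. Render the current 3D representation from several camera poses.
    #    `images` must be differentiable w.r.t. the 3D parameters.
    images = render_views(nerf_params, cameras)          # (V, 3, H, W)
    # 2. Sample a shared timestep and add noise. Timestep annealing would
    #    shrink t_range as optimization proceeds.
    t = torch.randint(t_range[0], t_range[1], (1,))
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(images)
    noisy = a_t.sqrt() * images + (1 - a_t).sqrt() * noise
    # 3. Predict the noise for all views jointly with the multi-view model.
    with torch.no_grad():
        noise_pred = mv_model(noisy, t, text_emb, cameras)
    # 4. SDS gradient: the noise residual, back-propagated only through the
    #    renderer (the w(t) weighting is omitted for brevity).
    grad = (noise_pred - noise).detach()
    loss = (grad * images).sum()   # d(loss)/d(images) == grad
    return loss

# Usage (schematic): loss = sds_step(...); loss.backward(); optimizer.step()
```

The surrogate loss is constructed so that its gradient with respect to the rendered images equals the predicted-noise residual, which is how SDS injects the diffusion prior into the 3D optimization.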
Methodology
- Training Framework: MVDream is trained on a mixture of large-scale 2D image-text pairs and multi-view renderings of 3D assets from the Objaverse collection, striking a balance between generalizability and view consistency. The model reuses the backbone and pre-trained weights of Stable Diffusion.
- 3D Self-Attention Mechanism: Central to the model is an inflated 3D self-attention mechanism that extends the original 2D self-attention layers with cross-view connections, so that tokens from all views attend to one another during denoising and consistency is enforced (see the sketch after this list).
- 3D Generation via SDS: Exploiting the learned multi-view prior, MVDream applies SDS to optimize 3D representations, improving their fidelity and spatial coherence. Techniques such as timestep annealing and negative-prompt conditioning further refine the model's performance in text-to-3D generation.
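The sketch below illustrates the idea of inflating a 2D self-attention layer to operate across views by folding the view axis into the token axis. Class and tensor names are assumptions for illustration; the actual implementation inside the Stable Diffusion UNet differs in detail.

```python
# Minimal sketch of an "inflated" 3D self-attention block (names and shapes
# are illustrative assumptions, not the authors' implementation).
import torch
import torch.nn as nn

class InflatedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Same parameterization as a standard 2D attention layer, so
        # pre-trained 2D weights could in principle be loaded directly.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        # x: (batch * num_views, tokens, dim) -- the usual 2D layout.
        bv, n, d = x.shape
        b = bv // num_views
        # Fold the view axis into the token axis so every token attends to
        # tokens from all views: (b, num_views * tokens, dim).
        x = x.reshape(b, num_views * n, d)
        out, _ = self.attn(x, x, x, need_weights=False)
        # Restore the per-view layout for the rest of the network.
        return out.reshape(bv, n, d)

# Usage: four views of 32x32 latent tokens with 320 channels.
x = torch.randn(2 * 4, 32 * 32, 320)          # (batch*views, tokens, dim)
y = InflatedSelfAttention(320)(x, num_views=4)
print(y.shape)                                 # torch.Size([8, 1024, 320])
```

Because the inflated layer keeps the same parameter shapes as its 2D counterpart, the pre-trained 2D attention weights can be reused, which is what lets the model retain its 2D generalizability while gaining cross-view consistency.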
Numerical Results and Claims
The experiments show that MVDream compares favorably with contemporary 3D generation methods, surpassing them in multi-view consistency as measured by user studies and numerous qualitative comparisons. Its stability also reduces the per-prompt parameter tuning commonly required by other approaches.
Implications and Future Directions
This research extends the frontier of AI in 3D content creation, combining text prompts with multi-view geometric supervision to streamline and improve asset-generation workflows. The proposed architecture offers a scalable path to high-fidelity 3D model generation, with potential applications in gaming, media, and virtual reality.
Future work may investigate scaling the model to higher-resolution outputs or integrating more diverse datasets to enhance the style and realism of generated 3D assets. Additionally, exploring integrations with other state-of-the-art diffusion models could broaden MVDream's applicability across varied industries. Further research could also examine the ethical considerations inherent in AI-generated content, aiming to mitigate biases and ensure responsible deployment of these technologies.