ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models (2403.01807v2)

Published 4 Mar 2024 in cs.CV

Abstract: 3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent, and have favorable visual quality (-30% FID, -37% KID).

Authors (8)
  1. Lukas Höllein (8 papers)
  2. Norman Müller (16 papers)
  3. David Novotny (42 papers)
  4. Hung-Yu Tseng (31 papers)
  5. Christian Richardt (36 papers)
  6. Michael Zollhöfer (51 papers)
  7. Matthias Nießner (177 papers)
  8. Aljaž Božič (14 papers)
Citations (24)

Summary

3D-Consistent Image Generation via Text-to-Image Models: Insights from ViewDiff

The paper, "ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models," addresses the challenge of generating photorealistic, 3D-consistent images from text or posed input images using pretrained text-to-image (T2I) models. The work leverages existing T2I models to create high-quality images that are consistent across multiple viewpoints, integrating innovations in the architecture to enhance 3D understanding and rendering.

Core Contributions

The authors present several key contributions that set their work apart in the domain of text-to-3D image generation:

  1. Integration of 3D Volume-Rendering and Cross-Frame Attention: Projection (volume-rendering) layers and cross-frame-attention layers are embedded into each block of the pretrained T2I U-Net, so the model outputs multi-view-consistent images in a single denoising process (a simplified sketch of such an augmented block follows this list).
  2. Autoregressive Generation Scheme: New viewpoints are denoised while conditioning on previously generated images, which yields more 3D-consistent renderings of the same instance at arbitrary novel viewpoints (see the sampling sketch after the list).
  3. Architecture Augmentation with Cross-Frame Attention and Projection Layers: Together, these added layers keep object identity stable across views and markedly improve the generation of realistic surroundings.
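
To make the first and third points concrete, the sketch below shows, in simplified PyTorch, how one block of a pretrained T2I U-Net could be augmented with a cross-frame-attention layer and a projection layer. This is an illustrative sketch under stated assumptions, not the authors' implementation: the module names, feature dimensions, and the simplified per-view fusion standing in for true pose-aware volume rendering are all placeholders.

```python
# Illustrative sketch (not the released ViewDiff code) of a U-Net block augmented
# with cross-frame attention and a projection (volume-rendering) layer.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Attention computed jointly over the tokens of all N views of one scene."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, HW, C) -> flatten the views so every token attends across views
        b, n, t, c = x.shape
        tokens = x.reshape(b, n * t, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, n, t, c)


class ProjectionLayer(nn.Module):
    """Stand-in for the projection / volume-rendering layer: per-view features are
    fused into a shared representation and broadcast back to every view. A real
    implementation would unproject features into a 3D grid using the camera poses
    and volume-render them back into each view."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, HW, C)
        shared = self.fuse(x.mean(dim=1, keepdim=True))  # (B, 1, HW, C)
        return x + shared                                # inject shared 3D signal


class AugmentedUNetBlock(nn.Module):
    """One schematic T2I U-Net block with the two new layers inserted."""

    def __init__(self, dim: int):
        super().__init__()
        self.resnet = nn.Sequential(
            nn.GroupNorm(8, dim), nn.SiLU(), nn.Conv2d(dim, dim, 3, padding=1)
        )
        self.cross_frame_attn = CrossFrameAttention(dim)
        self.projection = ProjectionLayer(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C, H, W) -- the N views of a scene are denoised jointly
        b, n, c, h, w = x.shape
        x = x.reshape(b * n, c, h, w)
        x = x + self.resnet(x)                              # pretrained 2D path
        tokens = x.reshape(b, n, c, h * w).transpose(2, 3)  # (B, N, HW, C)
        tokens = self.cross_frame_attn(tokens)              # share info across views
        tokens = self.projection(tokens)                    # 3D-aware fusion
        return tokens.transpose(2, 3).reshape(b, n, c, h, w)


if __name__ == "__main__":
    block = AugmentedUNetBlock(dim=64)
    views = torch.randn(2, 4, 64, 32, 32)   # 2 scenes, 4 views each
    print(block(views).shape)               # torch.Size([2, 4, 64, 32, 32])
```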
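
The autoregressive scheme from the second point can likewise be summarized as a sampling loop: views are produced batch by batch, and every new batch is denoised while conditioning on the images generated so far, which keeps later viewpoints consistent with earlier ones. The outline below is a hypothetical sketch; the `denoise_step` callable and the latent shape are stand-ins for the actual multi-view diffusion model, which would also take the text prompt and camera poses as input.

```python
# Hedged sketch of the autoregressive multi-view sampling idea (not the authors'
# API): each new batch of viewpoints conditions on all previously generated views.
from typing import Callable, List

import torch


@torch.no_grad()
def autoregressive_generation(
    denoise_step: Callable[[torch.Tensor, int, List[torch.Tensor]], torch.Tensor],
    num_views: int,
    views_per_batch: int = 4,
    num_steps: int = 50,
    latent_shape=(4, 64, 64),
) -> List[torch.Tensor]:
    generated: List[torch.Tensor] = []            # views rendered so far
    for start in range(0, num_views, views_per_batch):
        n = min(views_per_batch, num_views - start)
        latents = torch.randn(n, *latent_shape)   # fresh noise for the new views
        for t in reversed(range(num_steps)):
            # Every denoising step sees the previously generated views as
            # conditioning, so the new batch stays consistent with them.
            latents = denoise_step(latents, t, generated)
        generated.extend(latents)                 # keep results for conditioning
    return generated


if __name__ == "__main__":
    # Dummy denoiser that just damps the latents; a real model would also be
    # conditioned on the text prompt and the target camera poses.
    dummy = lambda x, t, prev: 0.9 * x
    views = autoregressive_generation(dummy, num_views=10)
    print(len(views), views[0].shape)             # 10 torch.Size([4, 64, 64])
```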

Results and Numerical Insights

The proposed method demonstrates significant improvements over existing approaches on standard image-quality metrics (their definitions are recalled after the list):

  • FID Reduction: Fréchet Inception Distance is roughly 30% lower than that of the compared baselines, indicating a marked improvement in image quality and realism.
  • KID Reduction: Kernel Inception Distance drops by about 37%, further confirming that the generated images match the real-image distribution more closely.
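
As a brief reminder of what these numbers measure: both metrics compare Inception-network feature statistics of generated and real images, and lower values are better, so a 30–37% reduction means the generated images' statistics sit considerably closer to the real-data distribution. The standard definitions (not specific to this paper) are:

```latex
% Standard definitions (Heusel et al. 2017 for FID; Binkowski et al. 2018 for KID).
% mu/Sigma are the mean and covariance of Inception features of real (r) and
% generated (g) images; d is the feature dimension.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
\qquad
\mathrm{KID} = \operatorname{MMD}^2(f_r, f_g)
\quad \text{with kernel } k(x, y) = \left(\tfrac{1}{d}\, x^{\top} y + 1\right)^{3}
```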

Overall, these results confirm the method's ability to generate diverse, high-quality, 3D-consistent images, outpacing earlier frameworks that often produced non-photorealistic objects without backgrounds.

Implications and Future Directions

Practical Implications: The capability to generate photorealistic 3D images from textual descriptions has considerable implications for industries such as gaming, virtual reality, and film. The integration of authentic surrounding generation suggests potential in constructing complex scene arrangements where continuity across views is critical.

Theoretical Implications: The paper extends the use of pretrained 2D diffusion models, highlighting that with appropriate architectural modifications, these models can be repurposed for sophisticated 3D tasks. The methodology can inspire future works aiming to leverage existing 2D models in 3D domains.

Speculation on Future AI Developments: As AI continues to evolve, one could anticipate the refinement of techniques that bridge the gap between 2D and 3D contexts. The integration of additional modalities or the fine-tuning of existing architectures to anticipate lighting or material variations could further enhance realism.

In conclusion, "ViewDiff" makes substantial strides in using pretrained text-to-image systems for generating 3D-consistent renderings. By integrating novel architectural enhancements, the method not only improves current state-of-the-art techniques but also provides a pathway for future explorations in AI-driven content creation across varying dimensions.
