ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models (2403.01807v2)

Published 4 Mar 2024 in cs.CV

Abstract: 3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent, and have favorable visual quality (-30% FID, -37% KID).

Authors (8)
  1. Lukas Höllein (8 papers)
  2. Norman Müller (16 papers)
  3. David Novotny (42 papers)
  4. Hung-Yu Tseng (31 papers)
  5. Christian Richardt (36 papers)
  6. Michael Zollhöfer (51 papers)
  7. Matthias Nießner (177 papers)
  8. Aljaž Božič (14 papers)
Citations (24)

Summary

3D-Consistent Image Generation via Text-to-Image Models: Insights from ViewDiff

The paper, "ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models," addresses the challenge of generating photorealistic, 3D-consistent images from text or posed input images using pretrained text-to-image (T2I) models. The work leverages existing T2I models to create high-quality images that are consistent across multiple viewpoints, integrating innovations in the architecture to enhance 3D understanding and rendering.

Core Contributions

The authors present several key contributions that set their work apart in the domain of text-to-3D image generation:

  1. Integration of 3D Volume-Rendering and Cross-Frame Attention: Projection (volume-rendering) layers and cross-frame-attention layers are embedded into each block of the pretrained T2I U-Net, so the model outputs multi-view-consistent images in a single denoising process (a simplified sketch of such an augmented block follows this list).
  2. Autoregressive Generation Scheme: New viewpoints are denoised while conditioning on previously generated images, which yields more 3D-consistent renderings of the same instance at arbitrary novel viewpoints (see the sampling sketch after the list).
  3. Architecture Augmentation with Cross-Frame Attention and Projection Layers: Together, these added layers keep object identity stable across views and markedly improve the generation of realistic surroundings.
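
To make the first and third points concrete, the sketch below shows, in simplified PyTorch, how one block of a pretrained T2I U-Net could be augmented with a cross-frame-attention layer and a projection layer. This is an illustrative sketch under stated assumptions, not the authors' implementation: the module names, feature dimensions, and the simplified per-view fusion standing in for true pose-aware volume rendering are all placeholders.

```python
# Illustrative sketch (not the released ViewDiff code) of a U-Net block augmented
# with cross-frame attention and a projection (volume-rendering) layer.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Attention computed jointly over the tokens of all N views of one scene."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, HW, C) -> flatten the views so every token attends across views
        b, n, t, c = x.shape
        tokens = x.reshape(b, n * t, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, n, t, c)


class ProjectionLayer(nn.Module):
    """Stand-in for the projection / volume-rendering layer: per-view features are
    fused into a shared representation and broadcast back to every view. A real
    implementation would unproject features into a 3D grid using the camera poses
    and volume-render them back into each view."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, HW, C)
        shared = self.fuse(x.mean(dim=1, keepdim=True))  # (B, 1, HW, C)
        return x + shared                                # inject shared 3D signal


class AugmentedUNetBlock(nn.Module):
    """One schematic T2I U-Net block with the two new layers inserted."""

    def __init__(self, dim: int):
        super().__init__()
        self.resnet = nn.Sequential(
            nn.GroupNorm(8, dim), nn.SiLU(), nn.Conv2d(dim, dim, 3, padding=1)
        )
        self.cross_frame_attn = CrossFrameAttention(dim)
        self.projection = ProjectionLayer(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C, H, W) -- the N views of a scene are denoised jointly
        b, n, c, h, w = x.shape
        x = x.reshape(b * n, c, h, w)
        x = x + self.resnet(x)                              # pretrained 2D path
        tokens = x.reshape(b, n, c, h * w).transpose(2, 3)  # (B, N, HW, C)
        tokens = self.cross_frame_attn(tokens)              # share info across views
        tokens = self.projection(tokens)                    # 3D-aware fusion
        return tokens.transpose(2, 3).reshape(b, n, c, h, w)


if __name__ == "__main__":
    block = AugmentedUNetBlock(dim=64)
    views = torch.randn(2, 4, 64, 32, 32)   # 2 scenes, 4 views each
    print(block(views).shape)               # torch.Size([2, 4, 64, 32, 32])
```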
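
The autoregressive scheme from the second point can likewise be summarized as a sampling loop: views are produced batch by batch, and every new batch is denoised while conditioning on the images generated so far, which keeps later viewpoints consistent with earlier ones. The outline below is a hypothetical sketch; the `denoise_step` callable and the latent shape are stand-ins for the actual multi-view diffusion model, which would also take the text prompt and camera poses as input.

```python
# Hedged sketch of the autoregressive multi-view sampling idea (not the authors'
# API): each new batch of viewpoints conditions on all previously generated views.
from typing import Callable, List

import torch


@torch.no_grad()
def autoregressive_generation(
    denoise_step: Callable[[torch.Tensor, int, List[torch.Tensor]], torch.Tensor],
    num_views: int,
    views_per_batch: int = 4,
    num_steps: int = 50,
    latent_shape=(4, 64, 64),
) -> List[torch.Tensor]:
    generated: List[torch.Tensor] = []            # views rendered so far
    for start in range(0, num_views, views_per_batch):
        n = min(views_per_batch, num_views - start)
        latents = torch.randn(n, *latent_shape)   # fresh noise for the new views
        for t in reversed(range(num_steps)):
            # Every denoising step sees the previously generated views as
            # conditioning, so the new batch stays consistent with them.
            latents = denoise_step(latents, t, generated)
        generated.extend(latents)                 # keep results for conditioning
    return generated


if __name__ == "__main__":
    # Dummy denoiser that just damps the latents; a real model would also be
    # conditioned on the text prompt and the target camera poses.
    dummy = lambda x, t, prev: 0.9 * x
    views = autoregressive_generation(dummy, num_views=10)
    print(len(views), views[0].shape)             # 10 torch.Size([4, 64, 64])
```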

Results and Numerical Insights

The proposed method demonstrates significant improvements over existing approaches on standard image-quality metrics (their definitions are recalled after the list):

  • FID Reduction: Fréchet Inception Distance is roughly 30% lower than that of the compared baselines, indicating a marked improvement in image quality and realism.
  • KID Reduction: Kernel Inception Distance drops by about 37%, further confirming that the generated images match the real-image distribution more closely.
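
As a brief reminder of what these numbers measure: both metrics compare Inception-network feature statistics of generated and real images, and lower values are better, so a 30–37% reduction means the generated images' statistics sit considerably closer to the real-data distribution. The standard definitions (not specific to this paper) are:

```latex
% Standard definitions (Heusel et al. 2017 for FID; Binkowski et al. 2018 for KID).
% mu/Sigma are the mean and covariance of Inception features of real (r) and
% generated (g) images; d is the feature dimension.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
\qquad
\mathrm{KID} = \operatorname{MMD}^2(f_r, f_g)
\quad \text{with kernel } k(x, y) = \left(\tfrac{1}{d}\, x^{\top} y + 1\right)^{3}
```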

Overall, these results confirm the method's ability to generate diverse, high-quality, 3D-consistent images, outpacing earlier frameworks that often produced non-photorealistic objects without backgrounds.

Implications and Future Directions

Practical Implications: The capability to generate photorealistic 3D images from textual descriptions has considerable implications for industries such as gaming, virtual reality, and film. The integration of authentic surrounding generation suggests potential in constructing complex scene arrangements where continuity across views is critical.

Theoretical Implications: The paper extends the use of pretrained 2D diffusion models, highlighting that with appropriate architectural modifications, these models can be repurposed for sophisticated 3D tasks. The methodology can inspire future works aiming to leverage existing 2D models in 3D domains.

Speculation on Future AI Developments: As AI continues to evolve, one could anticipate the refinement of techniques that bridge the gap between 2D and 3D contexts. The integration of additional modalities or the fine-tuning of existing architectures to anticipate lighting or material variations could further enhance realism.

In conclusion, "ViewDiff" makes substantial strides in using pretrained text-to-image systems for generating 3D-consistent renderings. By integrating novel architectural enhancements, the method not only improves current state-of-the-art techniques but also provides a pathway for future explorations in AI-driven content creation across varying dimensions.
