ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models (2403.01807v2)
Abstract: 3D asset generation is receiving massive attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often yields non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are 3D-consistent and have favorable visual quality (-30% FID, -37% KID).
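The abstract describes adding cross-frame-attention layers so that tokens from all views of a scene attend to each other inside each U-Net block. The following is a minimal, hypothetical sketch of that idea (not the authors' implementation): a single-head attention pass in which the token sequences of all `n_views` views of a scene are merged before attention and split apart afterwards, so that attention spans every view jointly. Projection weights are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(x, n_views):
    """Sketch of cross-frame attention.

    x: array of shape (batch * n_views, tokens, dim), one token sequence
    per view. Tokens of all views of the same scene attend to each other,
    which encourages multi-view consistency.
    """
    bv, t, d = x.shape
    b = bv // n_views
    # Merge the token sequences of all views of one scene.
    x = x.reshape(b, n_views * t, d)
    # Scaled dot-product attention over the merged sequence.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    out = softmax(scores) @ x
    # Split back into per-view sequences.
    return out.reshape(bv, t, d)

# 2 scenes, 5 views each, 16 tokens per view, 8 channels.
x = np.random.randn(2 * 5, 16, 8)
print(cross_frame_attention(x, n_views=5).shape)  # (10, 16, 8)
```

In a real denoising U-Net, such a layer would replace or augment the per-image self-attention, with learned query/key/value projections and multiple heads; the reshape trick above is the essential change that lets attention operate across frames.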
- Lukas Höllein
- Norman Müller
- David Novotny
- Hung-Yu Tseng
- Christian Richardt
- Michael Zollhöfer
- Matthias Nießner
- Aljaž Božič