Customizing Text-to-Image Diffusion with Object Viewpoint Control (2404.12333v2)
Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task: enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt.