Compositional Inversion for Stable Diffusion Models
Abstract: Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion.
- End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229. Springer.
- Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373.
- Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 248–255. Ieee.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In International Conference on Learning Representations.
- Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
- Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10696–10706.
- Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305.
- Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 770–778.
- Prompt-to-Prompt Image Editing with Cross-Attention Control. In International Conference on Learning Representations.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410.
- Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
- Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1931–1941.
- Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22511–22521.
- Microsoft coco: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
- Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125.
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In International Conference on Learning Representations.
- Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583): 607–609.
- Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, 1–11.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22500–22510.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479–36494.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278–25294.
- Denoising Diffusion Implicit Models. In International Conference on Learning Representations.
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16515–16525.
- Key-locked rank one editing for text-to-image personalization. In ACM SIGGRAPH 2023 Conference Proceedings, 1–11.
- Coached active learning for interactive video search. In Proceedings of the 19th ACM international conference on Multimedia, 443–452.
- Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing, 22(3): 955–968.
- Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14246–14255.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.
- Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5802–5810.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.