Break-A-Scene: Extracting Multiple Concepts from a Single Image (2305.16311v2)
Abstract: Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on learning a single concept from multiple images with variations in backgrounds and poses, and struggle to adapt to other scenarios. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the model's ability to combine multiple concepts in generated images. We quantitatively compare our method against several baselines using automatic metrics, and further validate the results with a user study. Finally, we showcase several applications of our method. The project page is available at: https://omriavrahami.com/break-a-scene/
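
To make the training recipe in the abstract concrete, here is a minimal PyTorch sketch of its three ingredients: union-sampling, the masked diffusion loss, and the cross-attention loss. This is an illustration under stated assumptions, not the authors' implementation: the function names (`union_sample`, `masked_diffusion_loss`, `cross_attention_loss`), the tensor shapes, and the way the per-handle attention maps are obtained from the U-Net are all assumed for the example.

```python
import random
import torch
import torch.nn.functional as F

def union_sample(concepts):
    """Union-sampling (assumed interface): pick a random subset of the
    scene's concepts, build a prompt from their handle tokens, and take
    the pixel-wise union (max) of their masks."""
    k = random.randint(1, len(concepts))
    subset = random.sample(concepts, k)
    prompt = "a photo of " + " and ".join(c["handle"] for c in subset)
    # Masks assumed to be float tensors of shape (1, H, W) with values in {0, 1}.
    union_mask = torch.stack([c["mask"] for c in subset]).amax(dim=0)
    return prompt, union_mask

def masked_diffusion_loss(noise_pred, noise, union_mask):
    """Masked diffusion loss: penalize the denoising error only inside
    the union of the sampled concepts' masks, so the handles learn the
    concepts rather than memorizing the whole scene."""
    return F.mse_loss(noise_pred * union_mask, noise * union_mask)

def cross_attention_loss(attn_maps, concept_masks):
    """Cross-attention loss (sketch): push each handle token's attention
    map toward its segmentation mask to discourage entanglement between
    concepts."""
    loss = 0.0
    for attn, mask in zip(attn_maps, concept_masks):
        # attn: (B, h, w) attention over image locations for one handle token;
        # mask: (B, 1, H, W) binary mask, downsampled to the attention resolution.
        target = F.interpolate(mask, size=attn.shape[-2:], mode="nearest").squeeze(1)
        loss = loss + F.mse_loss(attn, target)
    return loss / len(attn_maps)
```

In a training step, one would draw a concept subset with `union_sample`, run the diffusion model on the resulting prompt, and optimize the sum of the two losses; because a fresh subset is drawn each step, the model sees many concept combinations, which is what improves its ability to compose them at inference time.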