Break-A-Scene: Extracting Multiple Concepts from a Single Image (2305.16311v2)

Published 25 May 2023 in cs.CV, cs.GR, and cs.LG

Abstract: Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: https://omriavrahami.com/break-a-scene/


Summary

  • The paper presents a two-phase optimization that initially freezes weights to tune concept handles, then fine-tunes model weights to maintain distinct concept identities.
  • It employs a masked diffusion loss and a novel cross-attention loss to accurately disentangle and reproduce multiple visual concepts from a single scene.
  • Evaluations show marked improvements in prompt and identity similarity over methods like Textual Inversion, DreamBooth, and Custom Diffusion.

Break-A-Scene: Extracting Multiple Concepts from a Single Image

The paper "Break-A-Scene: Extracting Multiple Concepts from a Single Image" presents an approach to text-to-image (T2I) model personalization that addresses the challenge of extracting several distinct visual concepts from a single image. Prior methods predominantly learn a single concept from multiple images with varying backgrounds and poses, and struggle when asked to isolate multiple concepts within one image. The authors define this task as textual scene decomposition: assigning a dedicated text token to each concept identified in a single scene, thereby enabling fine-grained control over the generation of scenes through text prompts.

The proposed method augments the input image with masks that indicate the presence of the target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. The paper then introduces a two-phase customization process that optimizes a set of dedicated textual embeddings (handles), one per concept, together with the model weights. The first phase freezes the model weights and optimizes only the handles to obtain an initial reconstruction; the second phase gently fine-tunes the weights to capture each concept's identity without overfitting, preserving editability and contextual adaptability.
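To make the schedule concrete, the following is a minimal PyTorch-style sketch of how such a two-phase procedure might be organized; the optimizer choice, learning rates, and step counts are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative sketch of a two-phase customization schedule: first optimize only
# the new handle embeddings with the backbone frozen, then fine-tune the denoiser
# jointly at a much lower learning rate. All hyperparameters here are assumptions.
from typing import Callable, Iterable
import torch

def optimize(params: Iterable[torch.nn.Parameter], lr: float, steps: int,
             loss_fn: Callable[[], torch.Tensor]) -> None:
    """Run a simple optimization loop over the given parameters."""
    optimizer = torch.optim.AdamW(list(params), lr=lr)
    for _ in range(steps):
        loss = loss_fn()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def two_phase_customization(handle_embeddings: torch.nn.Parameter,
                            unet: torch.nn.Module,
                            loss_fn: Callable[[], torch.Tensor]) -> None:
    # Phase 1: backbone frozen, comparatively high LR on the handle embeddings
    # only, to obtain a quick initial reconstruction of each concept.
    unet.requires_grad_(False)
    optimize([handle_embeddings], lr=5e-3, steps=400, loss_fn=loss_fn)

    # Phase 2: unfreeze the denoiser and fine-tune it jointly with the handles
    # at a much lower LR, capturing identity while limiting overfitting.
    unet.requires_grad_(True)
    optimize(list(unet.parameters()) + [handle_embeddings],
             lr=2e-6, steps=800, loss_fn=loss_fn)
```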

A pivotal component of the methodology is a masked diffusion loss, which ensures that each handle faithfully reproduces its assigned concept. In addition, the method introduces a loss on cross-attention maps that keeps the handles disentangled, preventing one handle from absorbing another concept. The authors also propose union-sampling, a training strategy that samples unions of the concepts at each step, improving the model's ability to combine multiple concepts in a single generated image.
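The two losses can be sketched as follows; the tensor shapes, normalization, and relative weighting here are assumptions for illustration and may differ from the paper's exact formulation.

```python
# Illustrative sketch of the two training losses described above.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred: torch.Tensor,   # (B, C, H, W)
                          noise_true: torch.Tensor,   # (B, C, H, W)
                          masks: torch.Tensor          # (B, 1, H, W), union of concept masks
                          ) -> torch.Tensor:
    # Penalize the denoising error only inside the concept regions, so the
    # handles are not also forced to reconstruct the background.
    return (masks * (noise_pred - noise_true) ** 2).mean()

def cross_attention_loss(attn_maps: torch.Tensor,      # (B, K, H, W), one map per handle token
                         concept_masks: torch.Tensor   # (B, K, H, W), one mask per concept
                         ) -> torch.Tensor:
    # Encourage each handle's attention to agree with its own concept mask,
    # discouraging one handle from "leaking" onto another concept's region.
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    target = concept_masks / (concept_masks.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    return F.mse_loss(attn, target)
```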

Extensive automatic evaluations and a user study demonstrate the effectiveness of the approach, showing clear improvements in both prompt similarity and identity similarity over baselines such as Textual Inversion, DreamBooth, and Custom Diffusion. The quantitative and qualitative results indicate that the method balances fidelity to concept identity with adherence to textual prompts, making it well suited to single-image, multi-concept extraction.
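As an illustration of how such similarity metrics are commonly computed, the sketch below uses CLIP-space cosine similarity via Hugging Face Transformers; the paper's exact evaluation protocol (backbones, masking, averaging over prompts and seeds) may differ.

```python
# Hedged sketch of CLIP-based prompt- and identity-similarity metrics.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def prompt_similarity(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between a generated image and its text prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img, txt).item()

@torch.no_grad()
def identity_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between embeddings of the generated and source concept images."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
```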

The implications of this work are twofold. Practically, it enables more versatile applications in creative workflows by allowing elaborate, concept-level image recomposition. Theoretically, it motivates further research into scene decomposition and concept disentanglement, and into personalization methods that do not require large image collections.

Building on these advancements, future work may improve efficiency and speed and address current limitations, such as overfitting to the lighting and poses of the input image. Overall, the paper marks a significant step in refining T2I personalization, providing a practical toolset for generating complex visuals from a single input image and broadening the scope of AI-mediated image synthesis.
