Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models (2404.03913v1)

Published 5 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.

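The abstract sketches a two-stage inference pipeline: first generate a template image aligned with the prompt's semantics, then fuse each personalized concept's appearance into its region of the template while keeping the template's structure. Below is a minimal illustrative sketch of that flow, not the authors' implementation: the base checkpoint (runwayml/stable-diffusion-v1-5), the per-concept LoRA paths, the segmentation masks, and the pixel-space compositing used here as a stand-in for the paper's feature-level fusion are all assumptions.

```python
# Illustrative two-stage flow, assuming the diffusers library and a CUDA GPU.
# The fusion step is a simplified pixel-space stand-in, NOT the paper's method.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

# Stage 1: create a template image aligned with the prompt's semantics.
base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base model; not stated in the abstract
    torch_dtype=torch.float16,
).to("cuda")
prompt = "a dog and a cat sitting on a sofa"
template = base(prompt).images[0]

# Reuse the base weights for image-to-image editing of the template.
img2img = StableDiffusionImg2ImgPipeline(**base.components).to("cuda")

def fuse_concepts(template, concepts):
    """Simplified stand-in for concept fusion.

    concepts: list of (lora_path, concept_prompt, mask) tuples, where
    lora_path points to a personalized (fine-tuned) concept model,
    concept_prompt names that concept, and mask is a PIL 'L' image
    selecting the concept's region (assumed to come from a segmentation
    model). All three are hypothetical inputs for this sketch.
    """
    fused = template.copy()
    for lora_path, concept_prompt, mask in concepts:
        img2img.load_lora_weights(lora_path)  # swap in the personalized weights
        # Low strength keeps the template's structural details while the
        # personalized weights inject the target concept's appearance.
        rendered = img2img(prompt=concept_prompt, image=template,
                           strength=0.5).images[0]
        # Composite only inside this concept's region so appearances
        # do not blend across different subjects.
        fused = Image.composite(rendered, fused, mask)
        img2img.unload_lora_weights()
    return fused
```

In the paper, fusion happens inside the diffusion process (per-region appearance injection with structure retained), whereas this sketch composites finished images; it is meant only to mark where each ingredient, template, personalized weights, and region masks, enters the pipeline.
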
Authors (6)
  1. Gihyun Kwon (17 papers)
  2. Simon Jenni (25 papers)
  3. Dingzeyu Li (18 papers)
  4. Joon-Young Lee (61 papers)
  5. Jong Chul Ye (210 papers)
  6. Fabian Caba Heilbron (34 papers)
Citations (5)