
How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Published 23 Oct 2024 in cs.CV (arXiv:2410.17594v1)

Abstract: Custom diffusion models (CDMs) have attracted widespread attention due to their astonishing generative ability for personalized concepts. However, most existing CDMs unreasonably assume that personalized concepts are fixed and cannot change over time. Moreover, they heavily suffer from catastrophic forgetting and concept neglect on old personalized concepts when continually learning a series of new concepts. To address these challenges, we propose a novel Concept-Incremental text-to-image Diffusion Model (CIDM), which can resolve catastrophic forgetting and concept neglect to learn new customization tasks in a concept-incremental manner. Specifically, to surmount the catastrophic forgetting of old concepts, we develop a concept consolidation loss and an elastic weight aggregation module. They explore task-specific and task-shared knowledge during training, and aggregate all low-rank weights of old concepts based on their contributions during inference. Moreover, to address concept neglect, we devise a context-controllable synthesis strategy that leverages expressive region features and noise estimation to control the contexts of generated images according to user conditions. Experiments validate that our CIDM surpasses existing custom diffusion models. The source code is available at https://github.com/JiahuaDong/CIFC.
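The elastic weight aggregation idea described in the abstract — merging the low-rank (LoRA-style) weight updates learned for each old concept, weighted by each concept's contribution, at inference time — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the softmax normalization of contribution scores, and the toy dimensions are all assumptions.

```python
import numpy as np

def aggregate_lora_weights(lora_factors, scores):
    """Merge per-task low-rank updates Delta_W_t = B_t @ A_t into a single
    weight update, weighting each task by a softmax over its contribution score.

    lora_factors: list of (B_t, A_t) pairs, B_t of shape (d_out, r),
                  A_t of shape (r, d_in); rank r may differ per task.
    scores:       1-D array of per-task contribution scores (assumed given).
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Numerically stable softmax turns raw scores into mixing coefficients.
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()

    d_out = lora_factors[0][0].shape[0]
    d_in = lora_factors[0][1].shape[1]
    delta_w = np.zeros((d_out, d_in))
    for (B, A), alpha in zip(lora_factors, alphas):
        delta_w += alpha * (B @ A)  # weighted sum of low-rank updates
    return delta_w, alphas

# Toy usage: two "concepts" with ranks 3 and 2 on a 4x5 weight matrix.
factors = [
    (np.ones((4, 3)), np.ones((3, 5))),          # Delta_W_1 = 3 everywhere
    (np.ones((4, 2)), 2.0 * np.ones((2, 5))),    # Delta_W_2 = 4 everywhere
]
delta_w, alphas = aggregate_lora_weights(factors, scores=[0.0, 0.0])
```

With equal scores the coefficients are 0.5 each, so the merged update is the plain average of the per-concept updates; in the paper the scores would instead reflect each old concept's contribution to the current generation.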
