
Multi-LoRA Composition for Image Generation (2402.16843v2)

Published 26 Feb 2024 in cs.CV, cs.AI, cs.CL, cs.GR, and cs.LG

Abstract: Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. The code, benchmarks, LoRA weights, and all evaluation details are available on our project website: https://maszhongming.github.io/Multi-LoRA-Composition.

Enhancing Text-to-Image Models with Multi-LoRA Composition

Introduction

The ability to generate complex images by integrating multiple specific elements through Low-Rank Adaptation (LoRA) represents a significant advancement in the field of generative text-to-image models. Despite the precision and computational efficiency offered by LoRA, the challenge of composing multiple LoRAs, especially as the number increases, remains a notable limitation. This paper confronts this challenge by proposing two novel, training-free methods to improve multi-LoRA composition: LoRA Switch and LoRA Composite. These methods are evaluated using a newly developed testbed, ComposLoRA, demonstrating a substantial improvement over existing composition techniques.

Multi-LoRA Composition Methodology

Underlying Challenges

The difficulty of image generation grows rapidly with the number of specific elements, or LoRAs, to be integrated. Previous methods struggled to scale and to compose multiple LoRAs realistically because they relied on weight manipulation, which often produced unstable merges and degraded the interaction between the LoRAs and the base model.
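To make the weight-manipulation baseline concrete, here is a minimal sketch of naive LoRA merging: each LoRA contributes a low-rank update scale * (B @ A) added directly into the base weight matrix. The function name and shapes are illustrative, not from any released code; real merging operates per-layer across the whole model.

```python
import numpy as np

def merge_loras(base_weight, loras, scales):
    """Naively merge several LoRAs into one base weight matrix.

    Each LoRA is a pair (A, B) contributing the low-rank update
    scale * (B @ A). Summing many such updates mutates the base
    weights directly, which is the unstable merging strategy the
    decoding-centric methods in this paper avoid.
    """
    merged = base_weight.copy()
    for (A, B), s in zip(loras, scales):
        merged += s * (B @ A)  # rank-r update, r = A.shape[0]
    return merged

# Hypothetical shapes: an 8x16 base layer and two rank-4 LoRAs.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
lora1 = (rng.normal(size=(4, 16)), rng.normal(size=(8, 4)))  # (A, B)
lora2 = (rng.normal(size=(4, 16)), rng.normal(size=(8, 4)))
W_merged = merge_loras(W, [lora1, lora2], scales=[0.8, 0.8])
```

Because the updates are summed into shared weights, two LoRAs trained for different elements can interfere; the guidance scales must then be hand-tuned, which is the instability the paper attributes to this family of methods.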

Proposed Solutions

The paper presents two innovative approaches that maintain the integrity of LoRA weights while addressing compositional challenges:

  • LoRA Switch (LoRA-s): This approach selectively activates a single LoRA at each denoising step of the image generation process, systematically rotating among multiple LoRAs. It ensures that each element is given focused attention, thus preserving the quality of both the specific elements and the overall image.
  • LoRA Composite (LoRA-c): Drawing from the concept of classifier-free guidance, this method calculates unconditional and conditional score estimates for each LoRA at every denoising step. By averaging these scores, it provides balanced guidance for image synthesis, ensuring cohesive integration of all elements.
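The two mechanisms above can be sketched in a few lines. This is an illustrative reconstruction from the paper's description, not the authors' released implementation: `lora_switch_schedule` captures the round-robin activation of LoRA Switch, and `lora_composite_score` captures LoRA Composite's averaging of per-LoRA classifier-free-guidance scores; score vectors here are plain arrays standing in for noise predictions.

```python
import numpy as np

def lora_switch_schedule(num_loras, num_steps):
    """LoRA Switch: activate exactly one LoRA per denoising step,
    rotating through all LoRAs in a fixed round-robin order."""
    return [step % num_loras for step in range(num_steps)]

def lora_composite_score(uncond_scores, cond_scores, guidance_scale=7.5):
    """LoRA Composite: average per-LoRA classifier-free-guidance scores.

    Each LoRA i supplies an unconditional score e_i and a conditional
    score e_i(c); the combined guidance is the mean over LoRAs of
    e_i + w * (e_i(c) - e_i), with w the guidance scale.
    """
    uncond = np.stack(uncond_scores)  # shape: (num_loras, ...)
    cond = np.stack(cond_scores)
    guided = uncond + guidance_scale * (cond - uncond)
    return guided.mean(axis=0)

# With 3 LoRAs over 6 steps, LoRA Switch activates them 0,1,2,0,1,2.
schedule = lora_switch_schedule(num_loras=3, num_steps=6)
```

In a real pipeline the active LoRA (or the per-LoRA score passes) would be swapped inside the sampler's denoising loop at each timestep; only the decoding procedure changes, so the LoRA weights themselves are never merged or modified.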

Evaluation Framework

A novel evaluation framework, ComposLoRA, was established to assess the effectiveness of the proposed methods, featuring a comprehensive array of LoRA categories and composition sets. The framework employs GPT-4V for evaluating the quality of images and the success of compositions. Both automated and human evaluations affirm the superior performance of LoRA Switch and LoRA Composite methods over traditional LoRA merging approaches, especially noticeable as the number of LoRAs in a composition increases.

Implications and Future Directions

The proposed decoding-centric perspective on multi-LoRA composition offers a promising advancement in the field of text-to-image generation. By overcoming the limitations of weight manipulation methods, the paper paves the way for more complex and detailed image generation capabilities. The introduction of the ComposLoRA testbed and the employment of GPT-4V as an evaluator represent significant contributions to the standardization and assessment of image generation tasks.

Future research may optimize the activation sequences and intervals for LoRA Switch, probe how composition quality varies across image styles, and address the positional bias identified in GPT-4V evaluations. Moreover, the broader applicability of LoRA-based methods in other domains of AI could be an exciting avenue for exploration, potentially enhancing the customization and precision of generative models beyond images.

In conclusion, this paper not only addresses a critical gap in our understanding of multi-LoRA composition but also sets a foundation for future advancements in generative AI, offering both theoretical and practical contributions to the field.

Authors (9)
  1. Ming Zhong
  2. Yelong Shen
  3. Shuohang Wang
  4. Yadong Lu
  5. Yizhu Jiao
  6. Siru Ouyang
  7. Donghan Yu
  8. Jiawei Han
  9. Weizhu Chen