Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization (2403.19456v2)

Published 28 Mar 2024 in cs.CV, cs.GR, and cs.MM

Abstract: Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images in which content and style are entangled. In this study, we reconsider the customization of content and style concepts from the perspective of parameter-space construction. Unlike existing methods that use a shared parameter space for content and style, we propose a learning framework that separates the parameter space to facilitate individual learning of content and style, thereby enabling disentangled content and style. To achieve this goal, we introduce "partly learnable projection" (PLP) matrices that separate the original adapters into divided sub-parameter spaces. Based on PLP, we propose a simple yet effective "break-for-make" customization learning pipeline: we break the original adapters into "up projection" and "down projection" components, train content and style PLPs individually in separate adapters under the guidance of their corresponding textual prompts, and maintain generalization by employing a multi-correspondence projection learning strategy. From the adapters broken apart for separately training content and style, we then make the complete parameter space by reconstructing the content and style PLP matrices, followed by fine-tuning the combined adapter to generate the target object with the desired appearance. Experiments on various styles, including textures, materials, and artistic style, show that our method outperforms state-of-the-art single- and multi-concept learning pipelines in terms of content-style-prompt alignment.
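
To make the pipeline concrete, here is a minimal PyTorch sketch of the break-for-make idea, assuming a standard LoRA parameterization (delta_W = up @ down) on a frozen linear layer. The class name `PLPLinear`, the per-concept rank split, and the training schedule are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of "break-for-make", assuming a LoRA adapter whose rank
# dimension is partitioned into a content sub-space and a style sub-space
# ("partly learnable projections"). Names and rank sizes are hypothetical.
import torch
import torch.nn as nn

class PLPLinear(nn.Module):
    """Frozen base linear layer plus a LoRA adapter split into content and
    style sub-parameter spaces."""

    def __init__(self, base: nn.Linear, r_content: int = 4, r_style: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)  # pre-trained weights stay frozen
        d_out, d_in = base.weight.shape
        # "Break": separate down/up projections per concept.
        self.down_c = nn.Parameter(torch.randn(r_content, d_in) * 0.01)
        self.up_c = nn.Parameter(torch.zeros(d_out, r_content))
        self.down_s = nn.Parameter(torch.randn(r_style, d_in) * 0.01)
        self.up_s = nn.Parameter(torch.zeros(d_out, r_style))

    def set_trainable(self, concept: str):
        """Enable gradients for only one sub-space ("break" phase)."""
        for p, is_content in [(self.down_c, True), (self.up_c, True),
                              (self.down_s, False), (self.up_s, False)]:
            p.requires_grad_(is_content == (concept == "content"))

    def forward(self, x):
        # Both sub-spaces contribute additively to the adapter update.
        delta = x @ self.down_c.T @ self.up_c.T + x @ self.down_s.T @ self.up_s.T
        return self.base(x) + delta

layer = PLPLinear(nn.Linear(768, 768))
layer.set_trainable("content")   # "break": fit the content PLP on content images
# ... content training loop, guided by the content text prompt ...
layer.set_trainable("style")     # then fit the style PLP on style images
# ... style training loop, guided by the style text prompt ...
for p in [layer.down_c, layer.up_c, layer.down_s, layer.up_s]:
    p.requires_grad_(True)       # "make": jointly fine-tune the merged adapter
```

The key design point the sketch illustrates is that content and style never share trainable parameters during the "break" phase, so the two concepts cannot entangle; only in the "make" phase are the reconstructed sub-spaces fine-tuned together.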
