X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model (2312.02238v3)
Abstract: We introduce X-Adapter, a universal upgrader that enables pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this by training an additional network to control the frozen upgraded model with new text-image data pairs. Specifically, X-Adapter keeps a frozen copy of the old model to preserve the connectors of the different plugins, and adds trainable mapping layers that bridge the decoders of the two model versions for feature remapping. The remapped features are used as guidance for the upgraded model. To enhance X-Adapter's guidance ability, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to these strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionality of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments, and the results show that X-Adapter can facilitate the wider application of plugins with the upgraded foundational diffusion model.
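The abstract's core mechanism, trainable mapping layers that remap decoder features from the frozen old model into guidance for the frozen upgraded model, can be pictured with a short sketch. The module names, channel widths, zero-initialization, and residual fusion below are illustrative assumptions for this minimal sketch, not the paper's released implementation; the null-text training and two-stage denoising strategies are not modeled here.

```python
# Minimal sketch of the X-Adapter idea (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn


class MappingLayer(nn.Module):
    """Remaps one decoder feature of the old (frozen) U-Net to the channel
    width of the corresponding decoder stage in the upgraded U-Net."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
        )
        # Zero-initialize the final projection so the guidance starts as a
        # no-op and grows during training -- a common adapter trick (assumed here).
        nn.init.zeros_(self.block[-1].weight)
        nn.init.zeros_(self.block[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


class XAdapterSketch(nn.Module):
    """One mapping layer per bridged decoder stage. Both U-Nets stay frozen;
    only these mapping layers would be trained."""

    def __init__(self, old_channels=(1280, 640, 320), new_channels=(1280, 1280, 640)):
        super().__init__()
        self.mappers = nn.ModuleList(
            MappingLayer(c_old, c_new)
            for c_old, c_new in zip(old_channels, new_channels)
        )

    def forward(self, old_decoder_feats):
        # old_decoder_feats: list of decoder features from the frozen old model.
        # Returns residual guidance tensors to be added to the matching decoder
        # stages of the upgraded (e.g., SDXL-style) U-Net.
        return [m(f) for m, f in zip(self.mappers, old_decoder_feats)]


if __name__ == "__main__":
    adapter = XAdapterSketch()
    # Toy tensors standing in for the frozen old model's decoder outputs.
    feats = [
        torch.randn(1, 1280, 16, 16),
        torch.randn(1, 640, 32, 32),
        torch.randn(1, 320, 64, 64),
    ]
    guidance = adapter(feats)
    # Spatial sizes may differ between the two U-Nets; in practice the guidance
    # would be resized to the target stage's resolution before being added.
    print([g.shape for g in guidance])
```

Because the old model and its plugin connectors stay frozen, existing plugins keep working unmodified; the mapping layers are the only trained component and carry the plugin's effect across model versions.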
- Animecreative. https://civitai.com/models/146785.
- Kawaiitech. https://civitai.com/models/94663.
- Vangoghportraiture. https://civitai.com/models/157794.
- Animeoutline. https://civitai.com/models/16014.
- Moxin. https://civitai.com/models/12597.
- Toonyou. https://civitai.com/models/30240.
- Stability AI. https://huggingface.co/runwayml/stable-diffusion-v1-5, a.
- Stability AI. https://huggingface.co/stabilityai/stable-diffusion-2-1-base, b.
- Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
- Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233, 2021.
- An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799, 2019.
- Civitai. https://civitai.com/.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021.
- Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023.
- Bridge diffusion model: bridge non-english language-native text-to-image diffusion model with english communities. arXiv preprint arXiv:2309.00952, 2023.
- Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2022.
- Midjourney. https://www.midjourney.com/.
- T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2022.
- OpenAI. Dall-e2. https://openai.com/dall-e-2, a.
- OpenAI. Dall-e3. https://openai.com/dall-e-3, b.
- Semantic image synthesis with spatially-adaptive normalization. arXiv preprint arXiv:1903.07291, 2019.
- Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- U-net: Convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597, 2015.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2023.
- Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
- Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
- Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- Styleadapter: A single-pass lora-free model for stylized image generation. arXiv preprint arXiv:2309.01770, 2023.
- Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- Inserting anybody in diffusion models via celeb basis. arXiv preprint arXiv:2306.00926, 2023.
- Taca: Upgrading your visual foundation model with task-agnostic compatible adapter. arXiv preprint arXiv:2306.12642, 2023a.
- Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023b.
- Adding conditional control to text-to-image diffusion models. In ICCV, 2023c.
- Revisit parameter-efficient transfer learning: A two-stage paradigm. arXiv preprint arXiv:2303.07910, 2023.