Visual Style Prompting with Swapping Self-Attention (2402.12974v2)
Abstract: In the evolving domain of text-to-image generation, diffusion models have emerged as powerful tools for content creation. Despite their remarkable capability, existing models still struggle to achieve controlled generation with a consistent style: they either require costly fine-tuning or transfer visual elements inadequately due to content leakage. To address these challenges, we propose a novel approach, visual style prompting, to produce a diverse range of images while maintaining specific style elements and nuances. During the denoising process, we keep the query from the original features while swapping the key and value with those from the reference features in the late self-attention layers. This approach enables visual style prompting without any fine-tuning, ensuring that generated images maintain a faithful style. Extensive evaluation across various styles and text prompts shows that our method outperforms existing approaches, reflecting the style of the references most faithfully while matching the text prompts most accurately. Our project page is available at https://curryjung.github.io/VisualStylePrompt/.
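To make the core operation concrete, below is a minimal PyTorch sketch of the swapped self-attention described in the abstract: the query comes from the features of the image being generated, while the key and value are taken from the features of the style reference. The function name, tensor shapes, and head count are illustrative assumptions, not the paper's implementation; in the actual method this substitution is applied only in the late self-attention layers of the diffusion model during denoising.

```python
import torch
import torch.nn.functional as F

def swapping_self_attention(q_orig, k_ref, v_ref, num_heads=8):
    """Hypothetical sketch: attention with the query from the original
    (content) features and key/value from the reference (style) features.

    q_orig:        (batch, seq_len, dim) queries from the generation pass
    k_ref, v_ref:  (batch, seq_len, dim) keys/values from the reference pass
    """
    b, n, d = q_orig.shape
    head_dim = d // num_heads
    # Split into attention heads: (batch, heads, seq_len, head_dim).
    split = lambda t: t.view(b, -1, num_heads, head_dim).transpose(1, 2)
    q, k, v = split(q_orig), split(k_ref), split(v_ref)
    # Standard scaled dot-product attention (Vaswani et al., 2017);
    # the only change is where k and v come from.
    out = F.scaled_dot_product_attention(q, k, v)
    # Merge heads back to (batch, seq_len, dim).
    return out.transpose(1, 2).reshape(b, n, d)
```

In this reading, the original keys and values of the selected layers are simply discarded: q_orig is produced by the denoising pass for the text prompt, while k_ref and v_ref are taken from the same layer of a parallel denoising pass over the style reference image.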