ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model (2405.15287v2)
Abstract: Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. In this paper, we present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges such as misinterpreted styles and inconsistent semantics. Our approach introduces two innovative modules: the mixed style descriptor and the dynamic attention adapter. The mixed style descriptor enhances SD by combining content-aware and frequency-disentangled embeddings from CLIP with additional sources that capture global statistics and textual information, providing a richer blend of style-related and semantic-related knowledge. To achieve a better balance between adapter capacity and semantic control, the dynamic attention adapter is integrated into the diffusion UNet and computes adaptation weights from the style descriptors. Additionally, we introduce two objective functions, optimized alongside the denoising loss, to further enhance semantic and style consistency. Extensive experiments demonstrate the superiority of ArtWeaver over existing methods, producing images in diverse target styles while preserving the semantic integrity of the text prompts.
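As a rough illustration of how the two modules described above might fit together, the following is a minimal PyTorch sketch: a style descriptor assembled from several embedding sources, and a LoRA-style adapter whose update strength is gated by that descriptor. All names, shapes, the concatenation scheme, and the gating network are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def mixed_style_descriptor(clip_content, clip_freq, global_stats, text_emb):
    # Hypothetical composition: the paper combines content-aware and
    # frequency-disentangled CLIP embeddings with global statistics and
    # textual information; plain concatenation is an assumption here.
    return torch.cat([clip_content, clip_freq, global_stats, text_emb], dim=-1)


class DynamicAttentionAdapter(nn.Module):
    """Sketch of an adapter that modulates a low-rank update to a UNet
    attention layer with weights predicted from the style descriptor.
    Module structure and the gating scheme are assumptions."""

    def __init__(self, dim: int, style_dim: int, rank: int = 4):
        super().__init__()
        # LoRA-style low-rank branch, kept small so the style signal
        # does not overwhelm semantic control from the text prompt.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping
        # Predicts per-channel adaptation weights from the style descriptor.
        self.gate = nn.Sequential(nn.Linear(style_dim, dim), nn.Sigmoid())

    def forward(self, hidden: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, dim) attention features from the UNet
        # style:  (batch, style_dim)  mixed style descriptor
        weights = self.gate(style).unsqueeze(1)              # (batch, 1, dim)
        return hidden + weights * self.up(self.down(hidden))


# Usage with placeholder shapes:
adapter = DynamicAttentionAdapter(dim=320, style_dim=1024)
out = adapter(torch.randn(2, 77, 320), torch.randn(2, 1024))
```

Zero-initializing the up-projection means the adapter initially leaves the pretrained SD features untouched, so style adaptation is learned gradually, which is one plausible way to realize the balance between adapter capacity and semantic control that the abstract emphasizes.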