Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt (2404.11474v3)
Abstract: Artistic style transfer aims to transfer a learned artistic style onto an arbitrary content image, producing an artistic stylized image. Existing generative adversarial network (GAN)-based methods fail to generate highly realistic stylized images and often introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models have opened up a new way to generate highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of the input content image, introducing undesired content structures and style patterns. To address these problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which generates highly realistic artistic stylized images while preserving the content structure of the input content image, without introducing obvious artifacts or disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts that learns style information from a collection of artworks and dynamically adjusts the input image's content structure and style patterns. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artwork collection. In addition, we inject a pre-trained conditional branch of ControlNet into LSAST, which further improves our framework's ability to maintain content structure. Extensive experiments demonstrate that our method generates more realistic artistic stylized images than state-of-the-art artistic style transfer methods.
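The abstract describes three mechanisms: a learnable prompt space indexed by denoising step and U-Net layer, an inversion procedure that trains only those prompts on an artwork collection, and an injected ControlNet branch for structure preservation. The paper text here includes no code, so the following is a minimal PyTorch sketch of what a step-aware and layer-aware prompt space could look like; all names, shapes, and bucket counts (`num_step_buckets`, `num_layer_groups`, `tokens_per_prompt`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StepLayerPromptSpace(nn.Module):
    """Illustrative sketch (not the authors' code) of a step-aware and
    layer-aware prompt space: a grid of learnable prompt embeddings
    indexed by (denoising-step bucket, U-Net layer group), intended as
    cross-attention context for a frozen Stable Diffusion U-Net."""

    def __init__(self, num_step_buckets=10, num_layer_groups=3,
                 tokens_per_prompt=8, embed_dim=768, total_steps=1000):
        super().__init__()
        self.total_steps = total_steps
        self.num_step_buckets = num_step_buckets
        # One learnable prompt (a few embedding tokens) per combination of
        # step bucket and layer group; small init, like token embeddings.
        self.prompts = nn.Parameter(
            0.02 * torch.randn(num_step_buckets, num_layer_groups,
                               tokens_per_prompt, embed_dim))

    def forward(self, t: int, layer_group: int) -> torch.Tensor:
        # Map a diffusion timestep t in [0, total_steps) onto a bucket, so
        # early (noisy) and late (detail) steps select different prompts.
        bucket = min(t * self.num_step_buckets // self.total_steps,
                     self.num_step_buckets - 1)
        # Returned tokens, shape (tokens_per_prompt, embed_dim), would be
        # fed to the cross-attention layers of the given layer group.
        return self.prompts[bucket, layer_group]

# Usage: fetch the prompt for timestep 800 at the last layer group.
space = StepLayerPromptSpace()
ctx = space(t=800, layer_group=2)  # -> tensor of shape (8, 768)
```

Under this reading, the described prompt inversion would backpropagate the standard diffusion denoising loss into `space.prompts` alone, while the pre-trained U-Net and the injected ControlNet branch stay frozen; the selected tokens replace or augment the text-encoder context at each step and layer group.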
Authors: Zhanjie Zhang, Quanwei Zhang, Huaizhong Lin, Wei Xing, Juncheng Mo, Shuaicheng Huang, Jinheng Xie, Guangyuan Li, Junsheng Luan, Lei Zhao, Dalong Zhang, Lixia Chen