
FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models (2401.15636v3)

Published 28 Jan 2024 in cs.CV and eess.IV

Abstract: The rapid development of generative diffusion models has significantly advanced the field of style transfer. However, most current diffusion-based style transfer methods involve a slow iterative optimization process, e.g., model fine-tuning or textual inversion of style concepts. In this paper, we introduce FreeStyle, an innovative style transfer method built upon a pre-trained large diffusion model that requires no further optimization. Moreover, our method enables style transfer through a text description of the desired style alone, eliminating the need for style images. Specifically, we propose a dual-stream encoder and single-stream decoder architecture, replacing the conventional U-Net in diffusion models. In the dual-stream encoder, two distinct branches take the content image and the style text prompt as inputs, decoupling content from style. In the decoder, we further modulate features from the dual streams based on the given content image and the corresponding style text prompt for precise style transfer. Our experimental results demonstrate the high synthesis quality and fidelity of our method across various content images and style text prompts. Compared with state-of-the-art methods that require training, FreeStyle removes the computational burden of thousands of optimization iterations while achieving comparable or superior performance on multiple evaluation metrics, including CLIP Aesthetic Score, CLIP Score, and Preference. We have released the code at: https://github.com/FreeStyleFreeLunch/FreeStyle.
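
To make the dual-stream fusion concrete, the sketch below illustrates the general idea of decoder-side feature modulation: a backbone feature from the style (text-conditioned) stream and a skip feature from the content (image) stream are rescaled before a standard U-Net decoder block combines them. The `DecoderBlock` and `modulate_and_fuse` helpers and the scale factors `b` and `s` are illustrative assumptions for this sketch, not the paper's exact modulation scheme.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One generic U-Net decoder stage: upsample the backbone feature,
    concatenate a skip connection, then convolve."""
    def __init__(self, backbone_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(backbone_ch, backbone_ch, 2, stride=2)
        self.conv = nn.Conv2d(backbone_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, backbone_feat, skip_feat):
        x = self.up(backbone_feat)
        return torch.relu(self.conv(torch.cat([x, skip_feat], dim=1)))

def modulate_and_fuse(block, style_backbone, content_skip, b=1.2, s=0.8):
    # Hypothetical fusion: amplify the text-conditioned (style) backbone
    # features and attenuate the content-image skip features before the
    # decoder block combines them. The scalars b and s are illustrative
    # knobs, not values taken from the paper.
    return block(b * style_backbone, s * content_skip)

# Toy shapes: style backbone 64ch at 16x16, content skip 32ch at 32x32.
block = DecoderBlock(backbone_ch=64, skip_ch=32, out_ch=32)
style_backbone = torch.randn(1, 64, 16, 16)  # from the style-text stream
content_skip = torch.randn(1, 32, 32, 32)    # from the content-image stream
fused = modulate_and_fuse(block, style_backbone, content_skip)
print(fused.shape)  # torch.Size([1, 32, 32, 32])
```

In the actual method, both encoder streams would presumably share the weights of the same pre-trained diffusion U-Net, so the training-free property comes entirely from reusing frozen features and modulating them at fusion time.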

Authors (7)
  1. Feihong He (11 papers)
  2. Gang Li (579 papers)
  3. Mengyuan Zhang (19 papers)
  4. Lingyu Si (23 papers)
  5. Fuhui Sun (1 paper)
  6. Xiaoyan Wang (27 papers)
  7. Li Shen (363 papers)
Citations (8)
