DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization (2211.10682v2)

Published 19 Nov 2022 in cs.CV and cs.GR

Abstract: Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed for transferring a natural image into a stylized one according to textual descriptions of the target style provided by the user. Unlike the previous image-to-image transfer approaches, text-guided stylization progress provides users with a more precise and intuitive way to express the desired style. However, the huge discrepancy between cross-modal inputs/outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, a dual diffusion processing architecture to control the balance between the content and style of the diffused results. The cross-modal style information can be easily integrated as guidance during the diffusion process step-by-step. Furthermore, we propose a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure information of the content image. We validate the proposed DiffStyler beyond the baseline methods through extensive qualitative and quantitative experiments. Code is available at \url{https://github.com/haha-lisa/Diffstyler}.

Authors (8)
  1. Nisha Huang (10 papers)
  2. Yuxin Zhang (91 papers)
  3. Fan Tang (46 papers)
  4. Chongyang Ma (52 papers)
  5. Haibin Huang (60 papers)
  6. Yong Zhang (660 papers)
  7. Weiming Dong (50 papers)
  8. Changsheng Xu (100 papers)
Citations (37)

Summary

Text-Driven Image Stylization using Dual Diffusion: An Overview of DiffStyler

The paper "DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization," authored by Huang et al., introduces DiffStyler, a novel framework designed to facilitate text-driven image stylization via a dual-diffusion architecture. This work emerges against the backdrop of prevailing challenges in the domain of text-driven image stylization, where the primary aim is to stylize an image based on descriptive textual input. This objective distinguishes it from traditional image-centric style transfer methods, which necessitate a reference style image to guide the stylization process.

Overview of DiffStyler's Approach

DiffStyler Architecture

At its core, DiffStyler couples two diffusion processes to balance control over content and style in the stylized output. This is a departure from conventional feed-forward CNN pipelines, which struggle to maintain content fidelity when inputs and outputs span different modalities, such as text and image. DiffStyler uses two diffusion models that process the text prompt and the content image separately; their interplay enables the system to synthesize an image that captures the specified artistic style without compromising the content of the original image.
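
To make the dual-path idea concrete, the sketch below shows one possible form of a single reverse (denoising) step in which the noise predictions of a text-conditioned style branch and a content-oriented branch are blended before the update. This is a minimal PyTorch-style illustration under assumed names (`style_model`, `content_model`, `lambda_style`); it is not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def dual_diffusion_step(x_t, t, text_emb, style_model, content_model,
                        alpha_bar, lambda_style=0.5):
    """One reverse step that blends two diffusion branches (illustrative).

    x_t           : current noisy image, shape (B, C, H, W)
    t             : current timestep index (assumed t > 0)
    text_emb      : embedding of the style prompt (e.g., a CLIP text embedding)
    style_model   : diffusion network conditioned on the text prompt (style branch)
    content_model : diffusion network biased toward preserving structure (content branch)
    alpha_bar     : cumulative noise schedule, shape (T,)
    lambda_style  : weight balancing style vs. content guidance (assumed knob)
    """
    # Each branch predicts the noise it believes was added at step t.
    eps_style = style_model(x_t, t, context=text_emb)
    eps_content = content_model(x_t, t)

    # Blend the predictions; a larger lambda_style favors stylization.
    eps = lambda_style * eps_style + (1.0 - lambda_style) * eps_content

    # DDIM-style estimate of the clean image from the blended noise.
    a_t = alpha_bar[t].view(-1, 1, 1, 1)
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)

    # Step toward the previous, less noisy timestep.
    a_prev = alpha_bar[t - 1].view(-1, 1, 1, 1)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
```

In practice the blending weight could also be scheduled over timesteps, so that early steps fix the overall layout and later steps refine stylistic detail.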

Content Preservation through Learnable Noise

One of the paper's key contributions is the use of learnable noise during the stylization process. Diffusion models typically start the reverse process from random Gaussian noise, which can discard fine structural details of the input image. By basing the reverse denoising process on content-aware learnable noise, DiffStyler better retains the structure and geometry of the original image, overcoming a significant limitation of earlier methods.
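
As a rough illustration of this idea, the snippet below initializes the reverse process from a forward-diffused version of the content image and adds a learnable residual on top, so structural cues survive while the starting point can still be optimized against a stylization objective. The class name and the residual parameterization are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContentAwareNoise(nn.Module):
    """Learnable starting noise built from the content image (illustrative).

    Rather than starting the reverse process from pure Gaussian noise, the
    content image is forward-diffused to the final timestep and a learnable
    residual is added, so the starting point keeps structural cues from the
    content image while remaining trainable.
    """

    def __init__(self, content_img, alpha_bar_T):
        super().__init__()
        # alpha_bar_T: cumulative schedule value at the final timestep (0-dim tensor).
        eps = torch.randn_like(content_img)
        # Content-anchored part: forward diffusion of the content image to step T.
        base = (alpha_bar_T.sqrt() * content_img
                + (1.0 - alpha_bar_T).sqrt() * eps)
        self.register_buffer("base", base)
        # Learnable part, optimized jointly with the stylization objective.
        self.residual = nn.Parameter(torch.zeros_like(content_img))

    def forward(self):
        return self.base + self.residual

# Usage sketch: x_T = ContentAwareNoise(content_img, alpha_bar[-1])()
# and x_T is then fed to the reverse denoising loop.
```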

Numerical Method Optimizations

The authors also employ numerical solvers that integrate the diffusion sampling trajectory more accurately, enhancing both the quality and the efficiency of sampling. This adjustment enables DiffStyler to surpass traditional generation techniques without a corresponding increase in computational overhead.
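
One common family of such solvers (used, for example, by pseudo-numerical methods for diffusion models) reuses previous noise predictions in an Adams-Bashforth-style multistep update, so fewer reverse steps are needed for the same quality. The sketch below shows that generic idea only; the helper names and the `ddim_update` call are hypothetical and not taken from the DiffStyler codebase.

```python
from collections import deque

def multistep_eps(eps_history, eps_current):
    """Combine recent noise predictions with Adams-Bashforth-style weights.

    Reusing earlier predictions integrates the sampling trajectory more
    accurately per step, so fewer reverse steps are needed overall.
    Coefficients are the standard 1- to 4-step Adams-Bashforth weights.
    """
    hist = list(eps_history)
    if len(hist) == 0:
        return eps_current
    if len(hist) == 1:
        return (3 * eps_current - hist[-1]) / 2
    if len(hist) == 2:
        return (23 * eps_current - 16 * hist[-1] + 5 * hist[-2]) / 12
    return (55 * eps_current - 59 * hist[-1] + 37 * hist[-2] - 9 * hist[-3]) / 24

# Sampling loop sketch (ddim_update is a hypothetical single-step update):
# history = deque(maxlen=3)
# for t in reversed(timesteps):
#     eps = model(x_t, t, context=text_emb)
#     x_t = ddim_update(x_t, multistep_eps(history, eps), t)
#     history.append(eps)
```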

Results and Implications

Upon benchmarking against baseline methods such as GAN-based approaches, DiffStyler demonstrated superior performance both quantitatively and qualitatively. The reported metrics indicate a significant improvement in content retention and stylization accuracy. Such robust performance suggests that dual diffusion models, when guided by textual input, present a versatile and sophisticated alternative to existing style transfer frameworks.

Practical Implications

The ability of DiffStyler to interactively stylize images using text descriptions opens up broad applications in personalized digital artwork creation, media content creation, and aesthetics-driven design processes. This advancement in controllable stylization could redefine user engagement with digital art tools, providing an accessible interface for non-experts to articulate art styles through natural language.

Theoretical Insights and Future Directions

The theoretical implications of this research extend into the broader domain of generative models where cross-modal processing is essential. The dual diffusion strategy offers a promising blueprint for future explorations integrating text with image and video-based content generation. Future research trajectories could investigate the application of this framework to dynamic data, such as video frames, thereby addressing temporal consistency challenges in video stylization.

In conclusion, DiffStyler stands as a noteworthy contribution to the field of text-driven image processing, underscoring the potential of diffusion models in achieving nuanced image transformations guided by textual input. Its integration of learnable noise and refined numerical methods poses a compelling argument for the use of diffusion architectures in complex generative tasks, positioning it as a meaningful step towards more sophisticated, user-friendly digital artistry.
