DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization (2211.10682v2)
Abstract: Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed for transferring a natural image into a stylized one according to textual descriptions of the target style provided by the user. Unlike previous image-to-image transfer approaches, the text-guided stylization process offers users a more precise and intuitive way to express the desired style. However, the large discrepancy between cross-modal inputs and outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, a dual diffusion processing architecture that controls the balance between the content and style of the diffused results. Cross-modal style information can be easily integrated as guidance at each step of the diffusion process. Furthermore, we propose a learnable noise derived from the content image to initialize the reverse denoising process, enabling the stylization results to better preserve the structural information of the content image. Extensive qualitative and quantitative experiments demonstrate that the proposed DiffStyler outperforms baseline methods. Code is available at https://github.com/haha-lisa/Diffstyler.
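The abstract highlights two mechanisms: step-by-step cross-modal guidance that balances style against content during sampling, and a learnable noise derived from the content image that initializes the reverse denoising process. The sketch below is a minimal, hypothetical illustration of how such a guided reverse process could be wired together; `TinyDenoiser`, `clip_style_score`, `stylize`, and all hyperparameters are illustrative stand-ins, not the authors' implementation or the released code.

```python
# Hypothetical sketch of content-initialized, text-guided reverse diffusion.
# All module names and schedules are illustrative, not DiffStyler's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion U-Net that predicts the noise at each step."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # A real denoiser would also condition on a timestep embedding of t.
        return self.net(x)


def clip_style_score(x, text_prompt):
    """Placeholder for a CLIP-based text/style score (here a smooth dummy loss)."""
    return x.pow(2).mean()


def stylize(content, text_prompt, steps=50, style_weight=0.5, guidance_scale=0.1):
    # Learnable noise initialized from the content image, so the reverse
    # process starts close to the content structure.
    x = (content + 0.1 * torch.randn_like(content)).detach().requires_grad_(True)
    denoiser = TinyDenoiser(content.shape[1])

    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    for t in reversed(range(steps)):
        eps = denoiser(x, t)
        a_t = alphas_cumprod[t]
        # Estimate of the clean image implied by the predicted noise.
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()

        # Cross-modal guidance at every step: blend a text/style score with a
        # content-preservation term, then take its gradient w.r.t. the sample.
        loss = (style_weight * clip_style_score(x0_pred, text_prompt)
                + (1.0 - style_weight) * F.mse_loss(x0_pred, content))
        grad = torch.autograd.grad(loss, x)[0]

        # Simplified DDIM-style move to the previous timestep, shifted against
        # the guidance gradient (classifier-guidance flavour).
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps - guidance_scale * grad
        x = x.detach().requires_grad_(True)

    return x.detach()


if __name__ == "__main__":
    content = torch.rand(1, 3, 64, 64)  # dummy content image
    result = stylize(content, "an oil painting in the style of Van Gogh")
    print(result.shape)  # torch.Size([1, 3, 64, 64])
```

DiffStyler's actual guidance and noise initialization are more elaborate; the sketch only mirrors the high-level control flow the abstract describes.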
- L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
- X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1501–1510.
- Z. Ma, T. Lin, X. Li, F. Li, D. He, E. Ding, N. Wang, and X. Gao, “Dual-affinity style embedding network for semantic-aligned image style transfer,” IEEE Transactions on Neural Networks and Learning Systems, no. 10, pp. 7404–7417, 2023.
- S. Jiang, J. Li, and Y. Fu, “Deep learning for fashion style generation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 9, pp. 4538–4550, 2021.
- X. Kong, Y. Deng, F. Tang, W. Dong, C. Ma, Y. Chen, Z. He, and C. Xu, “Exploring the temporal consistency of arbitrary style transfer: A channelwise perspective,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
- C. Liang, M. Zhu, N. Wang, H. Yang, and X. Gao, “PMSGAN: Parallel multistage gans for face image translation,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2023.
- T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
- H. Wang, Y. Li, Y. Wang, H. Hu, and M.-H. Yang, “Collaborative distillation for ultra-resolution universal style transfer,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1860–1869.
- O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2085–2094.
- R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, “StyleGAN-NADA: CLIP-guided domain adaptation of image generators,” ACM Transactions on Graphics, pp. 141:1–141:13, 2022.
- G. Kwon and J. C. Ye, “Clipstyler: Image style transfer with a single text condition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18062–18071.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, pp. 139–144, 2020.
- P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” in Advances Neural Information Processing Systems (NeurIPS), 2021, pp. 8780–8794.
- A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
- A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
- N. Huang, F. Tang, W. Dong, and C. Xu, “Draw your art dream: Diverse digital art synthesis with multimodal guided diffusion,” in ACM International Conference on Multimedia (ACM MM), 2022, pp. 1085–1094.
- X. Wu, “Creative painting with latent diffusion models,” arXiv preprint arXiv:2209.14697, 2022.
- R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
- T.-J. Fu, X. E. Wang, and W. Y. Wang, “Language-driven artistic style transfer,” in European Conference on Computer Vision (ECCV), 2022, pp. 717–734.
- J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, “Artflow: Unbiased image style transfer via reversible neural flows,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 862–871.
- Y. Deng, F. Tang, W. Dong, C. Ma, X. Pan, L. Wang, and C. Xu, “StyTr$^2$: Image style transfer with transformers,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11326–11336.
- Y.-S. Liao and C.-R. Huang, “Semantic context-aware image style transfer,” IEEE Transactions on Image Processing, vol. 31, pp. 1911–1923, 2022.
- “Disco Diffusion,” https://github.com/alembics/disco-diffusion.
- N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22500–22510.
- J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), 2016, pp. 694–711.
- Y. Zhang, F. Tang, W. Dong, H. Huang, C. Ma, T.-Y. Lee, and C. Xu, “Domain enhanced arbitrary image style transfer via contrastive learning,” in SIGGRAPH (Conference Paper Track), 2022, pp. 12:1–12:8.
- N. Huang, F. Tang, W. Dong, T.-Y. Lee, and C. Xu, “Region-aware diffusion for zero-shot text-driven image editing,” arXiv preprint arXiv:2302.11797, 2023.
- Y. Zhang, W. Dong, F. Tang, N. Huang, H. Huang, C. Ma, T.-Y. Lee, O. Deussen, and C. Xu, “ProSpect: Prompt spectrum for attribute-aware personalization of diffusion models,” ACM Transactions on Graphics, vol. 42, no. 6, pp. 244:1–244:14, 2023.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), 2021, pp. 8748–8763.
- J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning (ICML), 2015, pp. 2256–2265.
- Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 11895–11907.
- J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems (NeurIPS), pp. 6840–6851, 2020.
- G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 2426–2435.
- C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 36479–36494, 2022.
- A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning (ICML), 2021, pp. 8162–8171.
- J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations (ICLR), 2021.
- L. Liu, Y. Ren, Z. Lin, and Z. Zhao, “Pseudo numerical methods for diffusion models on manifolds,” in International Conference on Learning Representations (ICLR), 2022.
- E. Chu, S.-Y. Lin, and J.-C. Chen, “Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models,” arXiv preprint arXiv:2305.19193, 2023.
- W. Huang, S. Tu, and L. Xu, “Pfb-diff: Progressive feature blending diffusion for text-driven image editing,” arXiv preprint arXiv:2306.16894, 2023.
- Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan, “Videofusion: Decomposed diffusion models for high-quality video generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10209–10218.
- S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3558–3568.
- K. Crowson, “v-diffusion-pytorch,” https://github.com/crowsonkb/v-diffusion-pytorch.
- WikiArt volunteer team, “WikiArt dataset,” https://www.wikiart.org/.
- T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for unpaired image-to-image translation,” in European Conference on Computer Vision (ECCV), 2020, pp. 319–345.
- K. Crowson, “Simulacra aesthetic models,” https://github.com/crowsonkb/simulacra-aesthetic-models.
- J. D. Pressman, “Simulacra aesthetic captions,” https://github.com/JD-P/simulacra-aesthetic-captions.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
- O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
- S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 586–595.
- S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, “Adaattn: Revisit attention mechanism in arbitrary neural style transfer,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6649–6658.
- Y. Deng, F. Tang, W. Dong, H. Huang, C. Ma, and C. Xu, “Arbitrary video style transfer via multi-channel correlation,” in AAAI Conference on Artificial Intelligence (AAAI), 2021, pp. 1210–1217.
- L. Wen, C. Gao, and C. Zou, “Cap-vstnet: Content affinity preserved versatile style transfer,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18300–18309.
- Y. Zhang, F. Tang, W. Dong, H. Huang, C. Ma, T.-Y. Lee, and C. Xu, “A unified arbitrary style transfer framework via adaptive contrastive learning,” ACM Transactions on Graphics, vol. 42, no. 5, pp. 169:1–169:14, 2023.
- R. Beaumont et al., “CLIP retrieval system,” https://github.com/rom1504/clip-retrieval.
- Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
Authors: Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Yong Zhang, Weiming Dong, Changsheng Xu