Artist: Aesthetically Controllable Text-Driven Stylization without Training (2407.15842v1)

Published 22 Jul 2024 in cs.CV and cs.GR

Abstract: Diffusion models entangle content and style generation during the denoising process, leading to undesired content modification when directly applied to stylization tasks. Existing methods struggle to effectively control the diffusion model to meet the aesthetic-level requirements for stylization. In this paper, we introduce Artist, a training-free approach that aesthetically controls the content and style generation of a pretrained diffusion model for text-driven stylization. Our key insight is to disentangle the denoising of content and style into separate diffusion processes while sharing information between them. We propose simple yet effective content and style control methods that suppress style-irrelevant content generation, resulting in harmonious stylization results. Extensive experiments demonstrate that our method excels at achieving aesthetic-level stylization requirements, preserving intricate details in the content image and aligning well with the style prompt. Furthermore, we showcase the high controllability of the stylization strength from various perspectives. Code will be released; project home page: https://DiffusionArtist.github.io

An Analytical Overview of "Artist: Aesthetically Controllable Text-Driven Stylization without Training"

The paper "Artist: Aesthetically Controllable Text-Driven Stylization without Training" by Ruixiang Jiang and Changwen Chen introduces an innovative approach to text-driven image stylization using diffusion models, without involving additional training phases. This introduces a novel paradigm where aesthetically fine-grained control over content and style generation is achieved by disentangling these elements into separate but integrated processes.

Key Insights and Methodologies

Diffusion models, known for their strong generative capabilities, often intertwine content and style generation, leading to unwanted content alterations. The primary objective of this work is to disentangle these processes to ensure that style generation does not compromise the integrity of the original content. The authors achieve this by introducing Artist, a method that leverages pretrained diffusion models with auxiliary branches for content and style control.

Content and Style Disentanglement

Central to this approach is the separation of content and style denoising into distinct diffusion trajectories, coordinated through two auxiliary branches:

  1. Content Delegation: This branch is responsible for preserving the original content structure during the denoising process. The main branch is controlled by injecting hidden features from the content delegation, thus ensuring that crucial content details are maintained.
  2. Style Delegation: This branch generates the desired stylization according to the text prompt. Its style guidance is injected into the main branch through adaptive instance normalization (AdaIN), which aligns the main branch's feature statistics with those of the style branch (a minimal AdaIN sketch follows this list).
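
AdaIN, the operation referenced in the style delegation above, rescales one set of feature maps so that their channel-wise mean and standard deviation match another's. Below is a minimal PyTorch sketch of the operation itself, not the paper's code; per the description above, Artist applies it to hidden features inside the diffusion model rather than to encoder features of an image.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization over (N, C, H, W) feature maps.

    Normalizes the content features per channel, then rescales them with the
    style features' channel-wise mean and standard deviation.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean
```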

The researchers introduced the concept of content-to-style (C2S) injection, which ensures that style-related denoising is contextually aware of the content, leading to a more harmonious integration of style into the content.
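
To make the information flow concrete, here is a toy, runnable sketch of the three-branch structure described above. Every function and tensor is an illustrative stand-in (the real method runs a pretrained diffusion U-Net on latent codes and injects features at specific layers); the sketch only shows how the content, style, and main trajectories can exchange features at each denoising step.

```python
import torch

def toy_denoise(z: torch.Tensor, cond: torch.Tensor):
    """Stand-in for one U-Net denoising step on a latent z.

    Returns the updated latent and a tensor standing in for intermediate
    hidden features. Purely illustrative.
    """
    hidden = torch.tanh(z) * cond
    return 0.9 * z + 0.1 * cond, hidden

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align x's global statistics with y's (simplified, per-tensor)."""
    return (x - x.mean()) / (x.std() + eps) * y.std() + y.mean()

# Three synchronized trajectories that exchange information at every step.
z_content = torch.randn(1, 4, 8, 8)         # content-delegation latent
z_style = torch.randn(1, 4, 8, 8)           # style-delegation latent
z_main = z_content.clone()                  # main (output) latent
content_cond = torch.ones(1, 4, 8, 8)       # stand-in for content conditioning
style_cond = torch.full((1, 4, 8, 8), 0.5)  # stand-in for style-prompt conditioning

for t in range(50, 0, -1):
    # Content delegation: reconstructs the content structure.
    z_content, c_feat = toy_denoise(z_content, content_cond)
    # Style delegation with C2S injection: the style branch sees content
    # features, keeping the generated style contextually aware of the content.
    z_style, s_feat = toy_denoise(z_style + 0.1 * c_feat, style_cond)
    # Main branch: injected content features preserve structure, and AdaIN
    # aligns its statistics with the style branch to apply the stylization.
    z_main, _ = toy_denoise(z_main + 0.1 * c_feat, style_cond)
    z_main = adain(z_main, s_feat)
```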

Control Mechanisms

Artist allows for aesthetic-level control over the stylization process by tuning which layers receive the injected features and by leveraging large vision-language models (VLMs) to assess alignment with human aesthetic preferences. Experiments highlight the method's ability to balance stylization strength against content preservation while maintaining fine-grained controllability.
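
The layer-selection interface is not spelled out in this overview, so the snippet below is only a hypothetical illustration of the kind of knobs such control exposes: choosing which U-Net blocks receive injected features and blending injected features with the originals to modulate stylization strength. All names and values are assumptions, not the paper's API.

```python
# Hypothetical control knobs; block names, the split between content- and
# style-injection sites, and the blending scheme are illustrative assumptions.
injection_config = {
    "content_injection_blocks": ["down_1", "down_2", "mid"],  # more blocks -> stronger content preservation
    "style_injection_blocks": ["up_1", "up_2", "up_3"],       # more blocks -> stronger stylization
    "style_strength": 0.7,  # 0.0 keeps the original features, 1.0 uses fully injected features
}

def blend(original, injected, strength=injection_config["style_strength"]):
    """Linear interpolation used to dial injection strength up or down."""
    return (1.0 - strength) * original + strength * injected
```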

Experimental Evaluation

The authors conducted extensive qualitative and quantitative evaluations. Noteworthy findings include:

  • Qualitative Results: The method produced high-quality stylizations across diverse styles, retaining intricate details of the original content while embedding strong stylistic features.
  • Quantitative Results: The paper introduced novel aesthetic-level metrics using VLMs to evaluate the outputs, considering not just perceptual similarity and prompt alignment, but also aesthetic quality. Artist consistently outperformed existing methods across these new metrics.

Metrics like LPIPS, CLIP Alignment, and newly proposed VLM-based metrics (e.g., Content-Aware Style Alignment and Style-Aware Content Alignment) demonstrated that Artist yields superior content preservation and style alignment compared to other state-of-the-art methods.
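
The LPIPS and CLIP-based parts of such an evaluation can be reproduced with public packages; the sketch below uses the `lpips` and OpenAI `clip` libraries with placeholder images and a placeholder prompt, and does not reproduce the paper's VLM-based aesthetic metrics.

```python
import torch
import clip    # pip install git+https://github.com/openai/CLIP.git
import lpips   # pip install lpips
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# LPIPS: perceptual distance between the content image and the stylized result
# (lower = better content preservation). Inputs are NCHW tensors in [-1, 1].
lpips_fn = lpips.LPIPS(net="alex").to(device)
content = torch.rand(1, 3, 512, 512, device=device) * 2 - 1    # placeholder tensors
stylized = torch.rand(1, 3, 512, 512, device=device) * 2 - 1
print("LPIPS:", lpips_fn(content, stylized).item())

# CLIP alignment: cosine similarity between the stylized image and the style
# prompt (higher = better prompt alignment). Filename and prompt are placeholders.
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("stylized.png")).unsqueeze(0).to(device)
text = clip.tokenize(["an oil painting in the style of Van Gogh"]).to(device)
with torch.no_grad():
    score = torch.cosine_similarity(model.encode_image(image), model.encode_text(text))
print("CLIP alignment:", score.item())
```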

Implications and Future Directions

The proposed approach and findings pose significant implications for the field of generative AI and neural stylization:

  • Practical Applications: The ability to control stylization strength and content preservation without additional training makes Artist highly practical for real-world applications in digital art, media production, and personalized content creation.
  • Theoretical Advancements: The use of auxiliary branches for disentangled control introduces a new dimension in the understanding and application of diffusion models. This method could inspire further research into the modular control of other generative processes.
  • Future Developments: Integrating human preference signals more deeply into the diffusion model’s training loop could enhance aesthetic alignment even further, narrowing the gap between generated content and human artistic preferences.

Conclusion

The work "Artist" by Jiang and Chen sets a new benchmark in the field of text-driven image stylization. It underscores the potential inherent in diffusion models to generate aesthetically coherent stylizations by leveraging disentangled auxiliary processes. This research not only advances the theoretical framework of neural stylization but also offers practical tools for artists and creators seeking to harness AI in crafting visually compelling content. As the field progresses, the methodologies and insights introduced by this paper will likely serve as foundational elements for subsequent innovations in AI-driven artistic creation.
