StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding (2404.05256v2)

Published 8 Apr 2024 in cs.CV and cs.AI

Abstract: Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to create visual images from natural language prompts. However, existing methods like DreamBooth struggle with capturing arbitrary art styles due to the abstract and multifaceted nature of stylistic attributes. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis across diverse artistic styles. Using approximately 15 to 20 images of the target style, Single-StyleForge establishes a foundational binding of a unique token identifier with a broad range of attributes of the target style. Additionally, auxiliary images are incorporated for dual binding that guides the consistent representation of crucial elements such as people within the target style. Furthermore, we present Multi-StyleForge, which enhances image quality and text alignment by binding multiple tokens to partial style attributes. Experimental evaluations across six distinct artistic styles demonstrate significant improvements in image quality and perceptual fidelity, as measured by FID, KID, and CLIP scores.
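
The dual-binding recipe described in the abstract maps onto a DreamBooth-style fine-tuning loop: the standard latent-diffusion denoising loss is computed over two batches, one of ~15-20 target-style images bound to a unique token identifier, and one of auxiliary images (e.g., of people) that anchor how key subjects render in that style. The sketch below is a minimal, assumed reading of that objective using Hugging Face diffusers; the prompt templates, the `aux_weight` balance, and the function names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a dual-binding fine-tuning step (an assumed reading of the
# paper's objective, not its exact implementation): a DreamBooth-style
# denoising loss on target-style images, plus an auxiliary loss that binds
# people to the same unique style token.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base model
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Hypothetical prompt templates; "sks" is a rare token used as the unique
# identifier, following common DreamBooth practice.
STYLE_PROMPT = "a painting in sks style"
AUX_PROMPT = "a photo of a person in sks style"

def embed(prompt: str, batch_size: int) -> torch.Tensor:
    """Encode a prompt into CLIP text embeddings for conditioning."""
    ids = tokenizer([prompt] * batch_size, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt").input_ids
    return text_encoder(ids)[0]

def denoising_loss(pixels: torch.Tensor, prompt: str) -> torch.Tensor:
    """Standard latent-diffusion training loss for one (image batch, prompt).
    `pixels` is a [B, 3, 512, 512] tensor normalized to [-1, 1]."""
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=embed(prompt, latents.shape[0])).sample
    return F.mse_loss(pred, noise)

def dual_binding_step(style_pixels, aux_pixels, aux_weight=1.0):
    # Bind the style token to the target-style images, and simultaneously to
    # the auxiliary images so that crucial elements (people) render
    # consistently in the target style. `aux_weight` is an assumed knob.
    return (denoising_loss(style_pixels, STYLE_PROMPT)
            + aux_weight * denoising_loss(aux_pixels, AUX_PROMPT))
```

Multi-StyleForge, as the abstract describes it, would instead split the style into partial attributes and bind each to its own token (e.g., one token-prompt pair for backgrounds, another for people); the same per-pair loss applies, with the split intended to improve text alignment.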

Authors (3)
  1. Junseo Park (2 papers)
  2. Beomseok Ko (2 papers)
  3. Hyeryung Jang (24 papers)
Citations (1)