DreamWalk: Style Space Exploration using Diffusion Guidance (2404.03145v1)

Published 4 Apr 2024 in cs.CV

Abstract: Text-conditioned diffusion models can generate impressive images, but fall short when it comes to fine-grained control. Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering," constructing special text sentences to control the style or amount of a particular subject present in the output image. Our goal is to provide fine-grained control over the style and substance specified by the prompt, for example to adjust the intensity of styles in different regions of the image (Figure 1). Our approach is to decompose the text prompt into conceptual elements, and apply a separate guidance term for each element in a single diffusion process. We introduce guidance scale functions to control when in the diffusion process and where in the image to intervene. Since the method is based solely on adjusting diffusion guidance, it does not require fine-tuning or manipulating the internal layers of the diffusion model's neural network, and can be used in conjunction with LoRA- or DreamBooth-trained models (Figure 2). Project page: https://mshu1.github.io/dreamwalk.github.io/
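
The mechanism the abstract describes lends itself to a short sketch: run one classifier-free-guidance term per prompt element within a single denoising step, with a per-element scale function deciding when (timestep) and where (spatial mask) that element's guidance applies. The snippet below is a minimal sketch under that reading, not the authors' implementation; the `guided_noise` helper, the stand-in denoiser, and the particular scale schedules are all illustrative assumptions.

```python
import torch

def guided_noise(eps_fn, x_t, t, uncond, element_conds, scale_fns):
    """Multi-term classifier-free guidance: one term per prompt element.

    eps_fn(x_t, t, cond) -> predicted noise; stands in for any diffusion UNet.
    element_conds: one conditioning embedding per conceptual element
        (e.g. the subject and the style, encoded as separate sub-prompts).
    scale_fns: callables s_i(t, x_t) -> scalar or per-pixel map, controlling
        when in the diffusion process and where in the image each element's
        guidance intervenes.
    """
    e_uncond = eps_fn(x_t, t, uncond)
    e = e_uncond
    for cond, s in zip(element_conds, scale_fns):
        e_cond = eps_fn(x_t, t, cond)
        # A constant scalar scale recovers standard classifier-free guidance
        # for that element; a spatial mask varies its intensity by region.
        e = e + s(t, x_t) * (e_cond - e_uncond)
    return e

# Toy demo with a stand-in denoiser (conditioning just shifts the prediction).
dummy_eps = lambda x, t, c: 0.1 * x + c
x_t = torch.randn(1, 4, 64, 64)

style_mask = torch.zeros(1, 1, 64, 64)
style_mask[..., 32:] = 1.0  # apply the style element only on the right half

scales = [
    lambda t, x: 7.5,                            # subject: constant scale
    lambda t, x: 5.0 * style_mask * (t / 1000),  # style: ramps up late, right half only
]
eps = guided_noise(dummy_eps, x_t, t=800,
                   uncond=torch.zeros(1),
                   element_conds=[torch.ones(1), 2 * torch.ones(1)],
                   scale_fns=scales)
```

With a single element and a constant scalar scale, this collapses to standard classifier-free guidance (Ho & Salimans), which is consistent with the abstract's claim that the method needs no fine-tuning and no access to the network's internal layers: everything happens in how the noise predictions are combined.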
