Adding Conditional Control to Text-to-Image Diffusion Models (2302.05543v3)

Published 10 Feb 2023 in cs.CV, cs.AI, cs.GR, cs.HC, and cs.MM

Abstract: We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.

Authors (3)
  1. Lvmin Zhang (6 papers)
  2. Anyi Rao (28 papers)
  3. Maneesh Agrawala (42 papers)
Citations (2,975)

Summary

Adding Conditional Control to Text-to-Image Diffusion Models

The paper "Adding Conditional Control to Text-to-Image Diffusion Models" authored by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala presents an innovative framework termed ControlNet, designed to enhance the controllability and functionality of large, pretrained text-to-image diffusion models like Stable Diffusion. The core objective of ControlNet is to introduce spatial conditioning into these models, enabling more precise control over generated images by incorporating additional input images depicting specific spatial configurations or attributes (e.g., edge maps, human poses, segmentation maps).

Overview of ControlNet

ControlNet operates by leveraging the robust encoding layers of a large pretrained model while introducing trainable layers that accept the conditioning input, connected through "zero convolutions" (zero-initialized 1x1 convolution layers). This approach leaves the pretrained model intact while the new branch progressively adapts to the conditioning input, which stabilizes the initial stages of training and prevents the injection of harmful noise. The architecture thus performs a dual function: it retains the semantic richness and generalization capabilities of the pretrained model while adapting it to specific, condition-based image generation tasks.
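
The following minimal PyTorch sketch illustrates the zero-convolution idea; the ZeroConv2d class, channel width, and tensor shapes are illustrative assumptions rather than the authors' code. Because both weights and bias start at zero, the layer's output is exactly zero before training, so adding it to the frozen model's features initially changes nothing, while gradients can still flow and let the layer grow away from zero.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias are initialized to zero (illustrative)."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

x = torch.randn(1, 320, 64, 64)        # feature map from the frozen backbone (shape is arbitrary here)
control = torch.randn(1, 320, 64, 64)  # feature map from the trainable control branch
zero_conv = ZeroConv2d(320)
out = x + zero_conv(control)           # equals x at initialization: no harmful noise is injected
assert torch.allclose(out, x)
```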

Methodology

ControlNet's architecture clones the network blocks of the pretrained model into trainable copies and connects them to the locked originals with zero-initialized convolution layers, so the control branch cannot interfere with the pretrained model during the early training phases. A small encoder network maps each conditioning image into the latent feature space of the diffusion model, so the spatial details of the input condition are expressed in the representation the network expects. ControlNet is attached at multiple levels of the diffusion model's U-Net, ensuring that conditional information is integrated at various depths of the network; a simplified sketch of this wiring follows.
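
As a rough sketch of this structure, the snippet below wraps one block with a frozen original, a trainable copy, and two zero-initialized 1x1 convolutions; the ControlledBlock class, the toy convolutional block, and the channel sizes are hypothetical stand-ins for the Stable Diffusion U-Net encoder and middle blocks that the actual method copies.

```python
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """One pretrained block augmented with a ControlNet-style trainable copy (illustrative)."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(block)          # trainable clone, initialized from the pretrained weights
        self.frozen = block
        for p in self.frozen.parameters():        # lock the pretrained weights
            p.requires_grad_(False)
        self.zero_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for z in (self.zero_in, self.zero_out):   # zero-initialized connections
            nn.init.zeros_(z.weight)
            nn.init.zeros_(z.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # The frozen path is untouched; the control branch is an exact no-op until training moves it away from zero.
        return self.frozen(x) + self.zero_out(self.copy(x + self.zero_in(cond)))

block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU())   # stand-in for a U-Net block
layer = ControlledBlock(block, channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)    # conditioning features, assumed already encoded to 64 channels
y = layer(x, cond)                   # equals block(x) before any training step
```

In the full method, the outputs of these zero convolutions are added back into the locked U-Net (its decoder skip connections and middle block), which is how the conditional signal reaches the generation path.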

Experimental Results

The paper provides an extensive quantitative and qualitative evaluation of the efficacy of ControlNet. The experimental setups incorporate diverse conditioning inputs, including Canny edges, human poses, depth maps, and segmentation maps. The outcomes demonstrate that ControlNet can effectively manage these varied conditions, often leading to high-fidelity and semantically coherent image outputs. Training evaluations highlight its robustness against overfitting, even with limited datasets, a significant advantage given the typically smaller datasets available for highly specific conditions compared to the large-scale datasets used for pretraining models like Stable Diffusion.
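
As a concrete illustration of preparing one such condition, the snippet below converts an ordinary photograph into a Canny edge map that could serve as the spatial conditioning image; the file names and threshold values are assumptions for the example, not settings taken from the paper.

```python
import cv2
import numpy as np

image = cv2.imread("input.png")                 # hypothetical source image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)               # thresholds chosen as common defaults, not from the paper
edges_rgb = np.stack([edges] * 3, axis=-1)      # replicate to 3 channels to match an RGB conditioning input
cv2.imwrite("canny_condition.png", edges_rgb)
```

During training, such edge maps are paired with the original images and their captions, so the ControlNet learns to reproduce image content while respecting the given edge structure.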

In user studies, ControlNet is ranked significantly higher than baseline methods in both result quality and fidelity to the input conditions. Comparative analysis also places ControlNet favorably against models trained with extensive industrial resources, yielding competitive results with significantly less computation.

Implications and Future Directions

From a practical standpoint, ControlNet's framework enhances the usability of text-to-image diffusion models in applications requiring a high degree of specificity and user control, such as content creation, animation, and precise visual storytelling. The ability to fine-tune with diverse conditional images, without extensive retraining, enables broad adaptability: pretrained models can be leveraged across varied tasks without sacrificing performance.

Theoretically, this research underscores the potential of modular, fine-tuning-based approaches in advancing the capabilities of complex pretrained models. Future developments might explore further enhancements in the integration efficiency of the added controls, the extension to additional forms of conditioning data, and the application of ControlNet's principles in other generative model domains such as video generation or multimodal data synthesis.

In conclusion, the introduction of ControlNet marks a promising advancement in the controllability of diffusion-based image generation, blending the strengths of large-scale pretrained models with the varied requirements of specific tasks, thereby broadening the scope and utility of generative neural networks in both research and application contexts.
