CCM: Adding Conditional Controls to Text-to-Image Consistency Models (2312.06971v1)

Published 12 Dec 2023 in cs.CV

Abstract: Consistency Models (CMs) have shown promise in creating visual content efficiently and with high quality. However, how to add new conditional controls to pretrained CMs has not been explored. In this technical report, we consider alternative strategies for adding ControlNet-like conditional control to CMs and present three significant findings. 1) ControlNet trained for diffusion models (DMs) can be directly applied to CMs for high-level semantic control but struggles with low-level detail and realism control. 2) CMs serve as an independent class of generative models, based on which a ControlNet can be trained from scratch using the Consistency Training proposed by Song et al. 3) A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training, allowing for the swift transfer of DM-based ControlNets to CMs. We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image, and masked image, with text-to-image latent consistency models.

References (30)
  1. John Canny. A computational approach to edge detection. IEEE TPAMI, 1986.
  2. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  3. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023.
  4. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  5. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
  6. Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488, 2022.
  7. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.
  8. Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891, 2023.
  9. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023a.
  10. Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125, 2023b.
  11. Cones 2: Customizable image synthesis with multiple subjects. arXiv preprint arXiv:2305.19327, 2023c.
  12. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023a.
  13. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023b.
  14. On distillation of guided diffusion models. In CVPR, 2023.
  15. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
  16. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, 2020.
  17. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  18. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  19. Imagenet large scale visual recognition challenge. IJCV, 2015.
  20. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.
  21. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
  22. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  23. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
  24. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  25. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  26. Pixel difference networks for efficient edge detection. In ICCV, 2021.
  27. Holistically-nested edge detection. In ICCV, 2015.
  28. Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257, 2023.
  29. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  30. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023.
Authors (8)
  1. Jie Xiao (89 papers)
  2. Kai Zhu (93 papers)
  3. Han Zhang (338 papers)
  4. Zhiheng Liu (22 papers)
  5. Yujun Shen (111 papers)
  6. Yu Liu (786 papers)
  7. Xueyang Fu (29 papers)
  8. Zheng-Jun Zha (144 papers)
Citations (9)

Summary

  • The paper demonstrates that integrating ControlNets via consistency training outperforms DM-based methods in managing fine image details.
  • The methodology compares applying an existing ControlNet, training one from scratch, and using a lightweight adapter to merge conditional controls into consistency models.
  • Experiments across edge, depth, human pose, low-resolution image, and masked image conditions show that the unified adapter markedly improves generation quality when transferring DM-based ControlNets to CMs.

Introduction

Consistency Models (CMs) are gaining attention for their ability to generate high-quality images efficiently. Despite these advances, integrating new conditional controls into pretrained CMs remains unexplored. This technical report examines three strategies for equipping CMs with ControlNet, a control mechanism originally developed for diffusion models (DMs), and evaluates their effectiveness across different visual conditions.

Methodology

The approach begins with a baseline text-to-image CM, obtained either by consistency distillation from a DM or by consistency training directly on data. The first strategy applies an existing ControlNet, optimized for DMs, to the CM without modification. The second strategy trains a ControlNet from scratch for the CM using consistency training. The third strategy introduces a lightweight adapter, jointly optimized under multiple conditions, that transplants DM-based ControlNets into the CM.
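
The sketch below illustrates the second strategy only in spirit: a ControlNet branch trained for a frozen consistency model with a consistency-training objective in the style of Song et al. It is a minimal sketch, assuming a frozen latent CM `cm` that accepts a `control` input, a `controlnet` branch, and a discrete sigma schedule; these names and interfaces are placeholders for illustration, not code from the paper.

```python
import torch


def pseudo_huber(a, b, c=0.03):
    # Pseudo-Huber distance, a robust alternative to L2 suggested for
    # improved consistency training.
    return (torch.sqrt((a - b).pow(2).sum(dim=(1, 2, 3)) + c ** 2) - c).mean()


def consistency_train_step(cm, controlnet, controlnet_ema, batch, sigmas, opt):
    """One consistency-training step: the base CM stays frozen and only the
    ControlNet branch (whose features are injected into the CM) is updated."""
    x0, cond, text_emb = batch                 # clean latents, control image, prompt embedding
    n = torch.randint(0, len(sigmas) - 1, (x0.shape[0],), device=x0.device)
    s_cur, s_next = sigmas[n], sigmas[n + 1]   # adjacent noise levels
    z = torch.randn_like(x0)                   # shared Gaussian noise for both points

    # Student: consistency prediction at the higher noise level.
    x_next = x0 + s_next.view(-1, 1, 1, 1) * z
    pred = cm(x_next, s_next, text_emb,
              control=controlnet(x_next, s_next, cond, text_emb))

    # Target: EMA ControlNet at the adjacent lower noise level, no gradients.
    with torch.no_grad():
        x_cur = x0 + s_cur.view(-1, 1, 1, 1) * z
        target = cm(x_cur, s_cur, text_emb,
                    control=controlnet_ema(x_cur, s_cur, cond, text_emb))

    loss = pseudo_huber(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                      # EMA update of the target ControlNet
        for p_ema, p in zip(controlnet_ema.parameters(), controlnet.parameters()):
            p_ema.lerp_(p, 1.0 - 0.999)
    return loss.item()
```

The optimizer `opt` would hold only the ControlNet parameters, so the frozen CM and the EMA copy remain untouched by gradient updates.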

Experimental Setup

The strategies were assessed on a variety of visual conditions: edge, depth, human pose, low-resolution image, and masked image, each obtained with a standard extractor or detector (e.g., Canny edge detection, monocular depth estimation, and pose estimation). Training the foundational CM, the ControlNets, and the unified adapter required substantial GPU compute.
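
As an illustration (not the paper's released preprocessing), some of these conditions can be produced with standard tools; the Canny thresholds and the 8x downsampling factor below are arbitrary choices, and depth and pose would come from pretrained estimators instead.

```python
import cv2
import numpy as np


def make_conditions(image_bgr: np.ndarray, mask: np.ndarray):
    """Build three of the studied control signals from one image.
    Depth and human pose are omitted; they require pretrained estimators
    (a monocular depth network and a pose detector, respectively)."""
    # Edge condition: Canny edge map computed on the grayscale image.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)

    # Low-resolution condition: 8x downsample, then resize back to full size.
    h, w = image_bgr.shape[:2]
    low = cv2.resize(image_bgr, (w // 8, h // 8), interpolation=cv2.INTER_AREA)
    low = cv2.resize(low, (w, h), interpolation=cv2.INTER_NEAREST)

    # Masked-image condition: zero out the region marked by `mask` (values in {0, 1}).
    masked = image_bgr * (1 - mask[..., None])

    return edges, low, masked
```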

Findings and Conclusion

The experiments revealed that while DM-based ControlNets can endow CMs with high-level semantic control, they often struggle to manage low-level details and realism. Conversely, ControlNets trained for CMs via consistency training showed superior conditional image generation. Transferring DM-based ControlNets to CMs also improved notably with the unified adapter, which achieved better visual outcomes. These results illustrate the value of tailored training strategies for integrating conditional controls into CMs and point to a methodical path forward for controllable image generation.
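
The report's summary here does not spell out the adapter architecture, so the following is only one plausible reading, offered as a hypothetical sketch: frozen, DM-trained ControlNets keep producing multi-scale control features, and a small shared module re-projects them for the frozen CM; only this module would be trained, reusing a consistency loss such as the one sketched above. The class name and per-level 1x1 projection are assumptions, not the authors' design.

```python
import torch.nn as nn


class UnifiedAdapter(nn.Module):
    """Hypothetical lightweight adapter: one 1x1 convolution per feature level,
    shared across all condition types (edge, depth, pose, ...)."""

    def __init__(self, channels_per_level):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=1) for c in channels_per_level]
        )

    def forward(self, control_feats):
        # `control_feats`: multi-scale features from a frozen DM ControlNet.
        return [p(f) for p, f in zip(self.proj, control_feats)]
```

Under this reading, training would sample a condition type per batch, run its frozen ControlNet, pass the features through the shared adapter, and feed them to the CM, with only the adapter's parameters in the optimizer.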


GitHub

  1. CCM