Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model (2404.09967v2)
Abstract: ControlNets are widely used to add spatial control to text-to-image diffusion models under diverse conditions, such as depth maps, scribbles/sketches, and human poses. However, for controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature-space mismatches, and training ControlNets from scratch for new backbones is a significant burden for many users. Furthermore, applying ControlNets independently to different frames fails to maintain temporal consistency of objects. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets. Ctrl-Adapter offers strong and diverse capabilities: image and video control, sparse-frame video control, fine-grained patch-level multi-condition control (via an MoE router), zero-shot adaptation to unseen conditions, and support for downstream tasks beyond spatial control, including video editing, video style transfer, and text-guided motion control. Across six diverse U-Net/DiT-based image/video diffusion models (SDXL, PixArt-$\alpha$, I2VGen-XL, SVD, Latte, Hotshot-XL), Ctrl-Adapter matches the performance of pretrained ControlNets on COCO and achieves state-of-the-art performance on DAVIS 2017 with significantly lower computational cost (< 10 GPU hours).
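The two mechanisms the abstract names, adapting frozen ControlNet features into a new backbone's feature space and patch-level multi-condition mixing via an MoE router, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the released implementation: the module names (`CtrlAdapterBlock`, `PatchMoERouter`), the 3×3/1×1 convolution choices, the zero initialization, and all shapes are hypothetical, and the temporal modules Ctrl-Adapter uses for video backbones are omitted.

```python
# Illustrative sketch only (hypothetical names and shapes, not the paper's code):
# (1) CtrlAdapterBlock maps a frozen ControlNet's feature map into the new
#     backbone's channel space and adds it residually;
# (2) PatchMoERouter computes per-patch softmax weights over N control
#     conditions and mixes their features before injection.
import torch
import torch.nn as nn


class CtrlAdapterBlock(nn.Module):
    """Project one ControlNet feature map into the backbone's feature space."""

    def __init__(self, ctrl_channels: int, backbone_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(ctrl_channels, backbone_channels, kernel_size=3, padding=1)
        # Zero-init so training starts from the unmodified backbone
        # (the zero-convolution trick popularized by ControlNet).
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, backbone_feat: torch.Tensor, ctrl_feat: torch.Tensor) -> torch.Tensor:
        # backbone_feat: (B, C_bb, H, W); ctrl_feat: (B, C_ctrl, H, W)
        return backbone_feat + self.proj(ctrl_feat)


class PatchMoERouter(nn.Module):
    """Per-patch softmax gating over N control conditions (depth, pose, ...)."""

    def __init__(self, ctrl_channels: int, num_conditions: int):
        super().__init__()
        self.gate = nn.Conv2d(ctrl_channels * num_conditions, num_conditions, kernel_size=1)

    def forward(self, ctrl_feats: list[torch.Tensor]) -> torch.Tensor:
        stacked = torch.cat(ctrl_feats, dim=1)        # (B, N*C, H, W)
        weights = self.gate(stacked).softmax(dim=1)   # (B, N, H, W): one weight per patch
        # Weighted sum of the N condition features at every spatial location.
        return sum(w.unsqueeze(1) * f for w, f in zip(weights.unbind(dim=1), ctrl_feats))


# Toy usage: mix depth and pose features, then inject into a backbone feature map.
router = PatchMoERouter(ctrl_channels=320, num_conditions=2)
adapter = CtrlAdapterBlock(ctrl_channels=320, backbone_channels=640)
depth_f, pose_f = torch.randn(1, 320, 32, 32), torch.randn(1, 320, 32, 32)
backbone_f = torch.randn(1, 640, 32, 32)
out = adapter(backbone_f, router([depth_f, pose_f]))  # (1, 640, 32, 32)
```

Since only the adapter and router would be trained, with the backbone and ControlNets kept frozen, the trainable parameter count stays small, which is consistent with the abstract's < 10 GPU-hour training figure.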
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). ACM, Apr. 2024.
- SpaText: Spatio-Textual Representation for Controllable Image Generation. In CVPR, 2023.
- Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
- Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, pages 7291–7299, 2017.
- VideoCrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.
- Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In CVPR, 2024.
- Control-A-Video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023.
- Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
- Scaling rectified flow transformers for high-resolution image synthesis, 2024.
- G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis (SCIA), pages 363–370, 2003.
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In ECCV, 2022.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In ICLR, 2023.
- Emu Video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
- Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- SparseCtrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023.
- AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR, 2024.
- Photorealistic video generation with diffusion models, 2023.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Latent video diffusion models for high-fidelity long video generation, 2022.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.
- Parameter-efficient transfer learning for NLP. In ICML, 2019.
- Z. Hu and D. Xu. VideoControlNet: A motion-guided video-to-video translation framework by using diffusion model with ControlNet. arXiv preprint arXiv:2307.14073, 2023.
- Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In ICCV, 2023.
- DiffBlender: Scalable and composable multimodal text-to-image diffusion models. arXiv preprint arXiv:2305.15194, 2023.
- D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In NeurIPS, 2023.
- StoryGAN: A sequential conditional GAN for story visualization. In CVPR, 2019.
- GLIGEN: Open-set grounded text-to-image generation. In CVPR, pages 22511–22521, 2023.
- Video generation from text. In AAAI, 2018.
- VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. arXiv preprint arXiv:2309.15091, 2023.
- Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
- VideoDrafter: Content-consistent multi-scene video generation with LLM. arXiv preprint arXiv:2401.01256, 2024.
- SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
- Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis, 2024.
- T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024.
- Hotshot-XL. https://github.com/hotshotco/Hotshot-XL, Oct. 2023.
- OpenAI. Video generation models as world simulators, 2024.
- On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, 2022.
- SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
- The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- UniControl: A unified diffusion model for controllable visual generation in the wild. In NeurIPS, 2023.
- Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.
- X-Adapter: Adding universal compatibility of plugins for upgraded diffusion model. In CVPR, 2024.
- Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
- A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, pages 4161–4170, 2017.
- High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
- C. Rowles. Stable Video Diffusion Temporal Controlnet. https://github.com/CiaraStrawberry/svd-temporal-controlnet, 2023.
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In CVPR, 2023.
- S. Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning, 2022.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In NeurIPS, 2022.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer. In ICLR, 2017.
- Make-A-Video: Text-to-video generation without text-video data. In ICLR, 2023.
- StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In CVPR, 2022.
- Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
- Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
- Gen-L-Video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023.
- ModelScope text-to-video technical report, 2023.
- CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
- VideoComposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
- Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
- DynamiCrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
- ReCo: Region-controlled text-to-image generation. In CVPR, 2023.
- Y.-L. Sung, J. Cho, and M. Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In CVPR, 2022.
- NUWA-XL: Diffusion over diffusion for eXtremely long video generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1309–1320, Toronto, Canada, July 2023. Association for Computational Linguistics.
- Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
- I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023.
- ControlVideo: Training-free controllable text-to-video generation. In ICLR, 2024.
- Learning to forecast and refine residual motion for image-to-video generation. In ECCV, 2018.
- Uni-ControlNet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Authors: Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal