I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models (2312.16693v4)

Published 27 Dec 2023 in cs.CV

Abstract: Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models either by concatenating the image with noised video frames channel-wise before they are fed into the model, or by injecting the image embedding produced by pretrained image encoders into cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, which restricts the model's compatibility within open-source communities and disrupts the model's prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome these limitations. I2V-Adapter propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter introduces only a few trainable parameters, significantly reducing the training cost and ensuring compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
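
The cross-frame attention mechanism can be pictured with a short sketch. Below is a minimal PyTorch illustration of the idea as the abstract describes it, not the authors' implementation: every noised frame queries the tokens of the unnoised first frame, and only the adapter's projections are trainable. The zero-initialized output projection is a common adapter trick and an assumption here, as are all class and parameter names.

```python
# Minimal PyTorch sketch of the cross-frame attention idea. Names
# (CrossFrameAttentionAdapter, to_q, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameAttentionAdapter(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Zero-initialized output projection: at the start of training the
        # adapter contributes nothing, so the frozen T2V model is unchanged.
        self.to_out = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.to_out.weight)

    def forward(self, frame_tokens: torch.Tensor,
                first_frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens:       (batch * frames, tokens, dim), noised frames
        # first_frame_tokens: (batch * frames, tokens, dim), the unnoised
        #                     first frame broadcast to every frame position
        b, n, d = frame_tokens.shape
        h, hd = self.num_heads, self.head_dim
        q = self.to_q(frame_tokens).view(b, n, h, hd).transpose(1, 2)
        k = self.to_k(first_frame_tokens).view(b, n, h, hd).transpose(1, 2)
        v = self.to_v(first_frame_tokens).view(b, n, h, hd).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # attend to frame 1
        out = out.transpose(1, 2).reshape(b, n, d)
        # Added residually to the frozen spatial self-attention output.
        return self.to_out(out)
```

In the real model the first frame's keys and values may reuse the frozen U-Net's own projections rather than fresh ones; they are spelled out above only for self-containedness.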

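The abstract does not spell out how the Frame Similarity Prior enters the sampling process, so the following is a speculative sketch under our own assumption: later frames start from a mixture of the input image's latent and Gaussian noise, with two coefficients (here alpha and beta, names ours) trading identity stability against motion amplitude.

```python
# Speculative sketch (our assumption, not stated in the abstract):
# initialize each frame's starting latent as a mixture of the input image's
# latent and Gaussian noise. alpha and beta stand in for the two
# adjustable control coefficients the abstract mentions.
import torch

def init_video_latents(first_frame_latent: torch.Tensor,
                       num_frames: int,
                       alpha: float = 0.95,
                       beta: float = 0.3) -> torch.Tensor:
    """first_frame_latent: (batch, channels, height, width).
    Returns latents of shape (num_frames, batch, channels, height, width)."""
    noise = torch.randn((num_frames,) + tuple(first_frame_latent.shape),
                        device=first_frame_latent.device,
                        dtype=first_frame_latent.dtype)
    # Larger alpha keeps frames closer to the input image (more stability);
    # larger beta injects more noise (more motion, weaker identity).
    return alpha * first_frame_latent.unsqueeze(0) + beta * noise
```

The actual method may apply the two coefficients differently within the diffusion process; the sketch only captures the stability-versus-motion trade-off the abstract describes.
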
Authors (12)
  1. Xun Guo
  2. Mingwu Zheng
  3. Liang Hou
  4. Yuan Gao
  5. Yufan Deng
  6. Chongyang Ma
  7. Weiming Hu
  8. Zhengjun Zha
  9. Haibin Huang
  10. Pengfei Wan
  11. Di Zhang
  12. Yufan Liu