LoopAnimate: Loopable Salient Object Animation (2404.09172v2)

Published 14 Apr 2024 in cs.CV and cs.AI

Abstract: Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains such as animated wallpapers require seamless looping, where the first and last frames of the video match. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training so that temporal and positional information is encoded in a single pass; due to GPU memory limits, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the frame count while reducing the set of fine-tuned modules. It also introduces the Temporal Enhanced Motion Module (TEMM), which extends the capacity for encoding temporal and positional information to 36 frames. LoopAnimate thus extends, for the first time, the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality output. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluations.
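
For intuition, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the two ideas the abstract highlights: a temporal motion module whose positional table is pre-allocated for up to 36 frames, and a three-stage schedule that grows the clip length while shrinking the set of fine-tuned modules. All module names, dimensions, and the intermediate 24-frame stage are illustrative assumptions; only the 16- and 36-frame figures come from the abstract.

```python
# Hypothetical sketch of the TEMM idea and the three-stage schedule described
# in the abstract. Names, sizes, and the 24-frame middle stage are assumptions.
import math
import torch
import torch.nn as nn


class TemporalEnhancedMotionModule(nn.Module):
    """Self-attention along the frame axis, with positions sized for 36 frames."""

    def __init__(self, dim: int, max_frames: int = 36, heads: int = 8):
        super().__init__()
        # Standard sinusoidal positional table, pre-allocated for max_frames.
        pos = torch.arange(max_frames).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_frames, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_emb", pe)  # (max_frames, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial, frames, dim); attention runs over the frame axis.
        f = x.shape[1]
        h = self.norm(x + self.pos_emb[:f])
        out, _ = self.attn(h, h, h)
        return x + out


# Three-stage schedule: the frame count grows while the set of fine-tuned
# modules shrinks (stage boundaries other than 16 and 36 are assumptions).
STAGES = [
    {"frames": 16, "trainable": ["unet", "temm", "image_proj"]},
    {"frames": 24, "trainable": ["temm", "image_proj"]},
    {"frames": 36, "trainable": ["temm"]},
]

if __name__ == "__main__":
    temm = TemporalEnhancedMotionModule(dim=320)
    for stage in STAGES:
        clip = torch.randn(2 * 64, stage["frames"], 320)  # toy latent tokens
        print(stage["frames"], temm(clip).shape, stage["trainable"])
```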

Authors (9)
  1. Fanyi Wang (18 papers)
  2. Peng Liu (372 papers)
  3. Haotian Hu (14 papers)
  4. Dan Meng (32 papers)
  5. Jingwen Su (7 papers)
  6. Jinjin Xu (8 papers)
  7. Yanhao Zhang (33 papers)
  8. Xiaoming Ren (6 papers)
  9. Zhiwang Zhang (9 papers)

