Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models (2407.15642v2)

Published 22 Jul 2024 in cs.CV

Abstract: Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent frames via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.

Introduction

Image-to-video (I2V) generation, or image animation, has long been a challenging problem in computer vision. The core objective of I2V generation is to create, from a static image, a video sequence that exhibits natural dynamics while preserving the detailed information of the original image. This capability has important applications in photography, filmmaking, and augmented reality. Despite significant advances by previous methods, maintaining spatio-temporal consistency and ensuring smooth transitions guided by textual prompts have remained challenging.

In the paper titled "Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models," the authors introduce a novel approach to address these issues. The proposed model, Cinemo, aims to achieve superior motion controllability and stronger temporal consistency and smoothness.

Key Contributions

The paper highlights three primary contributions:

  1. Motion Residual Learning: Cinemo deviates from traditional methods that predict subsequent video frames directly. Instead, it learns the distribution of motion residuals, effectively guiding the model to generate motion dynamics that are both smooth and consistent with the input image.
  2. Motion Intensity Control: A Structural Similarity Index (SSIM)-based strategy provides fine-grained control over the intensity of motion in the generated videos. This technique allows for better alignment between the generated video and the input textual prompt, without incurring significant computational costs.
  3. Noise Refinement Using Discrete Cosine Transform (DCT): To mitigate sudden motion changes during the inference phase, the authors introduce DCTInit. This method refines the noise input using low-frequency components extracted from the input image, enabling the model to handle discrepancies between training and inference phases effectively.

Methodology

Motion Residual Learning

Cinemo's architecture builds on a foundational text-to-video (T2V) diffusion model, specifically LaVie. During training, Cinemo learns the distribution of motion residuals rather than raw subsequent frames, with appearance information from the input static image injected as a condition. This directs the model's capacity toward motion while the style, background, and objects of the input image stay consistent across frames.
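
To make this concrete, below is a minimal PyTorch sketch of one training step on motion residuals. The latent shapes, the toy denoiser, and the DDPM noise schedule are illustrative assumptions, not the authors' implementation (which builds on the LaVie backbone).

```python
# Minimal sketch of one training step on motion residuals (PyTorch).
# Shapes, the toy denoiser, and the DDPM schedule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class ToyDenoiser(nn.Module):
    """Stand-in for the video diffusion backbone; predicts the noise added to residuals."""
    def __init__(self, channels=4):
        super().__init__()
        # Input: noisy residuals concatenated with the (repeated) image latent.
        self.net = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_residual, image_latent, t):
        # Timestep conditioning is omitted for brevity.
        cond = image_latent.expand_as(noisy_residual)
        return self.net(torch.cat([noisy_residual, cond], dim=1))

def training_step(model, video_latents, image_latent):
    # video_latents: (B, C, F, H, W); image_latent: (B, C, 1, H, W)
    residual = video_latents - image_latent              # motion residuals, not raw frames
    t = torch.randint(0, T, (video_latents.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(residual)
    noisy_residual = a.sqrt() * residual + (1 - a).sqrt() * noise
    pred = model(noisy_residual, image_latent, t)
    return F.mse_loss(pred, noise)

model = ToyDenoiser()
loss = training_step(model, torch.randn(2, 4, 16, 32, 32), torch.randn(2, 4, 1, 32, 32))
```

In this formulation, the denoised residuals are added back to the repeated input-image latent at inference to recover the video latents, which helps preserve the appearance of the input image.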

SSIM-Based Motion Intensity Control

The authors propose a novel strategy that uses SSIM to control video motion intensity. By computing the SSIM between consecutive frames of the training clip and feeding the resulting score as an additional condition, Cinemo can produce videos whose degree of motion closely follows the user-specified intensity at inference time.
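
A hedged sketch of how such a motion-intensity signal could be computed is shown below, using scikit-image's SSIM over consecutive frames; the exact windowing and the way the scalar is embedded as a condition are assumptions.

```python
# Hedged sketch of a frame-to-frame SSIM motion-intensity score (scikit-image).
import numpy as np
from skimage.metrics import structural_similarity as ssim

def motion_intensity(frames: np.ndarray) -> float:
    """frames: (F, H, W, 3) uint8 clip. Higher score = more motion (lower mean SSIM)."""
    scores = [
        ssim(frames[i], frames[i + 1], channel_axis=-1, data_range=255)
        for i in range(len(frames) - 1)
    ]
    return float(1.0 - np.mean(scores))

clip = (np.random.rand(8, 64, 64, 3) * 255).astype(np.uint8)
print(motion_intensity(clip))  # near 1.0 for random frames, near 0.0 for a static clip
```

At training time the score comes from the ground-truth clip; at inference the user supplies it directly to dial motion up or down.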

DCTInit for Noise Refinement

To address the discrepancies between training and inference noise, Cinemo employs DCTInit. This method utilizes the low-frequency components of the input image's Discrete Cosine Transform to refine the inference noise, leading to smoother and more temporally consistent video generation. The choice of DCT over FFT ensures better handling of color consistency issues, which are critical for realistic video generation.
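
The sketch below illustrates the general idea of DCT-based low-frequency mixing on a single latent; the cutoff size, tensor shapes, and the source of the image-derived latent are illustrative assumptions rather than the paper's exact DCTInit procedure.

```python
# Illustrative DCT low-frequency mixing for initial inference noise (SciPy).
import numpy as np
from scipy.fft import dctn, idctn

def dct_low_freq_mix(image_latent: np.ndarray, noise: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """image_latent, noise: (C, H, W) arrays. Returns refined initial noise."""
    refined = np.empty_like(noise)
    for c in range(noise.shape[0]):
        d_img = dctn(image_latent[c], norm="ortho")
        d_noise = dctn(noise[c], norm="ortho")
        mask = np.zeros_like(d_noise)
        mask[:cutoff, :cutoff] = 1.0                     # keep the low-frequency block
        refined[c] = idctn(mask * d_img + (1.0 - mask) * d_noise, norm="ortho")
    return refined

refined = dct_low_freq_mix(np.random.randn(4, 32, 32), np.random.randn(4, 32, 32))
```

The low-frequency band carries coarse layout and color statistics from the image side while the high frequencies remain Gaussian, which is the intuition behind using such an initialization to damp abrupt motion changes.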

Experimental Results

The authors validate Cinemo's performance on several metrics, including Fréchet Video Distance (FVD), Inception Score (IS), Fréchet Inception Distance (FID), and CLIP similarity (CLIPSIM). The results show that Cinemo achieves state-of-the-art performance across various datasets, outperforming existing methods both qualitatively and quantitatively. Notably, Cinemo demonstrates superior image consistency and motion controllability, essential for generating high-quality animated videos.
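
As an example of how one of these metrics is computed, here is a hedged sketch of CLIPSIM, the mean CLIP image-text cosine similarity over generated frames; the checkpoint and preprocessing are assumptions, not the paper's evaluation code.

```python
# Hedged sketch of CLIPSIM: mean CLIP image-text cosine similarity over frames.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipsim(frames: list[Image.Image], prompt: str) -> float:
    """frames: PIL video frames; prompt: the text used for generation."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```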

Practical Implications

The robust performance and versatility of Cinemo have significant implications for practical applications. The ability to generate consistent, smooth, and controllable animated videos from static images can substantially enhance user experiences in diverse fields such as digital content creation, virtual reality, and augmented reality. Additionally, Cinemo's approach can be extended to video editing and motion transfer tasks, showcasing its adaptability to various video generation applications.

Future Directions

The paper suggests several potential future developments:

  1. Scaling with Transformers: Given the trend towards Transformer-based architectures in video generation, Cinemo's principles could be further validated and optimized using models like Latte.
  2. Resolution Enhancement: Improving the resolution of generated videos beyond the current limit could further enhance the model's applicability in high-definition content creation.
  3. Integration with Real-World Applications: Implementing Cinemo in practical tools and commercial products could bridge the gap between research and real-world usage, providing valuable insights for future improvements.

Conclusion

Cinemo introduces a novel and effective approach to image animation by focusing on motion residual learning and integrating innovative strategies for motion intensity control and noise refinement. The model's ability to produce highly consistent, smooth, and controllable animated videos represents a significant step forward in the field of I2V generation. The extensive quantitative and qualitative experiments demonstrate Cinemo's superiority over existing methods, paving the way for future advancements in AI-driven video generation.

Authors (7)
  1. Xin Ma
  2. Yaohui Wang
  3. Xinyuan Chen
  4. Yuan-Fang Li
  5. Cunjian Chen
  6. Yu Qiao
  7. Gengyun Jia