
Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (2403.02827v1)

Published 5 Mar 2024 in cs.CV

Abstract: Image-to-video (I2V) generation often struggles to maintain high fidelity in open domains. Traditional image animation techniques primarily focus on specific domains such as faces or human poses, making them difficult to generalize to open domains. Several recent I2V frameworks based on diffusion models can generate dynamic content for open-domain images but fail to maintain fidelity. We find that two main causes of low fidelity are the loss of image details and noise prediction biases during the denoising process. To this end, we propose an effective method, applicable to mainstream video diffusion models, that achieves high fidelity by supplementing more precise image information and rectifying the predicted noise. Specifically, given a specified image, our method first adds noise to the input image latent to preserve more details, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases. Our method is tuning-free and plug-and-play. Experimental results demonstrate the effectiveness of our approach in improving the fidelity of generated videos. For more image-to-video results, please refer to the project website: https://noise-rectification.github.io.

Authors (7)
  1. Weijie Li (30 papers)
  2. Litong Gong (4 papers)
  3. Yiran Zhu (13 papers)
  4. Fanda Fan (8 papers)
  5. Biao Wang (93 papers)
  6. Tiezheng Ge (46 papers)
  7. Bo Zheng (205 papers)
Citations (1)

Summary

  • The paper introduces a tuning-free, plug-and-play noise rectification strategy that preserves fine image details during denoising and enhances video fidelity.
  • It refines the latent noise representation without model retraining, ensuring high-quality generation while maintaining computational efficiency.
  • Experimental results demonstrate improved fidelity over existing I2V approaches, supporting scalable, open-domain image-to-video generation.

Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

This paper addresses image-to-video (I2V) generation, where maintaining high fidelity across open domains remains a prominent challenge. Traditional image animation techniques are often limited to specific domains and struggle to adapt to open-domain scenarios, which has led to growing interest in diffusion-based I2V frameworks. However, maintaining fidelity while generating dynamic content remains a significant obstacle, primarily due to the loss of image details and noise prediction biases during the denoising process.

To tackle these issues, the authors present a tuning-free, plug-and-play method that enhances the fidelity of I2V generation through a two-fold strategy: it first adds noise to the latent representation of the input image to retain fine details, and then applies a principled noise rectification during the denoising process to correct prediction biases (see the sketch below).
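In standard diffusion notation, the two steps can be sketched as follows. The first equation is the usual forward-noising step applied to the image latent $z_0$ with a known noise sample $\epsilon$; the second is a hedged illustration of what "proper rectification" could look like, where the blend weight $\lambda_t$ and the conditioning $c$ are introduced here only for exposition and are not taken from the paper.

$$
z_T \;=\; \sqrt{\bar{\alpha}_T}\, z_0 \;+\; \sqrt{1-\bar{\alpha}_T}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\tilde{\epsilon}_t \;=\; \lambda_t\,\epsilon \;+\; (1-\lambda_t)\,\epsilon_\theta(z_t, t, c)
$$

Intuitively, because the noise actually added to the image latent is known, the sampler can pull the model's noise prediction $\epsilon_\theta$ toward it, counteracting prediction bias and keeping the generated frames anchored to the reference image.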

The proposed noise rectification strategy is the most noteworthy aspect of the paper. It draws inspiration from noise-vector refinement approaches in recent image editing work. Unlike methods that require extensive tuning or retraining, it is tuning-free and directly applicable to existing video diffusion models, and it enhances fidelity without compromising computational efficiency, striking an effective balance between generation quality and practical deployment. A hedged sketch of how such a rectification could be integrated into an existing sampler follows.
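To make the plug-and-play claim concrete, below is a minimal Python sketch of how such a rectification could be wired into a generic video diffusion sampler. The `unet` and `scheduler` objects, their method names, and the fixed blend weight `lam` are stand-ins chosen for illustration; they are not the authors' implementation or any specific library's API.

```python
import torch

def rectified_i2v_sample(unet, scheduler, image_latent, text_emb,
                         num_frames=16, lam=0.6):
    """Sketch of tuning-free noise rectification for image-to-video sampling.

    All interfaces here are generic stand-ins, not a specific library API.
    """
    # Step 1: replicate the image latent across frames and add a *known* noise sample.
    latents = image_latent.unsqueeze(2).repeat(1, 1, num_frames, 1, 1)  # (B, C, F, H, W)
    known_noise = torch.randn_like(latents)
    t_start = scheduler.timesteps[0]
    noisy = scheduler.add_noise(latents, known_noise, t_start)

    # Step 2: standard denoising loop, but blend the model's prediction toward
    # the known noise before each scheduler update.
    for t in scheduler.timesteps:
        pred_noise = unet(noisy, t, text_emb)
        rectified = lam * known_noise + (1.0 - lam) * pred_noise  # illustrative fixed weight
        noisy = scheduler.step(rectified, t, noisy)

    return noisy  # denoised video latents, ready for VAE decoding
```

A larger blend weight pushes every frame toward the reference image, while a smaller one leaves the model free to synthesize motion; the paper's actual weighting (for example, per-step or per-frame schedules) is not reproduced here. Because the loop only modifies the predicted noise, no model weights are touched, which is what makes the approach tuning-free.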

Experimental results demonstrate improved fidelity in generated videos compared with existing I2V methods. The technique requires no additional training and integrates seamlessly with mainstream video diffusion models, capturing fine details and maintaining dynamic coherence without increased computational load or intricate model reconfiguration.

The implications of this research are considerable for both theoretical exploration and practical applications in AI. The method provides a new direction in enhancing video fidelity without sacrificing dynamics, a critical balance for many applications in entertainment and virtual reality. Furthermore, this work could pave the way for more generalized solutions in real-time video generation, enriching the capability of AI systems in dealing with dynamic open-domain content.

Future work could extend this approach to larger datasets and a broader range of image types, potentially yielding generative models that balance the competing demands of fidelity and dynamics even more effectively. Integrating the method into more comprehensive multi-modal video generation frameworks could likewise advance AI-driven visual media production.
