Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (2403.02827v1)
Abstract: Image-to-video (I2V) generation in open domains often struggles to maintain high fidelity. Traditional image animation techniques primarily target specific domains such as faces or human poses, so they generalize poorly to open domains. Several recent diffusion-based I2V frameworks can generate dynamic content for open-domain images but fail to preserve fidelity. We find that low fidelity stems from two main factors: the loss of image details and noise-prediction biases during the denoising process. To this end, we propose an effective method that can be applied to mainstream video diffusion models and achieves high fidelity by supplementing more precise image information and rectifying the predicted noise. Specifically, given an input image, our method first adds noise to the image latent to preserve more details, then denoises the noisy latent with proper rectification to alleviate the noise-prediction biases. Our method is tuning-free and plug-and-play. Experimental results demonstrate that our approach improves the fidelity of generated videos. For more image-to-video results, please refer to the project website: https://noise-rectification.github.io.
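To make the abstract's two-step idea concrete, below is a minimal sketch assuming a DDPM-style forward process: the input image latent is noised with a known noise sample, and during denoising the model's noise prediction is blended back toward that known noise. The function names, the frame-replication detail, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def noise_image_latent(z0: torch.Tensor, eps: torch.Tensor,
                       alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Standard DDPM forward process: z_t = sqrt(a_bar)*z0 + sqrt(1-a_bar)*eps.
    Here z0 is the VAE latent of the input image (replicated across frames,
    an assumption on our part) and eps is a noise sample we keep for later."""
    return alpha_bar_t.sqrt() * z0 + (1.0 - alpha_bar_t).sqrt() * eps

def rectify_noise(eps_pred: torch.Tensor, eps_added: torch.Tensor,
                  lam: float = 0.5) -> torch.Tensor:
    """Blend the model's noise prediction with the noise actually added,
    nudging the denoising trajectory back toward the input image.
    `lam` is a hypothetical rectification weight, not the paper's value."""
    return lam * eps_added + (1.0 - lam) * eps_pred

# Schematic use inside a denoising loop: at each step t,
#   eps_pred = unet(z_t, t, text_emb)
#   eps_hat  = rectify_noise(eps_pred, eps_added)
#   z_prev   = scheduler.step(eps_hat, t, z_t)
```

Because both functions are plug-and-play transforms on latents and noise tensors, they can wrap an existing video diffusion sampler without retraining, which matches the tuning-free claim.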
Authors: Weijie Li, Litong Gong, Yiran Zhu, Fanda Fan, Biao Wang, Tiezheng Ge, Bo Zheng