TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models (2403.17005v1)

Published 25 Mar 2024 in cs.CV and cs.MM

Abstract: Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when adapting diffusion models to animate a static image (i.e., image-to-video generation). The difficulty stems from the fact that the diffusion process for the subsequent animated frames should not only preserve faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe for the image-to-video diffusion paradigm that pivots on an image noise prior derived from the static image to jointly trigger inter-frame relational reasoning and ease coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through a one-step backward diffusion process based on both the static image and the noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes the image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs a 3D-UNet over the noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, the reference and residual noise of each frame are dynamically merged via an attention mechanism for final video generation. Extensive experiments on the WebVid-10M, DTDB, and MSR-VTT datasets demonstrate the effectiveness of TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.
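The sketch below illustrates, in PyTorch-style pseudocode, how the dual-path noise prediction described in the abstract could be organized: a shortcut path that derives a per-frame reference noise from the static image via one-step backward diffusion, a residual path through a 3D-UNet, and an attention-based merge of the two estimates. It is a minimal sketch based only on the abstract; the class name `TRIPNoisePredictor`, the tensor layouts, the way the image latent is fed to the 3D-UNet, and the simple per-pixel attention fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of TRIP's dual-path noise prediction (assumptions noted inline).
import torch
import torch.nn as nn

class TRIPNoisePredictor(nn.Module):
    def __init__(self, unet3d: nn.Module, latent_dim: int = 4):
        super().__init__()
        self.unet3d = unet3d  # assumed 3D-UNet: takes (B, 2C, F, H, W) latents + timesteps
        # Lightweight attention that weighs reference vs. residual noise per position
        # (a stand-in for the paper's merging mechanism).
        self.fusion = nn.MultiheadAttention(embed_dim=latent_dim, num_heads=1,
                                            batch_first=True)

    @staticmethod
    def image_noise_prior(noised_video, image_latent, alpha_bar_t):
        # One-step backward diffusion: approximate each frame's clean latent with the
        # static image latent and solve z_t = sqrt(a)*z_0 + sqrt(1-a)*eps for eps.
        a = alpha_bar_t.view(-1, 1, 1, 1, 1)                      # (B,1,1,1,1)
        return (noised_video - a.sqrt() * image_latent.unsqueeze(2)) / (1.0 - a).sqrt()

    def forward(self, noised_video, image_latent, timesteps, alpha_bar_t):
        # noised_video: (B, C, F, H, W); image_latent: (B, C, H, W)
        ref_noise = self.image_noise_prior(noised_video, image_latent, alpha_bar_t)

        # Residual path: 3D-UNet sees the noised video latents plus the repeated image latent.
        frames = noised_video.shape[2]
        cond = image_latent.unsqueeze(2).expand(-1, -1, frames, -1, -1)
        res_noise = self.unet3d(torch.cat([noised_video, cond], dim=1), timesteps)

        # Dynamically merge reference and residual noise; here attention over the
        # two candidates at each spatio-temporal position plays that role.
        b, c, f, h, w = ref_noise.shape
        pair = torch.stack([ref_noise, res_noise], dim=-1)            # (B,C,F,H,W,2)
        tokens = pair.permute(0, 2, 3, 4, 5, 1).reshape(-1, 2, c)     # (B*F*H*W, 2, C)
        fused, _ = self.fusion(tokens, tokens, tokens)
        fused = fused.mean(dim=1).reshape(b, f, h, w, c).permute(0, 4, 1, 2, 3)
        return fused  # merged per-frame noise estimate used for the denoising step
```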

Authors (7)
  1. Zhongwei Zhang (36 papers)
  2. Fuchen Long (13 papers)
  3. Yingwei Pan (77 papers)
  4. Zhaofan Qiu (37 papers)
  5. Ting Yao (127 papers)
  6. Yang Cao (295 papers)
  7. Tao Mei (209 papers)
Citations (14)
