MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance (2308.10079v3)

Published 19 Aug 2023 in cs.CV

Abstract: This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io
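The abstract's central idea is that pixels linked across frames by explicit optical flow are treated as observations of one shared value, so enforcing temporal consistency becomes a least-squares problem with a closed-form solution. The sketch below is a hedged illustration of that step only: the function name mediate_frames, the pixel_to_group correspondence map, and the choice to operate on decoded frames are assumptions made for exposition, not the authors' implementation, which mediates frame-wise diffusion scores inside the sampling loop and adds a workaround for latent-space models such as Stable Diffusion.

```python
import numpy as np

def mediate_frames(frames, pixel_to_group):
    """Illustrative closed-form mediation over flow-based correspondences.

    frames:         (T, H, W, C) float array of independently generated frames.
    pixel_to_group: (T, H, W) int array assigning each pixel to a correspondence
                    group id, assumed to be built from explicit optical flows.

    Pixels in the same group are treated as observations of one shared value;
    the least-squares consistent solution is the per-group mean, which is then
    scattered back to every member pixel.
    """
    T, H, W, C = frames.shape
    flat_vals = frames.reshape(-1, C)       # (T*H*W, C) observations
    flat_ids = pixel_to_group.reshape(-1)   # (T*H*W,) group assignment
    n_groups = int(flat_ids.max()) + 1

    # Closed-form solution of  min_x  sum_i ||x_{g(i)} - v_i||^2 :
    # the mean of the observations belonging to each group.
    sums = np.zeros((n_groups, C))
    counts = np.zeros(n_groups)
    np.add.at(sums, flat_ids, flat_vals)
    np.add.at(counts, flat_ids, 1.0)
    group_means = sums / np.maximum(counts, 1.0)[:, None]

    # Replace each pixel with its group's shared value.
    return group_means[flat_ids].reshape(T, H, W, C)
```

As a toy example, if flow maps every pixel of frame 1 onto the matching pixel of frame 0, each correspondence pair is replaced by the average of its two observations, which is exactly the kind of flicker-removing mediation the abstract describes.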
