FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis (2312.17681v1)

Published 29 Dec 2023 in cs.CV and cs.MM

Abstract: Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).

References (51)
  1. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  2. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.
  3. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  4. John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
  5. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
  6. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023.
  7. Medm: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. arXiv preprint arXiv:2308.10079, 2023a.
  8. Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models. arXiv preprint arXiv:2305.19193, 2023b.
  9. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
  10. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  11. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  12. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  13. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  14. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  15. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  16. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  17. Video diffusion models. arXiv:2204.03458, 2022b.
  18. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073, 2023.
  19. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  20. Real-time intermediate flow estimation for video frame interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  21. Stylizing video by example. ACM Transactions on Graphics (TOG), 38(4):1–11, 2019.
  22. Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021.
  23. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  24. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  25. Shape-aware text-driven layered video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14317–14326, 2023.
  26. Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891, 2023.
  27. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  28. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  29. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  30. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  31. Codef: Content deformation fields for temporally consistent video processing. arXiv preprint arXiv:2308.07926, 2023.
  32. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  33. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  34. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
  35. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  36. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  37. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  38. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  39. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  40. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  41. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  42. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  43. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023a.
  44. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023b.
  45. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  46. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8121–8130, 2022.
  47. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  48. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.
  49. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023a.
  50. Sine: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6027–6037, 2023b.
  51. Controlvideo: Adding conditional control for one shot text-to-video editing. arXiv preprint arXiv:2305.17098, 2023.

Summary

  • The paper introduces FlowVid, a framework that edits a single frame and propagates that edit across the video to achieve temporally consistent synthesis.
  • It encodes optical flow as a supplementary reference rather than a hard constraint, tolerating imperfect flow estimation while generating video up to 10.5x faster than prior methods.
  • FlowVid is validated on diverse tasks such as stylization, object swaps, and local edits, delivering high-resolution outputs, though it remains limited by misaligned first-frame edits and large occlusions from rapid motion.

Introduction

The proliferation of diffusion models in image synthesis has now begun to extend into the field of videos. While remarkable strides have been made in image-to-image (I2I) synthesis, challenges in video-to-video (V2V) synthesis persist, particularly when it comes to maintaining temporal continuity across multiple frames. To tackle this, a new framework called FlowVid has been introduced for consistent V2V synthesis that effectively leverages both spatial conditions and optical flow information in source videos.

Harnessing Optical Flow

Most existing methods rely heavily on optical flow to maintain temporal consistency, but they falter when the flow estimate is imperfect. FlowVid adopts a different strategy: it encodes the flow, via warping from the first frame, as a supplementary reference rather than a hard constraint. This lets a user edit the first video frame and propagate that edit to subsequent frames without being overly dependent on flow accuracy. The model exhibits flexibility in editing, efficiency in generation, and high-quality output preferred by users in studies.
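
To make the flow-warped reference concrete, here is a minimal sketch, assuming PyTorch, of backward-warping an edited first frame toward a later frame and masking unreliable flow via a forward-backward consistency check. The function names and flow conventions here are illustrative assumptions, not FlowVid's released code.

```python
# Hypothetical sketch (not the authors' code): warp an edited first frame into the
# geometry of a later frame using an optical-flow field, and flag pixels where the
# flow is unreliable (occlusions, estimation errors).
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Backward-warp `image` (N,C,H,W) with a pixel-space flow field (N,2,H,W)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)
    coords = grid + flow                      # where each target pixel samples from
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(image, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def flow_consistency_mask(flow_fwd, flow_bwd, tol=1.0):
    """1 where forward/backward flows agree within `tol` pixels, else 0 (unreliable)."""
    flow_bwd_at_target = warp_with_flow(flow_bwd, flow_fwd)
    err = (flow_fwd + flow_bwd_at_target).norm(dim=1, keepdim=True)
    return (err < tol).to(flow_fwd.dtype)

# Usage: `edited0` is the I2I-edited first frame, `flow_k_to_0` the flow from
# frame k back to frame 0; the warped result is a rough, possibly imperfect
# reference for frame k, and the mask flags where it should not be trusted.
# reference_k = warp_with_flow(edited0, flow_k_to_0)
```

The warped frame and its reliability mask would serve only as a soft condition for the diffusion model, so regions where the flow is wrong can still be corrected during denoising.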

Framework Details

FlowVid follows a decoupled edit-propagate design: a single frame is first edited with any prevalent I2I model, and that edit is then propagated to subsequent frames by the flow-conditioned diffusion model. Because it is compatible with existing I2I models, it supports a range of modifications, including stylization, object swaps, and local edits, and its autoregressive mechanism allows the generation of lengthy videos, as sketched below. It is also markedly faster: a 4-second, 30 FPS, 512x512 video (120 frames) takes only about 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively.
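
A minimal sketch of how this edit-propagate loop could be orchestrated autoregressively over batches of frames; `i2i_edit`, `estimate_flow`, and `flowvid_batch` are hypothetical placeholders for an image editor, a flow estimator, and the flow-conditioned V2V generator, not an actual released API.

```python
# Hypothetical orchestration of the decoupled edit-propagate design: any
# off-the-shelf I2I model edits frame 0, and the V2V model then propagates
# that edit batch-by-batch, autoregressively, over the source video.
from typing import Callable, List, Sequence

def edit_and_propagate(
    frames: Sequence,                 # source video frames
    prompt: str,
    i2i_edit: Callable,               # e.g. an InstructPix2Pix-style editor
    estimate_flow: Callable,          # e.g. a RAFT/GMFlow-style flow estimator
    flowvid_batch: Callable,          # flow-conditioned V2V generator (one batch -> list of frames)
    batch_size: int = 16,
) -> List:
    edited_anchor = i2i_edit(frames[0], prompt)   # 1) edit a single anchor frame
    outputs = [edited_anchor]
    # 2) propagate autoregressively: each batch is conditioned on the source
    #    frames, their flow relative to the anchor, and the last generated frame.
    for start in range(1, len(frames), batch_size):
        batch = frames[start:start + batch_size]
        flows = [estimate_flow(frames[0], f) for f in batch]   # anchor <-> frame flow
        outputs += flowvid_batch(
            source_frames=batch,
            flow_to_anchor=flows,
            anchor=edited_anchor,
            previous=outputs[-1],     # autoregressive hand-off between batches
            prompt=prompt,
        )
    return outputs
```

Handing the last generated frame of one batch to the next batch is what allows arbitrarily long videos to be produced without re-editing.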

Comparative Results and Limitations

FlowVid has been evaluated against contemporary methods and shows clear advantages in both efficiency and synthesis quality: it quickly produces high-resolution, temporally coherent videos and is preferred in user studies (45.7% of the time, versus 3.5% for CoDeF, 10.2% for Rerender, and 40.4% for TokenFlow). Nonetheless, its effectiveness degrades when the edited first frame is misaligned with the source structure or when rapid motion causes large occlusions.

Conclusion

FlowVid offers a promising approach to V2V synthesis that addresses the central challenge of temporal consistency. By combining spatial conditions with imperfect optical flow, it produces videos that are visually coherent and adhere closely to the user's target prompts. Despite its limitations, the framework paves the way for further work on efficient, consistent video synthesis.
