FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis (2312.17681v1)

Published 29 Dec 2023 in cs.CV and cs.MM

Abstract: Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).

References (51)
  1. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  2. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.
  3. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  4. John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
  5. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
  6. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023.
  7. Medm: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. arXiv preprint arXiv:2308.10079, 2023a.
  8. Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models. arXiv preprint arXiv:2305.19193, 2023b.
  9. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
  10. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  11. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  12. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  13. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  14. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  15. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  16. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  17. Video diffusion models. arXiv:2204.03458, 2022b.
  18. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073, 2023.
  19. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  20. Real-time intermediate flow estimation for video frame interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  21. Stylizing video by example. ACM Transactions on Graphics (TOG), 38(4):1–11, 2019.
  22. Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021.
  23. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  24. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  25. Shape-aware text-driven layered video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14317–14326, 2023.
  26. Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891, 2023.
  27. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  28. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  29. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  30. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  31. Codef: Content deformation fields for temporally consistent video processing. arXiv preprint arXiv:2308.07926, 2023.
  32. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  33. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  34. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
  35. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  36. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  37. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  38. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  39. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  40. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  41. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  42. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  43. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023a.
  44. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023b.
  45. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  46. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8121–8130, 2022.
  47. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  48. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.
  49. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023a.
  50. Sine: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6027–6037, 2023b.
  51. Controlvideo: Adding conditional control for one shot text-to-video editing. arXiv preprint arXiv:2305.17098, 2023.

Summary

  • The paper introduces FlowVid, a framework that edits a single frame and propagates that edit across the video to achieve temporally consistent synthesis.
  • It encodes optical flow as a supplementary reference rather than a hard constraint, tolerating imperfect flow estimation while generating video up to 10.5x faster than prior methods.
  • FlowVid is validated on diverse tasks such as stylization, object swaps, and local edits, delivering high-resolution outputs, though it remains limited by misaligned first-frame edits and large occlusions from rapid motion.

Introduction

The proliferation of diffusion models in image synthesis has now begun to extend into the field of videos. While remarkable strides have been made in image-to-image (I2I) synthesis, challenges in video-to-video (V2V) synthesis persist, particularly when it comes to maintaining temporal continuity across multiple frames. To tackle this, a new framework called FlowVid has been introduced for consistent V2V synthesis that effectively leverages both spatial conditions and optical flow information in source videos.

Harnessing Optical Flow

Most existing methods rely heavily on optical flow to maintain temporal consistency, but they falter when the flow estimate is imperfect. FlowVid adopts a different strategy: it encodes the flow, via warping from the first frame, as a supplementary reference rather than a hard constraint. This lets a user edit the first video frame and propagate that edit to subsequent frames without being overly dependent on flow accuracy. The model exhibits flexibility in editing, efficiency in generation, and high-quality output preferred by users in studies.
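
To make the flow-warped reference concrete, here is a minimal sketch, assuming PyTorch, of backward-warping an edited first frame toward a later frame and masking unreliable flow via a forward-backward consistency check. The function names and flow conventions here are illustrative assumptions, not FlowVid's released code.

```python
# Hypothetical sketch (not the authors' code): warp an edited first frame into the
# geometry of a later frame using an optical-flow field, and flag pixels where the
# flow is unreliable (occlusions, estimation errors).
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Backward-warp `image` (N,C,H,W) with a pixel-space flow field (N,2,H,W)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)
    coords = grid + flow                      # where each target pixel samples from
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(image, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def flow_consistency_mask(flow_fwd, flow_bwd, tol=1.0):
    """1 where forward/backward flows agree within `tol` pixels, else 0 (unreliable)."""
    flow_bwd_at_target = warp_with_flow(flow_bwd, flow_fwd)
    err = (flow_fwd + flow_bwd_at_target).norm(dim=1, keepdim=True)
    return (err < tol).to(flow_fwd.dtype)

# Usage: `edited0` is the I2I-edited first frame, `flow_k_to_0` the flow from
# frame k back to frame 0; the warped result is a rough, possibly imperfect
# reference for frame k, and the mask flags where it should not be trusted.
# reference_k = warp_with_flow(edited0, flow_k_to_0)
```

The warped frame and its reliability mask would serve only as a soft condition for the diffusion model, so regions where the flow is wrong can still be corrected during denoising.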

Framework Details

FlowVid follows a decoupled edit-propagate design: a single frame is first edited with any prevalent I2I model, and that edit is then propagated to subsequent frames by the flow-conditioned diffusion model. Because it is compatible with existing I2I models, it supports a range of modifications, including stylization, object swaps, and local edits, and its autoregressive mechanism allows the generation of lengthy videos, as sketched below. It is also markedly faster: a 4-second, 30 FPS, 512x512 video (120 frames) takes only about 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively.
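
A minimal sketch of how this edit-propagate loop could be orchestrated autoregressively over batches of frames; `i2i_edit`, `estimate_flow`, and `flowvid_batch` are hypothetical placeholders for an image editor, a flow estimator, and the flow-conditioned V2V generator, not an actual released API.

```python
# Hypothetical orchestration of the decoupled edit-propagate design: any
# off-the-shelf I2I model edits frame 0, and the V2V model then propagates
# that edit batch-by-batch, autoregressively, over the source video.
from typing import Callable, List, Sequence

def edit_and_propagate(
    frames: Sequence,                 # source video frames
    prompt: str,
    i2i_edit: Callable,               # e.g. an InstructPix2Pix-style editor
    estimate_flow: Callable,          # e.g. a RAFT/GMFlow-style flow estimator
    flowvid_batch: Callable,          # flow-conditioned V2V generator (one batch -> list of frames)
    batch_size: int = 16,
) -> List:
    edited_anchor = i2i_edit(frames[0], prompt)   # 1) edit a single anchor frame
    outputs = [edited_anchor]
    # 2) propagate autoregressively: each batch is conditioned on the source
    #    frames, their flow relative to the anchor, and the last generated frame.
    for start in range(1, len(frames), batch_size):
        batch = frames[start:start + batch_size]
        flows = [estimate_flow(frames[0], f) for f in batch]   # anchor <-> frame flow
        outputs += flowvid_batch(
            source_frames=batch,
            flow_to_anchor=flows,
            anchor=edited_anchor,
            previous=outputs[-1],     # autoregressive hand-off between batches
            prompt=prompt,
        )
    return outputs
```

Handing the last generated frame of one batch to the next batch is what allows arbitrarily long videos to be produced without re-editing.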

Comparative Results and Limitations

FlowVid has been evaluated against contemporary methods and shows clear advantages in both efficiency and synthesis quality: it quickly produces high-resolution, temporally coherent videos and is preferred in user studies (45.7% of the time, versus 3.5% for CoDeF, 10.2% for Rerender, and 40.4% for TokenFlow). Nonetheless, its effectiveness degrades when the edited first frame is misaligned with the source structure or when rapid motion causes large occlusions.

Conclusion

FlowVid offers a promising approach to V2V synthesis that addresses the central challenge of temporal consistency. By combining spatial conditions with imperfect optical flow, it produces videos that are visually coherent and adhere closely to the user's target prompts. Despite its limitations, the framework paves the way for further work on efficient, consistent video synthesis.
