
Video Interpolation with Diffusion Models

(2404.01203)
Published Apr 1, 2024 in cs.CV

Abstract

We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.

Figure: Comparison of FID scores between VIDIM and the baseline model at varying guidance weights.

Overview

  • Video Interpolation with Diffusion Models (VIDIM) introduces a novel approach using cascaded diffusion models to create intermediate frames for videos, aimed at increasing frame rate or generating slow-motion effects.

  • VIDIM leverages a two-step generative process combining a base diffusion model for initial low-resolution video generation and a super-resolution model for final high-quality output.

  • It outperforms existing methods in scenarios with complex and ambiguous motion, producing high-quality, plausible videos through architectural innovations such as a video-adapted UNet with temporal attention and classifier-free guidance.

  • Empirical evaluations and user studies confirm VIDIM's superiority in producing realistic and temporally consistent videos, promising advancements in video interpolation and processing.

Exploring the Frontier of Video Interpolation with VIDIM: A Generative Approach

Introduction

Video interpolation involves creating intermediate frames between two consecutive frames of a video, aiming either to increase the frame rate or to generate slow-motion footage. Traditional methods have largely relied on linear motion estimates or optical flow algorithms, which often struggle with complex, non-linear motions or ambiguous scenarios. The paper introduces Video Interpolation with Diffusion Models (VIDIM), a novel generative approach that leverages cascaded diffusion models to tackle these challenges head-on. VIDIM significantly outperforms existing state-of-the-art methods at handling complex and ambiguous motion, generating high-quality, plausible videos even in the toughest scenarios.

Methodology

Cascaded Diffusion Models for Video Generation

VIDIM's architecture employs a two-step generative process. Initially, it generates the target video at a lower resolution using a base diffusion model conditioned on start and end frames. Subsequently, a super-resolution model conditioned on this low-resolution video and the original high-resolution frames synthesizes the final high-resolution video. This cascaded approach, inspired by previous successes in the field, ensures that VIDIM can capture fine details and maintain temporal consistency across frames.
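
To make the two-stage pipeline concrete, here is a minimal sketch of sampling from the cascade. The `base_model.sample` and `sr_model.sample` interfaces and the `downsample` helper are hypothetical stand-ins for illustration, not the authors' actual API:

```python
import numpy as np

def downsample(frame: np.ndarray, factor: int = 4) -> np.ndarray:
    """Naive average-pool downsampling (a stand-in for the paper's resizer).
    Assumes frame is (H, W, C) with H and W divisible by `factor`."""
    h, w, c = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def sample_cascade(base_model, sr_model, start_hi, end_hi, num_frames=7):
    # Stage 1: jointly denoise all low-resolution intermediate frames,
    # conditioned on downsampled copies of the start and end frames.
    start_lo, end_lo = downsample(start_hi), downsample(end_hi)
    video_lo = base_model.sample(cond_frames=(start_lo, end_lo),
                                 num_frames=num_frames)

    # Stage 2: super-resolve, conditioned on both the low-res video and the
    # original high-resolution endpoints so fine detail carries through.
    return sr_model.sample(low_res_video=video_lo,
                           cond_frames=(start_hi, end_hi))
```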

Architectural Innovations and Training Regimen

The study introduces several key innovations in the model's architecture and training process. Notably, VIDIM adapts a UNet architecture to video by allowing feature maps to mix across frames through temporal attention blocks. It also introduces a frame-conditioning technique that assigns fake noise levels to the conditioning frames, letting information from those frames propagate through the network without any additional parameters. Finally, the models employ classifier-free guidance on the start and end frames, which dramatically improves sample quality and proves critical to realistic interpolation results.
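
Two of these ideas are easy to express in code. The snippet below is a hedged sketch with hypothetical function names and tensor layouts: it first shows how clean conditioning frames can be stacked into the denoiser's input with a near-zero "fake" noise level, then shows one common convention for classifier-free guidance (the exact parameterization of the guidance weight `w` varies across papers):

```python
import numpy as np

def build_denoiser_input(noisy_frames, start, end, t):
    """Stack the clean conditioning frames into the batch with a 'fake'
    near-zero noise level. The shared UNet then treats them as already
    denoised, and temporal attention propagates their content to the
    generated frames without any extra parameters."""
    frames = np.concatenate([start[None], noisy_frames, end[None]], axis=0)
    # Per-frame noise levels: ~0 for conditioning frames, t for generated ones.
    noise_levels = np.concatenate([[0.0],
                                   np.full(len(noisy_frames), t),
                                   [0.0]])
    return frames, noise_levels

def guided_prediction(model, z_t, t, cond_frames, w):
    """Classifier-free guidance on the start/end frames. With this
    convention, w = 1 recovers the conditional model; w > 1 extrapolates
    further away from the unconditional prediction."""
    eps_cond = model(z_t, t, cond=cond_frames)
    eps_uncond = model(z_t, t, cond=None)  # conditioning dropped (null signal)
    return eps_uncond + w * (eps_cond - eps_uncond)
```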

During training, VIDIM models are optimized using a continuous-time objective based on the evidence lower bound (ELBO), with adjustments for video-specific dynamics. Training leverages large-scale video datasets, with procedures in place to filter out undesirable examples, such as those with rapid scene cuts, ensuring that the models learn from relevant data.
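
A minimal training step in this continuous-time style might look as follows, assuming an epsilon-prediction model and a variance-preserving cosine schedule (illustrative choices, not necessarily the paper's exact weighting or schedule):

```python
import numpy as np

def cosine_schedule(t: float):
    """Variance-preserving cosine schedule with alpha^2 + sigma^2 = 1."""
    alpha = np.cos(0.5 * np.pi * t)
    return alpha, float(np.sqrt(1.0 - alpha ** 2))

def training_step(model, video, cond_frames, rng: np.random.Generator):
    t = rng.uniform()                          # continuous time in [0, 1]
    alpha, sigma = cosine_schedule(t)
    eps = rng.standard_normal(video.shape)
    z_t = alpha * video + sigma * eps          # forward-diffuse all frames jointly
    eps_hat = model(z_t, t, cond=cond_frames)  # jointly denoise every frame
    # Plain epsilon-prediction MSE; the full ELBO objective applies a
    # time-dependent weighting derived from the noise schedule.
    return float(np.mean((eps_hat - eps) ** 2))
```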

Empirical Evaluation

Benchmarking Against State-of-the-Art

VIDIM's performance was extensively evaluated against several state-of-the-art video interpolation methods on challenging datasets derived from the DAVIS and UCF101 collections. The evaluation covered both generative metrics, such as Fréchet Video Distance (FVD), and traditional reconstruction-based metrics. VIDIM consistently outperformed the baseline models, especially in scenarios characterized by large and ambiguous motion, validating its superior ability to generate plausible and temporally consistent videos.

User Study

A user study on video quadruplets generated from the same input frame pairs further highlighted VIDIM's advantages. Participants overwhelmingly preferred VIDIM-generated videos over those produced by baseline models, underscoring its effectiveness at producing high-quality, realistic videos even under difficult conditions.

Ablations and Further Insights

The study carried out ablations to dissect the contributions of various components, particularly highlighting the importance of explicit frame conditioning and classifier-free guidance in achieving optimal results. Scalability tests further demonstrated VIDIM's capacity to improve with larger models, though balancing the parameter count in both base and super-resolution models was crucial for maximizing quality.

Conclusion and Future Directions

VIDIM represents a significant advancement in video interpolation, notably for scenarios that have historically posed challenges for generative models. By leveraging cascaded diffusion models and novel architectural tweaks, VIDIM sets new standards for video interpolation quality. Future work might explore its application to other video generation tasks, extend its capabilities to arbitrary aspect ratios, or further refine super-resolution models to enhance quality. The findings promise exciting developments in video processing and generative modeling, paving the way for more realistic and complex video generation tasks.
