- The paper introduces a unified architecture that employs scale-agnostic motion estimation to simplify training and improve frame interpolation across diverse motion scales.
- The method uses a Gram matrix loss to enhance sharpness and reduce blurriness in synthesized frames, addressing common challenges in wide disocclusion areas.
- Empirical results show that FILM outperforms state-of-the-art methods on large motion benchmarks while maintaining efficient inference and competitive performance on standard datasets.
FILM: Frame Interpolation for Large Motion
The paper "FILM: Frame Interpolation for Large Motion" presents a novel approach to frame interpolation, specifically addressing the challenge of synthesizing intermediate frames in scenarios characterized by large motion. The authors introduce a unified, single-stage neural network architecture, leveraging shared weight feature pyramids and a scale-agnostic bi-directional motion estimator to enhance motion interpolation across varying scales.
Key Contributions
- Scale-Agnostic Motion Estimation: The paper proposes a motion estimator whose weights are shared across the levels of the feature pyramid, so the same module handles motion at every scale. Because large motion at a coarse pyramid level looks like small motion at a finer level, this weight sharing lets the abundant small-motion pixels also supervise large-motion estimation, allowing the model to handle both small and large displacements.
- Unified Architecture: Unlike other interpolation methods that rely on pre-trained optical flow or depth networks, FILM is trained directly from frame triplets. This significantly simplifies the pipeline and reduces the dependency on scarce pre-training data, making it particularly suitable for applications involving large motion.
- Use of Gram Matrix Loss: To address the common issue of blurriness in interpolated frames, especially in wide disocclusion regions, the authors employ a Gram matrix-based loss. By matching the correlations of deep features between the synthesized and ground-truth frames rather than the features themselves, it sharpens the result and improves realism (a minimal sketch of such a loss follows this list).
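As a rough illustration of the Gram matrix term, the loss compares channel-wise correlations of deep features (e.g., from a pre-trained VGG network) between the predicted and ground-truth frames instead of comparing the features directly. The PyTorch sketch below assumes the feature maps have already been extracted; the function names and the simple layer-wise averaging are illustrative and do not reproduce the authors' exact formulation or weighting.

```python
import torch

def gram_matrix(feat):
    """Channel-correlation (Gram) matrix of a feature map of shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)  # (B, C, C), normalized

def gram_loss(pred_feats, gt_feats):
    """Mean squared distance between Gram matrices, summed over feature layers."""
    loss = 0.0
    for fp, fg in zip(pred_feats, gt_feats):
        loss = loss + torch.mean((gram_matrix(fp) - gram_matrix(fg)) ** 2)
    return loss
```

In the paper's combined style loss, a term of this form is weighted together with the color and perceptual losses, and the middle frame of each training triplet supplies the ground truth.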
Results
The FILM approach outperforms state-of-the-art methods on the large-motion Xiph benchmark while remaining competitive on standard datasets such as Vimeo-90K, Middlebury, and UCF101. Trained with color losses alone it already scores well on pixel-level benchmarks, and training with the full style loss, which combines perceptual and Gram matrix terms, markedly improves perceptual quality. FILM also runs faster at inference than ABME and SoftSplat, with only a modest increase in memory usage, making it efficient for practical applications.
Implications and Future Research
The implications of this work are substantial for video synthesis, particularly in applications like digital photography, where near-duplicate photos can be converted to visually engaging slow-motion videos. The methodology offers a simplified model that can handle diverse motion magnitudes without the need for complex pre-training stages.
Future research could explore the integration of this approach with real-time video processing, enhancing its applicability to live media. There is also potential for extending this model to other domains where temporal interpolation is beneficial, such as virtual reality and augmented reality.
Overall, the paper offers a meaningful advance in frame interpolation, chiefly by addressing large motion, which has long challenged existing methods. Its scale-agnostic feature sharing and Gram matrix loss could inspire further work on improving the quality and applicability of video synthesis technologies.