- The paper introduces an innovative method that harnesses adaptive separable convolution for efficient video frame interpolation, eliminating the need for separate optical flow estimation.
- It employs pairs of 1D kernels within a CNN encoder-decoder framework to drastically reduce memory consumption while maintaining interpolation quality.
- Quantitative and qualitative evaluations demonstrate state-of-the-art performance, particularly in high-resolution and dynamic motion scenarios.
Video Frame Interpolation via Adaptive Separable Convolution
The paper introduces a method for video frame interpolation based on adaptive separable convolution. This approach departs from traditional techniques that rely on estimating optical flow between frames, a computationally intensive and error-prone process that degrades under occlusion or abrupt brightness changes. The authors instead fold motion estimation and frame synthesis into a single convolution step per output pixel, which both simplifies the pipeline and addresses the memory constraints of earlier kernel-based methods.
Methodology Overview
The core innovation is the reformulation of frame interpolation as local separable convolution over the two input frames. Instead of estimating a full 2D kernel for every output pixel, the network predicts pairs of 1D kernels, which significantly reduces memory usage while maintaining quality: an $n \times n$ convolution kernel is encoded with only $2n$ variables. A fully convolutional neural network (CNN) is trained end-to-end to predict these 1D kernels for each output pixel.
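As a concrete illustration of the idea (not the authors' code), the sketch below synthesizes one output pixel from co-located patches of the two input frames, forming each per-pixel 2D kernel as the outer product of a vertical and a horizontal 1D kernel; the function name and toy sizes are assumptions made for this example.

```python
import numpy as np

def interpolate_pixel(patch1, patch2, k1_v, k1_h, k2_v, k2_h):
    """Synthesize one output pixel from co-located patches of the two input frames.

    Each per-pixel 2D kernel is the outer product of a vertical and a horizontal
    1D kernel, so 2n coefficients stand in for an n x n kernel.
    """
    kernel1 = np.outer(k1_v, k1_h)   # n x n kernel acting on frame 1
    kernel2 = np.outer(k2_v, k2_h)   # n x n kernel acting on frame 2
    return float(np.sum(patch1 * kernel1) + np.sum(patch2 * kernel2))

# Toy example with n = 5 (the paper uses much larger per-pixel kernels).
n = 5
rng = np.random.default_rng(0)
patch1, patch2 = rng.random((n, n)), rng.random((n, n))
k1_v, k1_h, k2_v, k2_h = (rng.random(n) for _ in range(4))

pixel = interpolate_pixel(patch1, patch2, k1_v, k1_h, k2_v, k2_h)
print("interpolated value:", pixel)
print("coefficients per pixel:", 4 * n, "separable vs", 2 * n * n, "full 2D")
```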
Network Architecture
The CNN follows an encoder-decoder design and uses bilinear interpolation for upsampling, which mitigates the checkerboard artifacts commonly associated with transposed-convolution upsampling. It comprises multiple stages connected by skip connections to improve its representational capability. The network predicts the kernels for all output pixels in a single pass, improving both efficiency and accuracy.
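The following PyTorch sketch illustrates this general pattern, a simplified stand-in rather than the authors' released architecture: a contracting encoder, a decoder that upsamples with bilinear interpolation instead of transposed convolutions, skip connections between matching resolutions, and four per-pixel 1D kernel maps as outputs. Module names, depths, and channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, used at every encoder/decoder stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class SeparableKernelNet(nn.Module):
    """Encoder-decoder that predicts four 1D kernel maps per output pixel."""

    def __init__(self, kernel_size=51):
        super().__init__()
        self.enc1 = conv_block(6, 32)            # two RGB frames concatenated
        self.enc2 = conv_block(32, 64)
        self.enc3 = conv_block(64, 128)
        self.dec2 = conv_block(128 + 64, 64)     # skip connection from enc2
        self.dec1 = conv_block(64 + 32, 32)      # skip connection from enc1
        # One head per 1D kernel: vertical/horizontal for each input frame.
        self.heads = nn.ModuleList(
            [nn.Conv2d(32, kernel_size, 3, padding=1) for _ in range(4)]
        )

    def forward(self, frame1, frame2):
        x1 = self.enc1(torch.cat([frame1, frame2], dim=1))
        x2 = self.enc2(F.avg_pool2d(x1, 2))
        x3 = self.enc3(F.avg_pool2d(x2, 2))
        # Bilinear upsampling avoids the checkerboard artifacts that
        # transposed convolutions tend to produce.
        up3 = F.interpolate(x3, scale_factor=2, mode="bilinear", align_corners=False)
        y2 = self.dec2(torch.cat([up3, x2], dim=1))
        up2 = F.interpolate(y2, scale_factor=2, mode="bilinear", align_corners=False)
        y1 = self.dec1(torch.cat([up2, x1], dim=1))
        # Each output has shape (B, kernel_size, H, W): one 1D kernel per pixel.
        k1_v, k1_h, k2_v, k2_h = (head(y1) for head in self.heads)
        return k1_v, k1_h, k2_v, k2_h

# Shape check on a small input (H and W must be divisible by 4 here).
net = SeparableKernelNet(kernel_size=51)
kernels = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print([k.shape for k in kernels])
```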
Training and Loss Functions
The network is trained on video data without the need for manually annotated labels. Two loss functions are explored: an L1 norm on per-pixel color differences, and a perceptual loss based on high-level features that improves visual quality. The perceptual loss is made practical by the full-frame synthesis that the separable kernels enable.
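A minimal sketch of the two objectives is shown below, assuming a standard PyTorch setup with a frozen pretrained VGG slice as the feature extractor, a common choice for perceptual losses; the exact feature layer and loss weighting used by the authors are not reproduced here, and input normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
import torchvision

# Feature extractor for the perceptual loss: an early slice of a pretrained
# VGG network, frozen so that only the interpolation network is trained.
vgg_features = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def color_loss(prediction, target):
    """L1 norm on per-pixel color differences."""
    return F.l1_loss(prediction, target)

def perceptual_loss(prediction, target):
    """Distance between high-level feature activations of the two images."""
    return F.mse_loss(vgg_features(prediction), vgg_features(target))

# Usage: compare an interpolated frame against the withheld ground-truth frame.
pred = torch.rand(1, 3, 128, 128, requires_grad=True)
gt = torch.rand(1, 3, 128, 128)
loss = color_loss(pred, gt)            # or perceptual_loss(pred, gt)
loss.backward()
```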
Results and Implications
Experiments show that the method achieves high-quality interpolation results, comparing favorably with state-of-the-art techniques both qualitatively and quantitatively. Notably, it handles high-resolution video frames efficiently: storing the per-pixel separable kernels for a 1080p frame requires only 1.27 GB of memory, whereas storing full 41×41 2D kernels, as in earlier adaptive-convolution methods, demands 26 GB.
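The gap follows directly from the per-pixel kernel sizes. The back-of-the-envelope calculation below (an illustration, not taken from the paper's implementation details) assumes 4-byte floats, two input frames, and one kernel per output pixel:

```python
# Kernel storage for one 1080p frame (1920 x 1080 pixels), assuming
# 4-byte floats and per-pixel kernels for each of the two input frames.
pixels = 1920 * 1080
bytes_per_value = 4
frames = 2
n = 41

full_2d = pixels * frames * n * n * bytes_per_value     # n * n values per pixel
separable = pixels * frames * 2 * n * bytes_per_value   # n + n values per pixel

print(f"full 2D kernels:   {full_2d / 2**30:.2f} GiB")    # ~26 GiB
print(f"separable kernels: {separable / 2**30:.2f} GiB")  # ~1.27 GiB
```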
Quantitative and Qualitative Evaluations
Quantitative evaluations on the Middlebury benchmark and several cross-validation datasets underscore the accuracy of the model, particularly in regions with discontinuous motion or little texture. Qualitative assessments show visually appealing results, attributed in particular to the perceptual loss, which better preserves high-frequency detail.
Future Perspectives
The method has practical implications wherever high-resolution video processing is needed, including video editing, anomaly detection in surveillance, and temporal super-resolution. Future research may explore interpolation at arbitrary temporal positions between input frames and multi-scale approaches that capture motion larger than the kernel size.
In summary, the adaptive separable convolution method presents a significant advance in video frame interpolation, addressing key challenges in the field by combining methodological simplicity with computational efficiency. These foundational ideas are likely to guide further exploration and applications in computer vision.