
Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation

Published 30 Nov 2017 in cs.CV (arXiv:1712.00080v2)

Abstract: Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. We use 1,132 video clips with 240-fps, containing 300K individual video frames, to train our network. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.

Citations (744)

Summary

  • The paper introduces a dual-stage CNN that leverages bidirectional optical flow and soft visibility maps to generate high-quality intermediate frames.
  • It employs U-Net based architectures for both flow computation and refinement, achieving state-of-the-art PSNR and SSIM on benchmarks like UCF101 and Middlebury.
  • The approach enhances slow-motion video production and dynamic scene analysis by effectively handling occlusions and motion boundaries.


The paper "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation" presents a novel approach to video frame interpolation that significantly advances the quality and applicability of creating slow-motion sequences from standard video footage. The research centers on a convolutional neural network (CNN) framework designed for variable-length multi-frame interpolation, integrating motion interpretation and occlusion reasoning into a unified model.

Summary and Key Contributions

The authors propose a two-stage end-to-end trainable CNN that consists of a flow computation network and a flow interpolation network:

  1. Flow Computation Network: This component employs a U-Net architecture to estimate bidirectional optical flow between two input frames. This step provides a robust initial approximation of the intermediate optical flow fields.
  2. Flow Interpolation Network: Another U-Net architecture refines these initial flow estimates and predicts soft visibility maps to mitigate artifacts, particularly around motion boundaries. These visibility maps are instrumental in dealing with occlusions, ensuring that pixels occluded in either of the input frames do not adversely affect the interpolated results.
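The linear-combination step between the two stages can be made concrete. The paper approximates the flows from an intermediate time t back to the two inputs as weighted sums of the bidirectional flows F_0→1 and F_1→0. A minimal NumPy sketch (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def approx_intermediate_flows(f01, f10, t):
    """Approximate the flows from intermediate time t back to the two
    input frames as linear combinations of the bidirectional flows
    F_0->1 (f01) and F_1->0 (f10), as in the paper's first stage."""
    ft0 = -(1.0 - t) * t * f01 + t * t * f10          # F_t->0
    ft1 = (1.0 - t) ** 2 * f01 - t * (1.0 - t) * f10  # F_t->1
    return ft0, ft1

# Toy 1x1 flow fields: a single pixel moving 2 px to the right.
f01 = np.array([[[2.0, 0.0]]])   # frame 0 -> frame 1
f10 = np.array([[[-2.0, 0.0]]])  # frame 1 -> frame 0
ft0, ft1 = approx_intermediate_flows(f01, f10, t=0.5)
# At the temporal midpoint the approximated flows point half a
# displacement back to each input: ft0 = [-1, 0], ft1 = [1, 0].
```

As the paper notes, this approximation holds in locally smooth regions but breaks down near motion boundaries, which is exactly what the second U-Net is trained to correct.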

Key technical strategies include:

  • Approximation of the intermediate flows as a time-dependent linear combination of the bidirectional optical flows between the two inputs.
  • Refinement of these approximate flows by a second CNN that also predicts soft visibility maps to handle occlusions effectively.
  • Fusion of the two warped input images, weighted by the visibility maps so that occluded pixels do not contribute to the output.
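The fusion step in the last bullet can be sketched directly. Given the two backward-warped inputs and their visibility maps, each output pixel is a normalized blend in which a pixel occluded in one frame (visibility near zero) is drawn entirely from the other. A minimal NumPy sketch, assuming the warping has already been applied (names are illustrative):

```python
import numpy as np

def fuse(warped0, warped1, v0, v1, t, eps=1e-8):
    """Visibility-weighted fusion of the two warped input frames.
    Pixels with v ~ 0 in one frame are excluded from the blend,
    avoiding ghosting artifacts from occluded regions."""
    w0 = (1.0 - t) * v0   # weight for the warped frame 0
    w1 = t * v1           # weight for the warped frame 1
    return (w0 * warped0 + w1 * warped1) / (w0 + w1 + eps)

# Toy example: a region visible in frame 0 but occluded in frame 1.
warped0 = np.full((2, 2, 3), 0.8)   # backward-warped I_0
warped1 = np.full((2, 2, 3), 0.2)   # backward-warped I_1
v0 = np.ones((2, 2, 1))             # fully visible in frame 0
v1 = np.zeros((2, 2, 1))            # fully occluded in frame 1
out = fuse(warped0, warped1, v0, v1, t=0.5)
# The output takes its values from warped0 alone (~0.8), regardless of t.
```

The normalization by the summed weights is what makes the visibility maps "soft": partially visible pixels contribute proportionally rather than being hard-masked.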

Experimental Setup and Results

The network was trained on a substantial dataset of 1,132 240-fps video clips (approximately 300K frames) sourced from YouTube and hand-held cameras, ensuring a diverse mix of scenes and motion types. Performance was rigorously tested across multiple independent datasets, including Middlebury, UCF101, Slowflow, and high-frame-rate Sintel sequences, achieving state-of-the-art results in most instances.

Some of the numerical results are highlighted below:

  • UCF101 dataset: Achieved a PSNR of 33.14 and an SSIM of 0.938, outperforming several existing approaches such as Phase-Based interpolation and SepConv.
  • Middlebury benchmark: The approach attained the best interpolation error scores on 6 out of 8 sequences.
  • Slowflow dataset: Demonstrated significant PSNR and SSIM improvements over baseline methods like FlowNet2 and Phase-Based interpolation.
  • High-frame-rate Sintel dataset: Generated 31 intermediate frames per input pair and substantially outperformed existing methods in PSNR and SSIM.

Theoretical and Practical Implications

The practical implications of this research are substantial:

  • The system provides a means to generate high-quality slow-motion video from regular footage, significantly enhancing applications in sports analysis, movie production, and scientific studies where precise motion analysis is imperative.
  • The model's ability to handle multiple frame interpolations in parallel is particularly relevant for video encoding and transmission efficiencies, where temporal resolution can be adaptively altered based on network conditions.
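The second bullet follows from the fact, stated in the abstract, that no learned parameter depends on t: the same trained model can be queried at any set of intermediate times. The sketch below illustrates the idea with a stand-in linear blend in place of the actual network (the helper name and blend are purely illustrative, not the paper's method):

```python
import numpy as np

def interpolate_many(i0, i1, n):
    """Sweep t over n evenly spaced intermediate times between two
    frames. In Super SloMo each t would be fed to the same trained,
    time-independent network; a linear blend stands in for it here."""
    ts = [(k + 1) / (n + 1) for k in range(n)]
    return [(1 - t) * i0 + t * i1 for t in ts]

# Three intermediate "frames" between scalar stand-ins 0.0 and 4.0.
frames = interpolate_many(0.0, 4.0, n=3)
# -> [1.0, 2.0, 3.0]
```

Because each t is independent of the others, all n intermediate frames can be computed in parallel, which is what makes adaptive temporal resolution practical for encoding and transmission.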

From a theoretical standpoint, this work:

  • Advances the understanding of jointly modeling motion interpretation and occlusion reasoning within CNN frameworks.
  • Demonstrates the feasibility and benefits of training models on large-scale, diverse video datasets to enhance generalization across various video interpolation tasks.
  • Proposes a novel usage of soft visibility maps in video interpolation tasks, opening new avenues for exploring occlusion handling in dynamic scenes.

Future Directions

Speculatively, future developments could explore:

  • Extending the framework to handle higher resolutions and frame rates using more advanced and efficient network architectures, potentially leveraging transformer-based models.
  • Investigating the applicability of this approach in other domains such as autonomous driving (for predicting and extrapolating potential hazards) or augmented reality systems (where ultra-smooth transitions are crucial).
  • Enhancing the model's robustness and generalization by incorporating additional modalities such as depth or time-of-flight data, to aid in more accurate motion and occlusion predictions.

In conclusion, the "Super SloMo" framework represents a significant advancement in video interpolation, delivering high-quality intermediate frame generation through sophisticated CNN-based approaches. Its robustness and superior performance across various datasets underscore its potential to become a mainstream solution in both commercial and research applications.
