XVFI: eXtreme Video Frame Interpolation (2103.16206v2)

Published 30 Mar 2021 in cs.CV

Abstract: In this paper, we firstly present a dataset (X4K1000FPS) of 4K videos of 1000 fps with the extreme motion to the research community for video frame interpolation (VFI), and propose an extreme VFI network, called XVFI-Net, that first handles the VFI for 4K videos with large motion. The XVFI-Net is based on a recursive multi-scale shared structure that consists of two cascaded modules for bidirectional optical flow learning between two input frames (BiOF-I) and for bidirectional optical flow learning from target to input frames (BiOF-T). The optical flows are stably approximated by a complementary flow reversal (CFR) proposed in BiOF-T module. During inference, the BiOF-I module can start at any scale of input while the BiOF-T module only operates at the original input scale so that the inference can be accelerated while maintaining highly accurate VFI performance. Extensive experimental results show that our XVFI-Net can successfully capture the essential information of objects with extremely large motions and complex textures while the state-of-the-art methods exhibit poor performance. Furthermore, our XVFI-Net framework also performs comparably on the previous lower resolution benchmark dataset, which shows a robustness of our algorithm as well. All source codes, pre-trained models, and proposed X4K1000FPS datasets are publicly available at https://github.com/JihyongOh/XVFI.

Summary

The paper introduces a novel dataset and network architecture designed to tackle extreme video frame interpolation challenges in high-resolution 4K videos.
The proposed XVFI-Net leverages a recursive multi-scale design with dedicated BiOF-I and BiOF-T modules to accurately estimate bidirectional optical flows.
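Experimental results show that XVFI-Net outperforms existing state-of-the-art methods on 4K content with extreme motion, achieving higher PSNR/SSIM and lower temporal optical flow (tOF) errors, while performing comparably on an earlier lower-resolution benchmark.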

XVFI: Extreme Video Frame Interpolation

Video frame interpolation (VFI) converts low frame rate (LFR) videos into high frame rate (HFR) outputs by synthesizing intermediate frames. This improves perceived video quality, particularly for content with fast motion, by producing smoother visuals and reducing motion judder. The paper "XVFI: eXtreme Video Frame Interpolation" addresses the challenges posed by 4K-resolution videos with large motion. The authors introduce both a new dataset and a dedicated network architecture for extreme VFI.

Dataset Contribution

The paper first introduces the X4K1000FPS dataset, which comprises 4K-resolution video sequences captured at 1000 frames per second. The dataset stands out for its focus on extreme motion and diverse scene dynamics, making it a valuable resource for developing and evaluating VFI models. It supports future research on high-resolution content with fast motion, a combination that earlier lower-resolution benchmarks emphasize less.
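To make the value of a 1000 fps capture concrete, the following is a minimal sketch of how a high-frame-rate clip could be subsampled into VFI samples: two input frames a fixed number of frames apart, with every frame in between serving as ground truth at a known time t in (0, 1). The function names and the stride value are illustrative assumptions, not the sampling protocol used by the authors.

```python
def effective_fps(capture_fps, stride):
    """Frame rate of the simulated low-frame-rate input (assumed helper)."""
    return capture_fps / stride


def make_samples(num_frames, stride):
    """Enumerate (i0, it, i1, t) tuples from a clip of `num_frames` frames.

    i0 and i1 are the indices of the two input frames, `stride` frames apart.
    `it` is the index of an intermediate ground-truth frame and
    t = (it - i0) / stride is its normalized time between the inputs.
    """
    samples = []
    for i0 in range(0, num_frames - stride, stride):
        i1 = i0 + stride
        for it in range(i0 + 1, i1):
            samples.append((i0, it, i1, (it - i0) / stride))
    return samples


if __name__ == "__main__":
    # Hypothetical stride of 32 on a 1000 fps clip.
    print(effective_fps(1000, 32))
    print(make_samples(9, 4))
```

A larger stride simulates a lower input frame rate and therefore larger motion between the two input frames, which is the regime the dataset is meant to stress.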

XVFI-Net Architecture

The core technical contribution of this work is the proposed XVFI-Net, a network designed with a recursive multi-scale shared structure. This design consists of two cascaded modules:

BiOF-I Module: Responsible for bidirectional optical flow estimation between two input frames, leveraging a pyramid-like approach to capture large motions across multiple scales.
BiOF-T Module: Refines the bidirectional flow estimation from the target frame to the input frames. The flows are stabilized by employing a complementary flow reversal (CFR) method, which efficiently addresses the alignment and hole-filling issues common in flow reversal techniques.
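The idea behind complementary flow reversal can be illustrated with a simplified NumPy sketch. It assumes linear motion and nearest-neighbour splatting, and it weights the two sources by temporal proximity; these choices, and all function names, are assumptions made for illustration rather than the authors' exact formulation. A pixel in frame 0 lands at `x + t*F01(x)` at time t and contributes the reversed flow `-t*F01(x)`; a pixel in frame 1 lands at `y + (1-t)*F10(y)` and contributes the complementary flow `t*F10(y)`. Combining both sources fills holes that either one leaves on its own.

```python
import numpy as np


def splat(flow_src, pos_scale, val_scale, weight):
    """Forward-splat scaled flow vectors to their rounded target positions.

    Returns the weighted sum of splatted values and the sum of weights.
    """
    h, w, _ = flow_src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + pos_scale * flow_src[..., 0]).astype(int)
    ty = np.rint(ys + pos_scale * flow_src[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    num = np.zeros((h, w, 2))
    den = np.zeros((h, w))
    # np.add.at accumulates correctly when several sources hit one target.
    np.add.at(num, (ty[ok], tx[ok]), weight * val_scale * flow_src[ok])
    np.add.at(den, (ty[ok], tx[ok]), weight)
    return num, den


def complementary_flow_reversal(f01, f10, t):
    """Approximate the flow from time t back to frame 0 (simplified sketch)."""
    num0, den0 = splat(f01, t, -t, 1.0 - t)    # reversed flow from frame 0
    num1, den1 = splat(f10, 1.0 - t, t, t)     # complementary flow from frame 1
    den = den0 + den1
    hole = den == 0                            # positions nothing mapped to
    ft0 = (num0 + num1) / np.where(hole, 1.0, den)[..., None]
    return ft0, hole


if __name__ == "__main__":
    f01 = np.zeros((4, 6, 2)); f01[..., 0] = 2.0    # uniform shift right
    f10 = np.zeros((4, 6, 2)); f10[..., 0] = -2.0   # consistent backward flow
    ft0, hole = complementary_flow_reversal(f01, f10, 0.5)
    print(ft0[0, :, 0], hole.any())
```

In this toy case, reversing only the frame-0 flow leaves the first column empty, because no frame-0 pixel moves into it; the contribution from frame 1 covers that column.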

During inference, the BiOF-I module can start at any scale of the input, while the BiOF-T module operates only at the original input scale. The starting scale can therefore be matched to the input resolution and motion magnitude, which accelerates inference while keeping interpolation accuracy high.
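A recursive coarse-to-fine loop with an adjustable starting scale can be sketched as follows. This is a generic illustration, not the XVFI-Net implementation: `refine` is a hypothetical placeholder for a shared flow-estimation module, average pooling stands in for the downsampling, and the image height and width are assumed to be divisible by `2**start_scale`.

```python
import numpy as np


def downsample(img, factor):
    """Average-pool an (H, W, C) image by an integer factor."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    img = img[:h2 * factor, :w2 * factor]
    return img.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))


def upsample_flow(flow):
    """Double the resolution of a flow field and double its vectors to match."""
    return np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1) * 2.0


def identity_refine(i0, i1, flow_init):
    """Placeholder for a learned module shared across scales (assumption)."""
    return flow_init


def coarse_to_fine(i0, i1, start_scale, refine=identity_refine):
    """Estimate flow from scale `start_scale` (coarsest) down to scale 0."""
    flow, visited = None, []
    for s in range(start_scale, -1, -1):
        a, b = downsample(i0, 2 ** s), downsample(i1, 2 ** s)
        if flow is None:
            flow = np.zeros(a.shape[:2] + (2,))   # no prior at the coarsest scale
        else:
            flow = upsample_flow(flow)            # carry the estimate to the finer scale
        flow = refine(a, b, flow)                 # same module reused at every scale
        visited.append(a.shape[:2])
    return flow, visited


if __name__ == "__main__":
    i0 = np.random.rand(32, 64, 3)
    i1 = np.random.rand(32, 64, 3)
    flow, visited = coarse_to_fine(i0, i1, start_scale=3)
    print(flow.shape, visited)
```

A larger `start_scale` lets the coarsest level see large displacements as small ones, at the cost of extra passes; a smaller value runs fewer passes when motion is modest.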

Experimental Evaluation

Extensive experiments show that XVFI-Net outperforms existing state-of-the-art (SOTA) VFI methods on challenging 4K content with extremely large motion and complex textures. It also performs comparably on an earlier lower-resolution benchmark, which indicates robustness across resolutions. On the 4K data, XVFI-Net interpolates frames with high structural fidelity and reduced temporal inconsistency, as evidenced by higher PSNR/SSIM and lower temporal optical flow (tOF) errors.
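For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below is a plain NumPy version that assumes images scaled to the range [0, 1]; it is not taken from the authors' evaluation code, and details such as colour space or cropping may differ there.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    target = np.zeros((8, 8, 3))
    pred = np.full((8, 8, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
    print(psnr(pred, target))
```

Higher PSNR means the interpolated frame is closer to the ground-truth frame in a per-pixel sense; it does not measure temporal consistency, which is why a flow-based metric such as tOF is reported alongside it.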

Implications and Future Scope

Practically, XVFI-Net has implications for applications including digital television, adaptive streaming, and novel view synthesis. Theoretically, the approach could lay the groundwork for future research on extreme visual dynamics in both resolution and motion. The dataset also invites exploration of other high-resolution VFI challenges.

As VFI research advances, models like XVFI-Net could inspire further work on deep learning architectures for complex, resource-intensive video processing tasks. The paper's two contributions, a public 4K 1000 fps dataset and an efficient network design, are positioned to support progress in real-world video applications.

GitHub

GitHub - JihyongOh/XVFI: [ICCV 2021, Oral 3%] Official repository of XVFI (299 stars)