Space-Time Video Super-Resolution Using a Unified One-Stage Framework
"Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution" addresses the intricate task of enhancing both frame rate and resolution of video sequences in a unified framework. This paper presents a novel approach that concurrently tackles temporal interpolation and spatial super-resolution as a single, integrated task, rather than the traditional separate two-stage process. This integration exploits the natural interdependence between temporal and spatial dimensions, resulting in both efficient and high-quality video outputs.
The authors propose a one-stage space-time video super-resolution (STVSR) network that synthesizes high-resolution, high-frame-rate video frames directly from low-resolution, low-frame-rate input sequences. The model circumvents the explicit interpolation of intermediate low-resolution frames by employing a feature-based temporal interpolation technique. As a result, it not only enhances video quality but does so with far greater computational efficiency than existing two-stage methods.
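To make the high-level data flow concrete, here is a minimal PyTorch sketch of the one-stage idea: low-frame-rate, low-resolution frames go in, and a single forward pass returns a high-frame-rate, high-resolution sequence, with intermediate frames synthesized in feature space rather than as explicit low-resolution images. The class and module names are illustrative placeholders, not the authors' code, and plain convolutions stand in for the deformable modules described under Methodology.

```python
# Sketch only: simple convs stand in for the deformable interpolation and
# deformable ConvLSTM modules of the actual paper.
import torch
import torch.nn as nn

class OneStageSTVSR(nn.Module):
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.extract = nn.Conv2d(3, channels, 3, padding=1)             # per-frame feature extractor
        self.interp = nn.Conv2d(2 * channels, channels, 3, padding=1)   # stand-in for feature interpolation
        self.aggregate = nn.Conv2d(channels, channels, 3, padding=1)    # stand-in for temporal aggregation
        self.reconstruct = nn.Sequential(                               # feature -> HR frame via sub-pixel conv
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr_frames):                      # lr_frames: (B, T, 3, H, W), low frame rate
        feats = [self.extract(f) for f in lr_frames.unbind(dim=1)]
        dense = []                                     # 2T - 1 features at the target frame rate
        for a, b in zip(feats[:-1], feats[1:]):
            dense += [a, self.interp(torch.cat([a, b], dim=1))]
        dense.append(feats[-1])
        hr = [self.reconstruct(self.aggregate(f)) for f in dense]
        return torch.stack(hr, dim=1)                  # (B, 2T - 1, 3, s*H, s*W)

# Example: 4 LR frames at 64x64 -> 7 HR frames at 256x256.
video = torch.randn(1, 4, 3, 64, 64)
print(OneStageSTVSR()(video).shape)                    # torch.Size([1, 7, 3, 256, 256])
```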
Methodology
The proposed model introduces a deformable feature temporal interpolation (DFTI) network, which uses deformable convolutions to capture local temporal contexts between frames and thereby handle large, complex motions. Because intermediate low-resolution frames are never synthesized pixel by pixel, the approach also reduces computational complexity. The key innovation of the DFTI is that it interpolates features rather than raw pixels, yielding local, context-aware representations of the missing frames.
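A hedged sketch of what such feature-space temporal interpolation can look like, using torchvision's deformable convolution: offsets for sampling each neighboring feature map are predicted from both features, the two deformably sampled features are blended into the intermediate-frame feature, and no intermediate low-resolution frame is ever rendered. Module and variable names here are our own placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureTemporalInterp(nn.Module):
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        offset_ch = 2 * kernel_size * kernel_size                       # (dx, dy) per kernel location
        self.offset_fwd = nn.Conv2d(2 * channels, offset_ch, 3, padding=1)
        self.offset_bwd = nn.Conv2d(2 * channels, offset_ch, 3, padding=1)
        self.sample_fwd = DeformConv2d(channels, channels, kernel_size, padding=1)
        self.sample_bwd = DeformConv2d(channels, channels, kernel_size, padding=1)
        self.blend = nn.Conv2d(2 * channels, channels, 1)                # learned blending of the two samples

    def forward(self, feat_prev, feat_next):                             # features of frames t-1 and t+1
        both = torch.cat([feat_prev, feat_next], dim=1)
        f_from_prev = self.sample_fwd(feat_prev, self.offset_fwd(both))  # sample t-1 toward the missing frame
        f_from_next = self.sample_bwd(feat_next, self.offset_bwd(both))  # sample t+1 toward the missing frame
        return self.blend(torch.cat([f_from_prev, f_from_next], dim=1))  # feature for missing frame t
```

In the full model, an interpolated feature of this kind is inserted between every pair of input-frame features before temporal aggregation, so the downstream reconstruction sees a dense, high-frame-rate feature sequence.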
Additionally, the work incorporates a novel deformable ConvLSTM that processes the feature sequence bidirectionally for robust temporal aggregation. This structure simultaneously aligns and integrates temporal information, enhancing the model's capacity to exploit global temporal contexts. The deformable ConvLSTM thus acts as a powerful mechanism for aggregating information across frames even in the presence of complex motions and scene dynamics.
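The sketch below illustrates the idea of a ConvLSTM cell whose hidden and cell states are first aligned to the current feature via deformable convolution before the gates are applied. It is a simplified, assumption-laden stand-in for the paper's deformable ConvLSTM, with names of our own choosing; bidirectionality can be obtained by running the cell over the sequence forward and backward and fusing the two hidden states at each time step.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvLSTMCell(nn.Module):
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        offset_ch = 2 * kernel_size * kernel_size
        self.offset = nn.Conv2d(2 * channels, offset_ch, 3, padding=1)    # offsets from (prev hidden, current feature)
        self.align_h = DeformConv2d(channels, channels, kernel_size, padding=1)
        self.align_c = DeformConv2d(channels, channels, kernel_size, padding=1)
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)  # i, f, o, g gates in one conv

    def forward(self, x, h, c):                         # x: current feature, (h, c): previous state
        offsets = self.offset(torch.cat([h, x], dim=1))
        h = self.align_h(h, offsets)                    # align hidden state to the current frame's motion
        c = self.align_c(c, offsets)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Unidirectional pass over a feature sequence feats of shape (B, T, C, H, W):
#   h = c = torch.zeros_like(feats[:, 0]); outputs = []
#   for t in range(feats.size(1)):
#       h, c = cell(feats[:, t], h, c); outputs.append(h)
```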
Results
Extensive evaluations on standard benchmarks show that the proposed one-stage framework outperforms leading two-stage approaches both qualitatively and quantitatively. The network achieves significant gains in PSNR and SSIM across datasets, including Vid4 and Vimeo-90K, with the most pronounced improvements in sequences involving fast motion. The authors report that their method is more than three times faster than its two-stage counterparts and has a model roughly four times smaller than the combination of DAIN and EDVR, setting a new standard for STVSR efficiency.
Implications and Future Directions
This research holds significant implications for applications in content creation, such as film and high-definition television production, where high-quality, slow-motion content is increasingly in demand. By addressing the space-time resolution challenge in a unified manner, this work paves the way for more capable and streamlined video processing systems.
Future research could extend this unified framework to accommodate varying resolution and frame rate requirements dynamically, perhaps incorporating adaptive strategies based on scene content or employing neural architecture search techniques to further optimize performance across varying video contexts. The broader scope of applications, such as real-time video enhancement and augmented reality, also presents promising avenues for future exploration.
In conclusion, this paper presents a compelling advancement in the field of video super-resolution, leveraging innovative network architectures and algorithmic strategies to efficiently and effectively upscale video content in both temporal and spatial domains.