Recurrent Video Restoration Transformer with Guided Deformable Attention (2206.02146v3)

Published 5 Jun 2022 in cs.CV and eess.IV

Abstract: Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, which would result in different merits and drawbacks. Typically, the former has the advantage of temporal information fusion. However, it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames; however, it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework which can achieve a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, the guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.

Citations (121)

Summary

  • The paper introduces RVRT, a framework that integrates parallel processing of local frames with a recurrent architecture for efficient video restoration.
  • It employs a guided deformable attention mechanism to dynamically align features across clips, ensuring accurate temporal fusion.
  • RVRT achieves state-of-the-art performance in super-resolution, deblurring, and denoising by balancing model size, efficiency, and restoration quality.

Recurrent Video Restoration Transformer with Guided Deformable Attention

The paper introduces a novel framework, the Recurrent Video Restoration Transformer (RVRT), aimed at enhancing video restoration tasks such as super-resolution, deblurring, and denoising. The proposed RVRT efficiently leverages the strengths of both parallel and recurrent processing methods.

Key Contributions and Methodology

The RVRT framework addresses several challenges inherent in video restoration. Traditional techniques either process all frames in parallel, benefiting from extensive temporal fusion at the cost of high memory usage, or process frames in a recurrent, frame-by-frame manner, which is parameter-efficient but struggles with long-range dependencies and cannot be parallelized. The RVRT balances these approaches by processing local neighboring frames in parallel within a globally recurrent architecture, achieving a better trade-off among model size, efficiency, and effectiveness.

The RVRT divides the video into shorter clips. Each clip is refined sequentially based on previously inferred clip features. Within a clip, the model leverages implicit feature aggregation to jointly update frame features. Across different clips, the Guided Deformable Attention (GDA) mechanism ensures accurate clip-to-clip alignment by dynamically aggregating features from relevant locations.
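
As a rough illustration, here is a minimal PyTorch sketch of the clip-wise recurrence described above, not the authors' implementation: `ClipRecurrentBackbone` is a hypothetical name, the additive propagation stands in for GDA-based alignment, and a single 3D convolution stands in for the paper's joint intra-clip update.

```python
# Minimal sketch of RVRT-style clip-wise recurrence (hypothetical, simplified).
import torch
import torch.nn as nn

class ClipRecurrentBackbone(nn.Module):
    def __init__(self, channels: int = 64, clip_len: int = 2):
        super().__init__()
        self.clip_len = clip_len
        # Placeholder for the joint (parallel) update of all frames in a clip.
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) frame features for the whole video.
        clips = feats.split(self.clip_len, dim=1)   # divide video into short clips
        prev = torch.zeros_like(clips[0])           # "previously inferred clip feature"
        out = []
        for clip in clips:
            # Propagate the previous clip's feature into the current clip.
            # (The paper aligns it with guided deformable attention; simple
            # addition is used here as a stand-in for that alignment step.)
            fused = clip + prev[:, : clip.shape[1]]
            # Jointly update all frames of the clip in parallel.
            fused = self.refine(fused.transpose(1, 2)).transpose(1, 2)
            prev = fused
            out.append(fused)
        return torch.cat(out, dim=1)

# Usage: 8 frames with clips of 2 -> 4 recurrent steps over jointly-updated clips.
x = torch.randn(1, 8, 64, 32, 32)
y = ClipRecurrentBackbone()(x)
print(y.shape)  # torch.Size([1, 8, 64, 32, 32])
```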

The introduction of GDA is noteworthy for its alignment capability: unlike frame-to-frame alignment methods, GDA predicts multiple relevant sampling locations across the whole inferred clip under optical-flow guidance, then weights the features sampled at those locations with an attention mechanism, which yields both flexibility and accuracy in aggregating temporal information.
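
Below is a simplified, hypothetical PyTorch sketch of flow-guided deformable attention, assuming a precomputed optical flow and single-frame query/reference features; the offset head, projections, and point count are illustrative stand-ins, not the paper's exact GDA design.

```python
# Simplified sketch of flow-guided deformable attention (hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedDeformableAttention(nn.Module):
    def __init__(self, channels: int = 64, num_points: int = 9):
        super().__init__()
        self.num_points = num_points
        # Predict residual offsets around the flow-guided location.
        self.offset_head = nn.Conv2d(channels * 2, 2 * num_points, 3, padding=1)
        self.q_proj = nn.Conv2d(channels, channels, 1)
        self.k_proj = nn.Conv2d(channels, channels, 1)

    def forward(self, query, reference, flow):
        # query, reference: (B, C, H, W); flow: (B, 2, H, W), in pixels.
        b, c, h, w = query.shape
        # Base sampling grid, shifted by the optical flow (the "guidance").
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
        # Residual offsets conditioned on the query and the flow-warped reference.
        warped = self._sample(reference, base)
        offsets = self.offset_head(torch.cat((query, warped), dim=1))
        offsets = offsets.view(b, self.num_points, 2, h, w)

        q = self.q_proj(query)
        vals, logits = [], []
        for p in range(self.num_points):
            v = self._sample(reference, base + offsets[:, p])  # candidate location
            k = self.k_proj(v)
            vals.append(v)
            logits.append((q * k).sum(dim=1, keepdim=True) / c ** 0.5)
        # Attention weights over the sampled points, then weighted aggregation.
        attn = torch.softmax(torch.stack(logits, dim=1), dim=1)
        return (torch.stack(vals, dim=1) * attn).sum(dim=1)

    @staticmethod
    def _sample(feat, coords):
        # Bilinear sampling at absolute pixel coordinates; coords is (B, 2, H, W).
        _, _, h, w = feat.shape
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0  # normalize x to [-1, 1]
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0  # normalize y to [-1, 1]
        grid = torch.stack((gx, gy), dim=-1)           # (B, H, W, 2)
        return F.grid_sample(feat, grid, align_corners=True)

# Usage: align a reference frame's features to the current frame.
q = torch.randn(1, 64, 32, 32)
r = torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 2, 32, 32)
print(GuidedDeformableAttention()(q, r, flow).shape)  # torch.Size([1, 64, 32, 32])
```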

Experimental Results

The RVRT demonstrates state-of-the-art performance across several benchmark datasets for video super-resolution, deblurring, and denoising, with consistent gains in PSNR and SSIM (a minimal PSNR sketch follows the list below). Notably:

  • In video super-resolution tasks, RVRT outperforms both well-established recurrent models such as BasicVSR++ and transformer-based models such as VRT.
  • For video deblurring, significant improvements are observed on datasets like DVD and GoPro.
  • In video denoising, RVRT shows substantial performance gains, especially under higher noise levels.
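
For reference, PSNR, the primary fidelity metric quoted in these benchmarks, is the standard log-scaled inverse of mean squared error; the snippet below is the conventional definition, not code from the paper.

```python
# Standard PSNR between a restored frame and its ground truth, for images in [0, 1].
import torch

def psnr(restored: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    mse = torch.mean((restored - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

# Usage: identical frames give infinite PSNR; noisier ones score lower.
gt = torch.rand(3, 64, 64)
noisy = (gt + 0.05 * torch.randn_like(gt)).clamp(0, 1)
print(f"PSNR: {psnr(noisy, gt):.2f} dB")  # roughly 26 dB for noise sigma = 0.05
```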

These results indicate a compelling balance between high-quality restoration, model size, and computational efficiency.

Implications and Future Directions

The RVRT model offers practical implications for video processing applications requiring high clarity and detail, such as live streaming and old film restoration. In addition, the theoretical contributions of integrating transformer-based attention mechanisms into recurrent frameworks pave the way for further advancements in sequence modeling.

Future developments could explore improvements in the efficiency of optical flow operations to reduce computational overhead further. Additionally, exploring the potential of RVRT in real-time video analytics could open new avenues for its deployment in video surveillance and other latency-sensitive applications.

Conclusion

The RVRT effectively integrates the complementary strengths of existing video restoration methodologies, marking a substantial step forward in video restoration. By employing advanced attention mechanisms and efficient model architectures, it offers a promising solution to the longstanding challenges in restoring high-quality video content.