Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention (2401.06312v4)

Published 12 Jan 2024 in cs.CV

Abstract: Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.

References (41)
  1. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4778–4787, 2017.
  2. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
  3. BasicVSR: The search for essential components in video super-resolution and beyond. arXiv preprint arXiv:2012.02181, 2020.
  4. BasicVSR: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4947–4956, 2021.
  5. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5972–5981, 2022.
  6. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, pages 168–172. IEEE, 1994.
  7. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
  8. Efficient video super-resolution through recurrent latent space propagation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3476–3485. IEEE, 2019.
  9. Skip-convolutions for efficient video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2695–2704, 2021.
  10. Delta distillation for efficient video processing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 213–229. Springer, 2022.
  11. Temporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8827, 2020.
  12. Video super-resolution with recurrent structure-detail network. arXiv preprint arXiv:2008.00455, 2020.
  13. Video super-resolution with temporal group attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  14. Look back and forth: Video super-resolution with explicit temporal difference modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17411–17420, 2022.
  15. Accel: A corrective fusion network for efficient semantic segmentation on video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8866–8875, 2019.
  16. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  17. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3224–3232, 2018.
  18. MuCAN: Multi-correspondence aggregation network for video super-resolution. arXiv preprint arXiv:2007.11803, 2020.
  19. Low-latency video semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5997–6005, 2018.
  20. VRT: A video restoration transformer. arXiv preprint arXiv:2201.12288, 2022.
  21. Recurrent video restoration transformer with guided deformable attention. Advances in Neural Information Processing Systems, 35:378–393, 2022.
  22. Accelerating the training of video super-resolution models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1595–1603, 2023.
  23. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):346–360, 2013.
  24. Learning trajectory-aware transformer for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5687–5696, 2022.
  25. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2507–2515, 2017.
  26. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  27. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  28. Dynamic kernel distillation for efficient pose estimation in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6942–6950, 2019.
  29. Learning spatiotemporal frequency-transformer for compressed video super-resolution. In European Conference on Computer Vision, pages 257–273. Springer, 2022.
  30. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6626–6634, 2018.
  31. Rethinking alignment in video super-resolution transformers. arXiv preprint arXiv:2207.08494, 2022.
  32. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  33. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4472–4480, 2017.
  34. TDAN: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3360–3369, 2020.
  35. EDVR: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
  36. Residual sparsity connection learning for efficient video super-resolution. arXiv preprint arXiv:2206.07687, 2022.
  37. Space-time distillation for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2113–2122, 2021.
  38. An implicit alignment for video super-resolution. arXiv preprint arXiv:2305.00163, 2023.
  39. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
  40. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.
  41. Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7210–7218, 2018.
Authors (6)
  1. Xingyu Zhou (82 papers)
  2. Leheng Zhang (10 papers)
  3. Xiaorui Zhao (5 papers)
  4. Keze Wang (46 papers)
  5. Leida Li (26 papers)
  6. Shuhang Gu (56 papers)

Summary

  • The paper presents a transformer architecture that uses masked inter- and intra-frame attention to eliminate redundant computation.
  • It exploits the temporal continuity between adjacent frames through an intra- and inter-frame attention block and an adaptive block-wise mask prediction module, reducing both computation and memory usage.
  • Evaluations on the REDS, Vimeo90K, and Vid4 datasets show PSNR comparable or superior to leading VSR methods at substantially lower computational cost.

Video Super-Resolution Transformer with Masked Inter and Intra-Frame Attention

The paper presents a novel approach to video super-resolution (VSR) through a transformer-based framework that employs masked inter and intra-frame attention mechanisms, termed MIA-VSR. This method seeks to reduce computational cost and memory usage while maintaining state-of-the-art accuracy, thus addressing significant challenges in deploying transformer-based VSR models on devices with constrained resources.

The core of MIA-VSR is to exploit the temporal continuity between adjacent video frames to cut redundant computation. This efficiency comes from an intra-frame and inter-frame attention block (IIAB), in which previously enhanced features serve only as supplementary information for the current frame rather than being processed jointly with it, substantially reducing the cost of self-attention, as illustrated in the sketch below. In addition, an adaptive block-wise mask prediction module skips computation in regions whose features change little between frames, further reducing resource usage.
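To make the attention asymmetry concrete, here is a minimal PyTorch sketch of an IIAB-style block; it is an illustration under stated assumptions, not the authors' implementation. Queries are formed only from current-frame tokens, while previously enhanced features are detached and contribute only keys and values; the window size, normalization placement, and the name `IIABSketch` are illustrative choices.

```python
import torch
import torch.nn as nn

class IIABSketch(nn.Module):
    """Illustrative intra- & inter-frame attention block (not the official code).

    Only current-frame tokens act as queries, so attention outputs are
    produced only for the current frame; past enhanced features are detached
    and serve purely as supplementary keys/values.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, curr: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # curr, past: (B, N, C) token sequences from the current frame and the
        # previously enhanced features of the preceding frame.
        kv = torch.cat([past.detach(), curr], dim=1)  # no gradient into the past
        out, _ = self.attn(self.norm(curr), kv, kv, need_weights=False)
        return curr + out  # residual connection

# Smoke test with assumed shapes: a 16x16 window of 64-dimensional tokens.
block = IIABSketch(dim=64)
curr = torch.randn(1, 256, 64)
past = torch.randn(1, 256, 64)
print(block(curr, past).shape)  # torch.Size([1, 256, 64])
```

Because only current-frame tokens issue queries, the attention map is N x 2N rather than the 2N x 2N map of joint processing, which is where the savings come from.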

Extensive experiments and ablation studies underline the effectiveness of MIA-VSR. Compared with prominent VSR methods such as EDVR, BasicVSR++, and RVRT, MIA-VSR achieves comparable or superior PSNR with a significant reduction in FLOPs, maintaining high-resolution accuracy while using fewer computational resources. The empirical evaluations demonstrate high-quality results on the REDS, Vimeo90K, and Vid4 datasets at greater efficiency than state-of-the-art models.

The practical implications of the MIA-VSR model are considerable, especially in applications requiring real-time processing on edge devices where computational power and memory are limited. Video streaming services, surveillance systems, and multimedia applications could greatly benefit from the efficient high-fidelity output facilitated by this model.

Theoretically, MIA-VSR’s contribution lies in its adaptation of the transformer architecture to exploit temporal redundancy efficiently through selective attention and masking strategies, as sketched below. This approach opens pathways for further research into lightweight transformer adaptations for video analysis tasks beyond super-resolution.
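As a rough illustration of such a masking strategy (a sketch, not the paper’s learned module), the snippet below derives a block-wise keep/skip mask from the cosine similarity of adjacent frames’ features and reuses the previous output wherever a block is skipped. The similarity measure, block size of 8, and threshold `tau` are assumptions standing in for the paper’s trained mask predictor (which the paper optimizes with Gumbel-Softmax reparameterization), and a real implementation would skip the masked computation outright rather than compute and discard it.

```python
import torch
import torch.nn.functional as F

def blockwise_skip_mask(curr_feat: torch.Tensor, prev_feat: torch.Tensor,
                        block: int = 8, tau: float = 0.9) -> torch.Tensor:
    """Hypothetical block-wise mask: 1 = recompute a block, 0 = reuse past output.

    curr_feat, prev_feat: (B, C, H, W) feature maps from adjacent frames,
    with H and W assumed divisible by `block`.
    """
    _, _, H, W = curr_feat.shape
    # Per-pixel cosine similarity across channels, averaged within each block.
    sim = F.cosine_similarity(curr_feat, prev_feat, dim=1).unsqueeze(1)  # (B,1,H,W)
    sim = F.avg_pool2d(sim, kernel_size=block)                           # (B,1,H/b,W/b)
    keep = (sim < tau).float()                  # recompute only blocks that changed
    return F.interpolate(keep, size=(H, W), mode="nearest")              # (B,1,H,W)

def masked_update(curr_out: torch.Tensor, prev_out: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """Keep fresh outputs where mask = 1; reuse the previous frame's enhanced
    output where mask = 0 (computed densely here only for clarity)."""
    return mask * curr_out + (1.0 - mask) * prev_out
```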

Future developments could explore integrating more sophisticated masking and attention mechanisms or expanding MIA-VSR’s applications to other temporal pattern recognition tasks, such as video compression and scene recognition, where computational efficiency remains crucial.

Overall, this work represents a meaningful advance in the field of video super-resolution, providing a template for balancing accuracy and efficiency in transformer models applied to real-world scenarios.
