Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention (2401.06312v4)
Abstract: Recently, Vision Transformers have achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite their superior VSR accuracy, the heavy computational burden and large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address this issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra- and Inter-frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block that takes the respective roles of past features and input features into consideration and exploits previously enhanced features only to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to the feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves memory and computation efficiency over state-of-the-art methods without sacrificing PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
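The block-wise skipping idea in the abstract can be illustrated with a toy sketch. This is not the authors' module: the abstract says only that the mask depends on feature similarity between adjacent frames, so the fixed threshold on the mean absolute difference used here is an assumption that stands in for the adaptive predictor. The names `block_mask`, `masked_update`, `block`, `threshold`, and `process` are hypothetical, and the features are plain 2D lists rather than tensors.

```python
from typing import Callable, List

Grid = List[List[float]]


def block_mask(prev: Grid, curr: Grid, block: int, threshold: float) -> List[List[bool]]:
    """Return one flag per block; True means the block changed enough to recompute.

    Assumption: a fixed threshold on the mean absolute difference stands in
    for the adaptive mask predictor mentioned in the abstract.
    """
    rows, cols = len(curr), len(curr[0])
    mask = []
    for r0 in range(0, rows, block):
        mask_row = []
        for c0 in range(0, cols, block):
            # Collect the cells of this block, clipping at the grid border.
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            mean_abs_diff = sum(abs(curr[r][c] - prev[r][c]) for r, c in cells) / len(cells)
            mask_row.append(mean_abs_diff > threshold)
        mask.append(mask_row)
    return mask


def masked_update(prev_out: Grid, curr: Grid, mask: List[List[bool]],
                  block: int, process: Callable[[float], float]) -> Grid:
    """Recompute flagged blocks from the current features; reuse prev_out elsewhere."""
    out = [row[:] for row in prev_out]  # start from the previous frame's result
    for r in range(len(curr)):
        for c in range(len(curr[0])):
            if mask[r // block][c // block]:
                out[r][c] = process(curr[r][c])
    return out


# Toy usage: only the top-left 2x2 block differs between the two frames.
prev_feat = [[0.0] * 4 for _ in range(4)]
curr_feat = [row[:] for row in prev_feat]
curr_feat[0][0] = 1.0
prev_result = [[10.0] * 4 for _ in range(4)]

mask = block_mask(prev_feat, curr_feat, block=2, threshold=0.1)
result = masked_update(prev_result, curr_feat, mask, block=2, process=lambda x: 2.0 * x)
```

In this toy case only one of the four blocks is flagged, so `process` runs on 4 of the 16 positions and the remaining 12 are copied from the previous result. The saving in a real model would depend on how many blocks the mask actually skips.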
Authors: Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu