Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention (2401.06312v4)

Published 12 Jan 2024 in cs.CV

Abstract: Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.

References (41)
  1. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4778–4787, 2017.
  2. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
  3. BasicVSR: The search for essential components in video super-resolution and beyond. arXiv preprint arXiv:2012.02181, 2020.
  4. BasicVSR: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4947–4956, 2021.
  5. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5972–5981, 2022.
  6. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, pages 168–172. IEEE, 1994.
  7. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
  8. Efficient video super-resolution through recurrent latent space propagation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3476–3485. IEEE, 2019.
  9. Skip-convolutions for efficient video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2695–2704, 2021.
  10. Delta distillation for efficient video processing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 213–229. Springer, 2022.
  11. Temporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8827, 2020.
  12. Video super-resolution with recurrent structure-detail network. arXiv preprint arXiv:2008.00455, 2020.
  13. Video super-resolution with temporal group attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  14. Look back and forth: Video super-resolution with explicit temporal difference modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17411–17420, 2022.
  15. Accel: A corrective fusion network for efficient semantic segmentation on video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8866–8875, 2019.
  16. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  17. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3224–3232, 2018.
  18. MuCAN: Multi-correspondence aggregation network for video super-resolution. arXiv preprint arXiv:2007.11803, 2020.
  19. Low-latency video semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5997–6005, 2018.
  20. VRT: A video restoration transformer. arXiv preprint arXiv:2201.12288, 2022.
  21. Recurrent video restoration transformer with guided deformable attention. Advances in Neural Information Processing Systems, 35:378–393, 2022.
  22. Accelerating the training of video super-resolution models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1595–1603, 2023.
  23. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):346–360, 2013.
  24. Learning trajectory-aware transformer for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5687–5696, 2022.
  25. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2507–2515, 2017.
  26. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  27. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  28. Dynamic kernel distillation for efficient pose estimation in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6942–6950, 2019.
  29. Learning spatiotemporal frequency-transformer for compressed video super-resolution. In European Conference on Computer Vision, pages 257–273. Springer, 2022.
  30. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6626–6634, 2018.
  31. Rethinking alignment in video super-resolution transformers. arXiv preprint arXiv:2207.08494, 2022.
  32. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  33. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4472–4480, 2017.
  34. TDAN: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3360–3369, 2020.
  35. EDVR: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
  36. Residual sparsity connection learning for efficient video super-resolution. arXiv preprint arXiv:2206.07687, 2022.
  37. Space-time distillation for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2113–2122, 2021.
  38. An implicit alignment for video super-resolution. arXiv preprint arXiv:2305.00163, 2023.
  39. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
  40. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.
  41. Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7210–7218, 2018.
Authors (6)
  1. Xingyu Zhou (82 papers)
  2. Leheng Zhang (10 papers)
  3. Xiaorui Zhao (5 papers)
  4. Keze Wang (46 papers)
  5. Leida Li (26 papers)
  6. Shuhang Gu (56 papers)

Summary

  • The paper presents a transformer architecture that uses masked inter- and intra-frame attention to eliminate redundant computation.
  • It exploits the temporal continuity between adjacent frames through an intra- and inter-frame attention block and an adaptive block-wise mask prediction module, reducing both computation and memory usage.
  • Evaluations on the REDS, Vimeo90K, and Vid4 datasets show PSNR comparable or superior to leading VSR methods at substantially lower computational cost.

Video Super-Resolution Transformer with Masked Inter and Intra-Frame Attention

The paper presents a novel approach to video super-resolution (VSR) through a transformer-based framework that employs masked inter and intra-frame attention mechanisms, termed MIA-VSR. This method seeks to reduce computational cost and memory usage while maintaining state-of-the-art accuracy, thus addressing significant challenges in deploying transformer-based VSR models on devices with constrained resources.

The core of MIA-VSR is to exploit the temporal continuity between adjacent video frames to cut redundant computation. This efficiency comes from an intra-frame and inter-frame attention block (IIAB), in which previously enhanced features serve only as supplementary information for the current frame rather than being processed jointly with it, substantially reducing the cost of self-attention, as illustrated in the sketch below. In addition, an adaptive block-wise mask prediction module skips computation in regions whose features change little between frames, further reducing resource usage.
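To make the attention asymmetry concrete, here is a minimal PyTorch sketch of an IIAB-style block; it is an illustration under stated assumptions, not the authors' implementation. Queries are formed only from current-frame tokens, while previously enhanced features are detached and contribute only keys and values; the window size, normalization placement, and the name `IIABSketch` are illustrative choices.

```python
import torch
import torch.nn as nn

class IIABSketch(nn.Module):
    """Illustrative intra- & inter-frame attention block (not the official code).

    Only current-frame tokens act as queries, so attention outputs are
    produced only for the current frame; past enhanced features are detached
    and serve purely as supplementary keys/values.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, curr: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # curr, past: (B, N, C) token sequences from the current frame and the
        # previously enhanced features of the preceding frame.
        kv = torch.cat([past.detach(), curr], dim=1)  # no gradient into the past
        out, _ = self.attn(self.norm(curr), kv, kv, need_weights=False)
        return curr + out  # residual connection

# Smoke test with assumed shapes: a 16x16 window of 64-dimensional tokens.
block = IIABSketch(dim=64)
curr = torch.randn(1, 256, 64)
past = torch.randn(1, 256, 64)
print(block(curr, past).shape)  # torch.Size([1, 256, 64])
```

Because only current-frame tokens issue queries, the attention map is N x 2N rather than the 2N x 2N map of joint processing, which is where the savings come from.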

Extensive experiments and ablation studies underline the effectiveness of MIA-VSR. Compared with prominent VSR methods such as EDVR, BasicVSR++, and RVRT, MIA-VSR achieves comparable or superior PSNR with a significant reduction in FLOPs, maintaining high-resolution accuracy while using fewer computational resources. The empirical evaluations demonstrate high-quality results on the REDS, Vimeo90K, and Vid4 datasets at greater efficiency than state-of-the-art models.

The practical implications of the MIA-VSR model are considerable, especially in applications requiring real-time processing on edge devices where computational power and memory are limited. Video streaming services, surveillance systems, and multimedia applications could greatly benefit from the efficient high-fidelity output facilitated by this model.

Theoretically, MIA-VSR’s contribution lies in its adaptation of the transformer architecture to exploit temporal redundancy efficiently through selective attention and masking strategies, as sketched below. This approach opens pathways for further research into lightweight transformer adaptations for video analysis tasks beyond super-resolution.
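As a rough illustration of such a masking strategy (a sketch, not the paper’s learned module), the snippet below derives a block-wise keep/skip mask from the cosine similarity of adjacent frames’ features and reuses the previous output wherever a block is skipped. The similarity measure, block size of 8, and threshold `tau` are assumptions standing in for the paper’s trained mask predictor (which the paper optimizes with Gumbel-Softmax reparameterization), and a real implementation would skip the masked computation outright rather than compute and discard it.

```python
import torch
import torch.nn.functional as F

def blockwise_skip_mask(curr_feat: torch.Tensor, prev_feat: torch.Tensor,
                        block: int = 8, tau: float = 0.9) -> torch.Tensor:
    """Hypothetical block-wise mask: 1 = recompute a block, 0 = reuse past output.

    curr_feat, prev_feat: (B, C, H, W) feature maps from adjacent frames,
    with H and W assumed divisible by `block`.
    """
    _, _, H, W = curr_feat.shape
    # Per-pixel cosine similarity across channels, averaged within each block.
    sim = F.cosine_similarity(curr_feat, prev_feat, dim=1).unsqueeze(1)  # (B,1,H,W)
    sim = F.avg_pool2d(sim, kernel_size=block)                           # (B,1,H/b,W/b)
    keep = (sim < tau).float()                  # recompute only blocks that changed
    return F.interpolate(keep, size=(H, W), mode="nearest")              # (B,1,H,W)

def masked_update(curr_out: torch.Tensor, prev_out: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """Keep fresh outputs where mask = 1; reuse the previous frame's enhanced
    output where mask = 0 (computed densely here only for clarity)."""
    return mask * curr_out + (1.0 - mask) * prev_out
```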

Future developments could explore integrating more sophisticated masking and attention mechanisms or expanding MIA-VSR’s applications to other temporal pattern recognition tasks, such as video compression and scene recognition, where computational efficiency remains crucial.

Overall, this work represents a meaningful advance in the field of video super-resolution, providing a template for balancing accuracy and efficiency in transformer models applied to real-world scenarios.
