
Video Super-Resolution Transformer (2106.06847v3)

Published 12 Jun 2021 in cs.CV

Abstract: Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.

References (36)
  1. Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  2. Adversarial learning with local coordinate coding. In International Conference on Machine Learning, 2018.
  3. Multi-marginal Wasserstein GAN. In Advances in Neural Information Processing Systems, 2019.
  4. BasicVSR: The search for essential components in video super-resolution and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  5. Pre-trained image processing transformer. In Advances in Neural Information Processing Systems, 2021.
  6. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  7. Deformable convolutional networks. In IEEE International Conference on Computer Vision, 2017.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  9. Closed-loop matters: Dual regression networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  10. Recurrent back-projection network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  11. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  12. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, 2020.
  13. Video super-resolution with temporal group attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  14. Revisiting temporal modeling for video super-resolution. In British Machine Vision Conference, 2020.
  15. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  16. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  17. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
  18. C. Liu and D. Sun. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 2013.
  19. I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  20. E. Malach and S. Shalev-Shwartz. Computational separation between convolutional and fully-connected networks. In International Conference on Learning Representations, 2021.
  21. COLA-Net: Collaborative attention network for image restoration. IEEE Transactions on Multimedia, 2021.
  22. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  23. A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  24. Frame-recurrent video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  25. Detail-revealing deep video super-resolution. In IEEE International Conference on Computer Vision, 2017.
  26. TDAN: Temporally-deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  27. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  28. Deformable non-local network for video super-resolution. IEEE Access, 2019.
  29. EDVR: Video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  30. BasicSR. https://github.com/xinntao/BasicSR, 2020.
  31. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
  32. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 2004.
  33. Adversarial sparse transformer for time series forecasting. In Advances in Neural Information Processing Systems, 2020.
  34. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8), 2019.
  35. Learning texture transformer network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  36. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, 2018.
Authors (4)
  1. Jiezhang Cao (38 papers)
  2. Yawei Li (72 papers)
  3. Kai Zhang (542 papers)
  4. Luc Van Gool (570 papers)
Citations (151)

Summary

Video Super-Resolution Transformer

In the paper titled "Video Super-Resolution Transformer," the authors address the challenges of video super-resolution (VSR) through an adaptation of the Transformer architecture traditionally used in NLP tasks. VSR aims to reconstruct high-resolution (HR) videos from low-resolution (LR) input sequences by leveraging temporal and spatial information. The authors identify two main challenges when applying traditional Transformer architectures to VSR: the neglect of spatial locality in self-attention layers and the lack of feature alignment capabilities in token-wise feed-forward layers.
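
Formally (our notation, not necessarily the paper's symbols): given a low-resolution sequence $\{x_t\}_{t=1}^{T}$ with $x_t \in \mathbb{R}^{3 \times H \times W}$ and an upscaling factor $s$, the model learns a mapping that outputs a high-resolution estimate $\hat{y}_t \in \mathbb{R}^{3 \times sH \times sW}$ for each frame, trained to minimize a pixel-wise reconstruction loss against the ground-truth frames $y_t$.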

To tackle these challenges, the authors propose a novel architecture called the VSR-Transformer, consisting of a spatial-temporal convolutional self-attention (STCSA) layer and a bidirectional optical flow-based feed-forward (BOFF) layer. The STCSA layer captures locality and spatial-temporal dependencies by integrating convolution operations directly into the self-attention mechanism. In contrast to fully connected self-attention, whose linear query, key, and value projections capture only global, location-agnostic dependencies, the convolutional projections are argued to better capture dependencies across video frames (see the sketch below).
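
The following PyTorch sketch illustrates the idea. It is our simplification, not the authors' released code: the 3x3 kernels, the patch size, and the single attention head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Illustrative spatial-temporal convolutional self-attention.

    Queries, keys, and values come from 3x3 convolutions (which preserve
    locality) rather than the token-wise linear projections of a standard
    Transformer block; attention is then computed jointly over patches
    from all frames. Hyper-parameters are assumptions, not the paper's
    exact settings.
    """

    def __init__(self, channels: int, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.k = nn.Conv2d(channels, channels, 3, padding=1)
        self.v = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W), with H and W divisible by the patch size.
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)
        q, k, v = self.q(frames), self.k(frames), self.v(frames)

        def tokens(f: torch.Tensor) -> torch.Tensor:
            # Cut each frame into non-overlapping patches; every patch of
            # every frame becomes one attention token.
            p = F.unfold(f, kernel_size=self.patch, stride=self.patch)  # (B*T, C*p*p, N)
            return p.transpose(1, 2).reshape(b, -1, p.shape[1])         # (B, T*N, C*p*p)

        q, k, v = tokens(q), tokens(k), tokens(v)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ v                                                  # (B, T*N, C*p*p)

        # Fold the attended patches back into frames.
        n = out.shape[1] // t
        out = out.reshape(b * t, n, -1).transpose(1, 2)                 # (B*T, C*p*p, N)
        out = F.fold(out, output_size=(h, w),
                     kernel_size=self.patch, stride=self.patch)
        return out.reshape(b, t, c, h, w)
```

Because every token is a spatial patch drawn from some frame, the attention map mixes information across both space and time, while the convolutional projections keep neighboring pixels coupled.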

The BOFF layer aims to improve feature alignment across video frames by using optical flow techniques, enabling the model to leverage bidirectional temporal information for better feature alignment and propagation. The authors argue that by using optical flows, the model achieves a better understanding of the correlations between frames, which is essential for accurate video reconstruction.
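
The core operation behind such a layer is flow warping: resampling a neighboring frame's features along an estimated optical flow so that they line up with the current frame's pixel grid. A minimal sketch of that operation follows (our simplification; in the paper the flow would come from a pretrained estimator such as the spatial pyramid network of reference 23, and the warped features feed into convolutions in place of the token-wise feed-forward layer):

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp neighbor-frame features onto the current frame's pixel grid.

    feat: (B, C, H, W) features of a neighboring frame.
    flow: (B, 2, H, W) per-pixel (x, y) displacements, in pixels, from
          the current frame to the neighbor.
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device),
                            indexing="ij")
    base = torch.stack((xs, ys)).float()              # (2, H, W)
    coords = base.unsqueeze(0) + flow                 # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

In a bidirectional layer, features warped from the preceding frame (via forward flow) and from the following frame (via backward flow) are fused with the current frame's own features before the feed-forward stage, so each position sees aligned evidence from both temporal directions.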

Significantly, the paper provides a theoretical underpinning for the design choices of their architecture. By comparing the theoretical capabilities of their proposed layers against traditional fully connected self-attention layers, the authors establish the superiority of the STCSA in learning local patterns, which are crucial for video sequence modeling. Specifically, their theoretical results imply that the explicit modeling of spatial information through convolutional layers leads to more effective learning dynamics, especially when combined with gradient-descent optimization.

Empirical results demonstrate the effectiveness of the VSR-Transformer framework. The authors report experiments conducted on several benchmark datasets such as REDS, Vimeo-90K, and Vid4. Their architecture yields superior PSNR and SSIM scores compared to state-of-the-art methods like EDVR-L and IconVSR, particularly when limited to processing short sequences of 5 to 7 frames. The results indicate that the VSR-Transformer is capable of generating sharper and higher-quality high-resolution frames, substantiating their design objectives.
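
For context, PSNR is a log-scale measure of pixel-wise reconstruction error (higher is better), while SSIM (reference 32 above) additionally accounts for structural similarity. A minimal PSNR implementation, assuming frames normalized to [0, 1]:

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between super-resolved and ground-truth frames."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```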

While the model size of VSR-Transformer exceeds that of some competing architectures, the authors emphasize the trade-off between computational efficiency and performance. The reported performance gains are justified for applications where the quality of the video output is critical, such as video surveillance and high-definition television.

Overall, the paper's contribution to the field of video super-resolution not only involves an innovative architectural design but also includes a theoretical analysis that strengthens the argument for replacing fully connected self-attention layers in sequence modeling tasks with convolutional alternatives. This work opens avenues for future exploration, such as integrating further optimization strategies to mitigate computational overhead and deploying the architecture in practical applications where video quality is paramount.
