Video Super-Resolution Transformer (2106.06847v3)
Abstract: Video super-resolution (VSR), which aims to restore a high-resolution video from its low-resolution counterpart, is a spatial-temporal sequence prediction problem. Recently, the Transformer has gained popularity due to its parallel computing ability for sequence-to-sequence modeling, so it seems straightforward to apply the vision Transformer to VSR. However, the typical Transformer block design, with a fully connected self-attention layer and a token-wise feed-forward layer, is not well suited to VSR for two reasons. First, the fully connected self-attention layer fails to exploit data locality because it relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment that is important for VSR, since it processes each input token embedding independently, without any interaction among them. In this paper, we make the first attempt to adapt the Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer, with a theoretical analysis, that exploits locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer that discovers correlations across video frames and aligns features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.
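The first idea in the abstract, replacing the fully connected query/key/value projections of standard self-attention with convolutional ones so that attention maps are computed from locality-aware features, can be sketched as follows. This is a minimal illustration under assumed shapes (each frame treated as one token carrying a spatial feature map), not the paper's exact architecture; the class name `ConvSelfAttention` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn


class ConvSelfAttention(nn.Module):
    """Sketch of a spatial-temporal convolutional self-attention layer:
    queries, keys, and values come from 3x3 convolutions (which preserve
    spatial locality) rather than the fully connected projections of a
    standard Transformer block. Illustrative only, not the paper's design."""

    def __init__(self, channels: int):
        super().__init__()
        # Convolutional projections keep each token aware of its neighborhood.
        self.q = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) -- one spatial feature map per video frame.
        t, c, h, w = x.shape
        q = self.q(x).flatten(1)  # (T, C*H*W)
        k = self.k(x).flatten(1)
        v = self.v(x).flatten(1)
        # Frame-to-frame attention over the whole sequence.
        attn = torch.softmax(q @ k.t() / q.shape[1] ** 0.5, dim=-1)  # (T, T)
        out = attn @ v  # mix information across frames
        return out.view(t, c, h, w)
```

A full VSR Transformer would stack such layers with the flow-based feed-forward alignment the abstract describes; this fragment only shows how convolutional projections inject locality into the attention computation.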
Authors: Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool