
Video Super-Resolution Transformer (2106.06847v3)

Published 12 Jun 2021 in cs.CV

Abstract: Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.

References (36)
  1. Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  2. Adversarial learning with local coordinate coding. In International Conference on Machine Learning, 2018.
  3. Multi-marginal Wasserstein GAN. In Advances in Neural Information Processing Systems, 2019.
  4. BasicVSR: The search for essential components in video super-resolution and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  5. Pre-trained image processing transformer. In Advances in Neural Information Processing Systems, 2021.
  6. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  7. Deformable convolutional networks. In IEEE International Conference on Computer Vision, 2017.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  9. Closed-loop matters: Dual regression networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  10. Recurrent back-projection network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  11. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  12. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, 2020.
  13. Video super-resolution with temporal group attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  14. Revisiting temporal modeling for video super-resolution. In British Machine Vision Conference, 2020.
  15. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  16. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  17. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
  18. C. Liu and D. Sun. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 2013.
  19. I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  20. E. Malach and S. Shalev-Shwartz. Computational separation between convolutional and fully-connected networks. In International Conference on Learning Representations, 2021.
  21. COLA-Net: Collaborative attention network for image restoration. IEEE Transactions on Multimedia, 2021.
  22. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  23. A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  24. Frame-recurrent video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  25. Detail-revealing deep video super-resolution. In IEEE International Conference on Computer Vision, 2017.
  26. TDAN: Temporally-deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  27. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  28. Deformable non-local network for video super-resolution. IEEE Access, 2019.
  29. EDVR: Video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  30. BasicSR. https://github.com/xinntao/BasicSR, 2020.
  31. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
  32. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 2004.
  33. Adversarial sparse transformer for time series forecasting. In Advances in Neural Information Processing Systems, 2020.
  34. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8), 2019.
  35. Learning texture transformer network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  36. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, 2018.
Authors (4)
  1. Jiezhang Cao (38 papers)
  2. Yawei Li (72 papers)
  3. Kai Zhang (542 papers)
  4. Luc Van Gool (570 papers)
Citations (151)

Summary

Video Super-Resolution Transformer

In the paper titled "Video Super-Resolution Transformer," the authors address the challenges of video super-resolution (VSR) through an adaptation of the Transformer architecture traditionally used in NLP tasks. VSR aims to reconstruct high-resolution (HR) videos from low-resolution (LR) input sequences by leveraging temporal and spatial information. The authors identify two main challenges when applying traditional Transformer architectures to VSR: the neglect of spatial locality in self-attention layers and the lack of feature alignment capabilities in token-wise feed-forward layers.
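
Formally (our notation, not necessarily the paper's symbols): given a low-resolution sequence $\{x_t\}_{t=1}^{T}$ with $x_t \in \mathbb{R}^{3 \times H \times W}$ and an upscaling factor $s$, the model learns a mapping that outputs a high-resolution estimate $\hat{y}_t \in \mathbb{R}^{3 \times sH \times sW}$ for each frame, trained to minimize a pixel-wise reconstruction loss against the ground-truth frames $y_t$.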

To tackle these challenges, the authors propose a novel architecture called the VSR-Transformer, consisting of a spatial-temporal convolutional self-attention (STCSA) layer and a bidirectional optical flow-based feed-forward (BOFF) layer. The STCSA layer captures locality and spatial-temporal dependencies by integrating convolution operations directly into the self-attention mechanism. In contrast to fully connected self-attention, whose linear query, key, and value projections capture only global, location-agnostic dependencies, the convolutional projections are argued to better capture dependencies across video frames (see the sketch below).
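
The following PyTorch sketch illustrates the idea. It is our simplification, not the authors' released code: the 3x3 kernels, the patch size, and the single attention head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Illustrative spatial-temporal convolutional self-attention.

    Queries, keys, and values come from 3x3 convolutions (which preserve
    locality) rather than the token-wise linear projections of a standard
    Transformer block; attention is then computed jointly over patches
    from all frames. Hyper-parameters are assumptions, not the paper's
    exact settings.
    """

    def __init__(self, channels: int, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.k = nn.Conv2d(channels, channels, 3, padding=1)
        self.v = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W), with H and W divisible by the patch size.
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)
        q, k, v = self.q(frames), self.k(frames), self.v(frames)

        def tokens(f: torch.Tensor) -> torch.Tensor:
            # Cut each frame into non-overlapping patches; every patch of
            # every frame becomes one attention token.
            p = F.unfold(f, kernel_size=self.patch, stride=self.patch)  # (B*T, C*p*p, N)
            return p.transpose(1, 2).reshape(b, -1, p.shape[1])         # (B, T*N, C*p*p)

        q, k, v = tokens(q), tokens(k), tokens(v)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ v                                                  # (B, T*N, C*p*p)

        # Fold the attended patches back into frames.
        n = out.shape[1] // t
        out = out.reshape(b * t, n, -1).transpose(1, 2)                 # (B*T, C*p*p, N)
        out = F.fold(out, output_size=(h, w),
                     kernel_size=self.patch, stride=self.patch)
        return out.reshape(b, t, c, h, w)
```

Because every token is a spatial patch drawn from some frame, the attention map mixes information across both space and time, while the convolutional projections keep neighboring pixels coupled.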

The BOFF layer aims to improve feature alignment across video frames by using optical flow techniques, enabling the model to leverage bidirectional temporal information for better feature alignment and propagation. The authors argue that by using optical flows, the model achieves a better understanding of the correlations between frames, which is essential for accurate video reconstruction.
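
The core operation behind such a layer is flow warping: resampling a neighboring frame's features along an estimated optical flow so that they line up with the current frame's pixel grid. A minimal sketch of that operation follows (our simplification; in the paper the flow would come from a pretrained estimator such as the spatial pyramid network of reference 23, and the warped features feed into convolutions in place of the token-wise feed-forward layer):

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp neighbor-frame features onto the current frame's pixel grid.

    feat: (B, C, H, W) features of a neighboring frame.
    flow: (B, 2, H, W) per-pixel (x, y) displacements, in pixels, from
          the current frame to the neighbor.
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device),
                            indexing="ij")
    base = torch.stack((xs, ys)).float()              # (2, H, W)
    coords = base.unsqueeze(0) + flow                 # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

In a bidirectional layer, features warped from the preceding frame (via forward flow) and from the following frame (via backward flow) are fused with the current frame's own features before the feed-forward stage, so each position sees aligned evidence from both temporal directions.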

Significantly, the paper provides a theoretical underpinning for the design choices of their architecture. By comparing the theoretical capabilities of their proposed layers against traditional fully connected self-attention layers, the authors establish the superiority of the STCSA in learning local patterns, which are crucial for video sequence modeling. Specifically, their theoretical results imply that the explicit modeling of spatial information through convolutional layers leads to more effective learning dynamics, especially when combined with gradient-descent optimization.

Empirical results demonstrate the effectiveness of the VSR-Transformer framework. The authors report experiments conducted on several benchmark datasets such as REDS, Vimeo-90K, and Vid4. Their architecture yields superior PSNR and SSIM scores compared to state-of-the-art methods like EDVR-L and IconVSR, particularly when limited to processing short sequences of 5 to 7 frames. The results indicate that the VSR-Transformer is capable of generating sharper and higher-quality high-resolution frames, substantiating their design objectives.
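
For context, PSNR is a log-scale measure of pixel-wise reconstruction error (higher is better), while SSIM (reference 32 above) additionally accounts for structural similarity. A minimal PSNR implementation, assuming frames normalized to [0, 1]:

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between super-resolved and ground-truth frames."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```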

While the model size of VSR-Transformer exceeds that of some competing architectures, the authors emphasize the trade-off between computational efficiency and performance. The reported performance gains are justified for applications where the quality of the video output is critical, such as video surveillance and high-definition television.

Overall, the paper's contribution to the field of video super-resolution not only involves an innovative architectural design but also includes a theoretical analysis that strengthens the argument for replacing fully connected self-attention layers in sequence modeling tasks with convolutional alternatives. This work opens avenues for future exploration, such as integrating further optimization strategies to mitigate computational overhead and deploying the architecture in practical applications where video quality is paramount.
