Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
(1609.05158v2)
Published 16 Sep 2016 in cs.CV and stat.ML
Abstract: Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
This paper introduces the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) for real-time single image and video super-resolution. The core contribution is a novel network architecture that extracts feature maps from the low-resolution (LR) input image directly and performs the upsampling to high-resolution (HR) only at the very end using a learned sub-pixel convolution layer. This approach contrasts with previous CNN-based methods, such as SRCNN (Dong et al., 2014), which typically upscale the LR image using a fixed filter (like bicubic interpolation) before feeding it into the network.
The paper highlights two main drawbacks of the previous approach:
Increased Computational Complexity: Performing convolution operations on the already upscaled HR image is computationally expensive, especially for large images and high upscaling factors r. For a fixed filter size, the cost scales with the square of the upscale factor, r², because every layer must process r² times as many pixels (see the worked example after this list).
Sub-optimal Upscaling: Using a fixed, handcrafted filter for initial upsampling before the CNN does not add information relevant to the ill-posed SR problem and might limit the network's ability to learn the optimal LR-to-HR mapping.
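To make the r² scaling concrete, here is a back-of-the-envelope count (the layer sizes are illustrative, not taken from the paper). A convolution with k×k kernels, $n_{in}$ input channels, and $n_{out}$ output channels applied over an $H_{out} \times W_{out}$ output grid costs

$k^2 \cdot n_{in} \cdot n_{out} \cdot H_{out} \cdot W_{out}$

multiply-accumulate operations. Running the same layer on a pre-upscaled rH×rW grid therefore costs $r^2$ times more than running it on the H×W LR grid; for r=3, every layer placed in HR space is 9× more expensive.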
The ESPCN architecture addresses these issues by processing feature extraction entirely in the LR space. The network consists of L−1 convolutional layers that operate on the LR input to produce a set of LR feature maps. The final layer is the proposed sub-pixel convolution layer, which takes the LR feature maps and aggregates them to construct the final HR output image.
The sub-pixel convolution layer effectively replaces the initial fixed upsampling. Instead of convolving in HR space, it performs convolution in LR space and then rearranges the resulting feature map elements to form the HR image.
Specifically, if the network outputs C⋅r² feature maps of size H×W (where H×W is the LR resolution, C is the number of color channels in the HR output, and r is the upscale factor), the sub-pixel convolution layer reorganizes these feature maps into an HR image of size rH×rW×C. This is done using a periodic shuffling operator (PS).
Mathematically, the periodic shuffling operator PS takes a tensor T of shape H×W×C⋅r² and outputs a tensor of shape rH×rW×C:
$PS(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\, \lfloor y/r \rfloor,\, C \cdot r \cdot \text{mod}(y,r) + C \cdot \text{mod}(x,r) + c}$
where x,y are coordinates in the HR output, ⌊⋅⌋ denotes the floor function, mod is the modulo operator, and c is the channel index.
This implementation means that the convolution operation in the final layer produces feature maps at LR resolution, and the upscaling is achieved by simply rearranging the pixels from these C⋅r² feature maps into the HR grid. This is significantly more efficient computationally than performing convolutions in HR space.
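A small NumPy sketch makes the operator concrete; it follows the indexing formula above directly (the function name and test shapes are illustrative, and this naive loop version is written for clarity, not speed):

```python
import numpy as np

def periodic_shuffle(t, r):
    """Rearrange an (H, W, C*r^2) tensor into an (r*H, r*W, C) tensor by applying
    PS(T)[x, y, c] = T[x//r, y//r, C*r*mod(y, r) + C*mod(x, r) + c]."""
    h, w, crr = t.shape
    c = crr // (r * r)
    out = np.empty((r * h, r * w, c), dtype=t.dtype)
    for x in range(r * h):
        for y in range(r * w):
            # The sub-pixel offsets mod(x, r) and mod(y, r) select the base channel.
            base = c * r * (y % r) + c * (x % r)
            out[x, y, :] = t[x // r, y // r, base:base + c]
    return out

# Shape check: 4 * 3^2 = 36 LR feature maps -> a 3x-upscaled 4-channel image.
lr_maps = np.random.randn(8, 8, 4 * 3 * 3)
assert periodic_shuffle(lr_maps, r=3).shape == (24, 24, 4)
```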
Practical Implementation Details:
Network Structure: A simple 3-layer CNN performs feature extraction in LR space. The first layer uses 5×5 filters with 64 feature maps, the second uses 3×3 filters with 32 feature maps, and the final sub-pixel convolution layer uses 3×3 filters to output C⋅r² feature maps (a code sketch of this structure appears after this list).
Training: The network is trained end-to-end using mean squared error (MSE) between the predicted HR image and the ground-truth HR image. Training data consists of LR/HR image pairs generated by blurring HR images with a Gaussian filter and downsampling them by the upscale factor; sub-images (patches) are extracted for training.
Periodic Shuffling during Training: The paper notes that performing the periodic shuffle can be costly during training. An optimization is to pre-shuffle the ground-truth training data into the layer's pre-shuffle layout, so the loss is computed directly on the C⋅r² LR-resolution feature maps. This avoids performing the shuffle during the forward and backward passes, making training log₂r² times faster than an equivalent deconvolution approach (this trick is included in the sketch after this list).
Activation Function: Tanh activation is found to perform better than ReLU for this task.
Upscaling Multiple Channels: For color images (e.g., YCbCr), the network is typically trained on the luminance channel (Y), where most high-frequency details are concentrated. The chrominance channels (Cb, Cr) are often upscaled separately using simpler methods like bicubic interpolation. However, the sub-pixel convolution layer inherently supports upscaling all channels simultaneously by outputting C⋅r² feature maps and arranging them into an rH×rW×C tensor. The paper trains on the Y channel and evaluates PSNR on Y, but the architecture allows full-color SR.
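To tie the items above together, here is a minimal PyTorch sketch of the 3-layer network with tanh activations and the pre-shuffled-target training trick. The layer widths follow the paper's configuration (5×5/64, 3×3/32, 3×3/C⋅r²); the batch and patch sizes, padding choice, and the use of pixel_unshuffle to pre-shuffle targets are illustrative assumptions, not details fixed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESPCN(nn.Module):
    """3-layer ESPCN: all feature extraction in LR space, pixel shuffle at the end."""
    def __init__(self, upscale_factor: int, channels: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, 64, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        # Final layer emits C * r^2 feature maps, still at LR resolution.
        self.conv3 = nn.Conv2d(32, channels * upscale_factor ** 2,
                               kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale_factor)

    def features(self, x):
        x = torch.tanh(self.conv1(x))  # tanh outperformed ReLU in the paper
        x = torch.tanh(self.conv2(x))
        return self.conv3(x)           # LR-layout output, before the shuffle

    def forward(self, x):
        return self.shuffle(self.features(x))  # full SR output for inference

r = 3
model = ESPCN(upscale_factor=r, channels=1)

# Training with pre-shuffled targets: "unshuffle" the HR ground truth once,
# then compute MSE in LR layout, skipping the shuffle in forward/backward.
lr_patch = torch.randn(16, 1, 17, 17)            # batch of LR sub-images
hr_target = torch.randn(16, 1, 17 * r, 17 * r)   # matching HR ground truth
target_lr_layout = F.pixel_unshuffle(hr_target, r)   # shape (16, r^2, 17, 17)
loss = F.mse_loss(model.features(lr_patch), target_lr_layout)
loss.backward()
```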
Key Results and Applications:
Speed: ESPCN is significantly faster than previous CNN methods like SRCNN and TNRD. On a K2 GPU, it achieved an average processing time of 4.7 ms per image from Set14 (for 3× upscaling), enabling real-time (≥ 30 fps) super-resolution of 1080p HD video on a single GPU. This is nearly an order of magnitude faster than SRCNN.
Accuracy: ESPCN achieves state-of-the-art or competitive PSNR scores on standard image and video benchmark datasets (Set5, Set14, BSD300, BSD500, SuperTexture, Xiph, Ultra Video Group) compared to SRCNN and TNRD. The improvement is particularly noticeable on video datasets.
Real-time Video SR: The efficiency gain is crucial for video applications, making it practical to super-resolve HD videos frame by frame in real-time.
Implementation Considerations:
Computational Requirements: While significantly faster than predecessors, real-time HD video SR still requires a capable GPU. The paper used a K2 GPU.
Memory: Processing feature maps in LR space also reduces memory consumption compared to processing in HR space, making it more feasible for larger inputs.
Training Data: Training on a large dataset like ImageNet significantly improves performance. The number of parameters is smaller than SRCNN 9-5-5, further contributing to efficiency.
Framework Support: Implementing the periodic shuffling operator efficiently requires support in deep learning frameworks. Modern frameworks provide operations designed for exactly this rearrangement (e.g., tf.nn.depth_to_space in TensorFlow and torch.nn.PixelShuffle / torch.nn.functional.pixel_shuffle in PyTorch), making implementation straightforward.
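As a quick illustration (the shapes here are arbitrary), PyTorch's built-in op performs the same periodic rearrangement described above, up to the framework's channel-ordering convention:

```python
import torch
import torch.nn.functional as F

# 2 * 4^2 = 32 LR feature maps in NCHW layout -> a 4x-upscaled 2-channel image.
lr_maps = torch.randn(1, 2 * 4 ** 2, 10, 10)
hr = F.pixel_shuffle(lr_maps, 4)  # second argument is the upscale factor r
assert hr.shape == (1, 2, 40, 40)
```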
The ESPCN paper demonstrates that performing computations in the LR domain and learning the upsampling filters as part of the network training is a highly effective strategy for efficient and accurate super-resolution, particularly for real-time video applications. The sub-pixel convolution layer is the key enabler for this efficiency.