
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform (1804.02815v1)

Published 9 Apr 2018 in cs.CV

Abstract: Despite that convolutional neural networks (CNN) have recently demonstrated high-quality reconstruction for single-image super-resolution (SR), recovering natural and realistic texture remains a challenging problem. In this paper, we show that it is possible to recover textures faithful to semantic classes. In particular, we only need to modulate features of a few intermediate layers in a single network conditioned on semantic segmentation probability maps. This is made possible through a novel Spatial Feature Transform (SFT) layer that generates affine transformation parameters for spatial-wise feature modulation. SFT layers can be trained end-to-end together with the SR network using the same loss function. During testing, it accepts an input image of arbitrary size and generates a high-resolution image with just a single forward pass conditioned on the categorical priors. Our final results show that an SR network equipped with SFT can generate more realistic and visually pleasing textures in comparison to state-of-the-art SRGAN and EnhanceNet.

Citations (903)

Summary

  • The paper's key contribution is the Spatial Feature Transform (SFT) layer that integrates semantic segmentation into super-resolution networks.
  • The method modulates feature maps with learned affine parameters to generate textures that match their semantic context.
  • User studies and experiments demonstrate that SFT-GAN produces perceptually rich textures, outperforming traditional PSNR-based and other GAN methods.

Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform

Introduction

The paper "Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform" by Xintao Wang et al. introduces a novel approach to improve the perceptual quality of image super-resolution (SR). While CNN-based methods have achieved significant advancements in SR, they frequently struggle to recover natural and realistic textures. The authors propose leveraging semantic segmentation as a categorical prior to enhance texture fidelity in the super-resolved images. The cornerstone of this approach is the Spatial Feature Transform (SFT) layer, which modulates intermediate features in the SR network based on semantic segmentation probability maps.

Methodology

The paper's key contribution is the introduction of the SFT layer, which conditions the SR network on semantic information represented by segmentation probability maps. The SFT layer modifies feature maps through learned affine transformations, parameterized by $\gamma$ and $\beta$. This mechanism allows the SR network to generate textures faithful to the underlying semantic classes without requiring multiple forward passes or separate models for each class.

Spatial Feature Transform Layer

An SFT layer generates the modulation parameters $\gamma$ and $\beta$ based on the segmentation probability maps $\mathbf{P}$:

$$(\gamma, \beta) = \mathcal{M}(\mathbf{P})$$

The parameters are then used to modulate the feature maps $\mathbf{F}$:

$$\mathrm{SFT}(\mathbf{F} \mid \gamma, \beta) = \gamma \odot \mathbf{F} + \beta$$

where $\odot$ denotes element-wise multiplication. This approach allows for both feature-wise and spatial-wise transformations, ensuring that the texture generation considers local semantic information.
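To make the formulation concrete, here is a minimal PyTorch sketch of an SFT layer. It implements the affine modulation above; the 1×1 convolution branches, LeakyReLU activations, and channel sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Minimal Spatial Feature Transform layer (illustrative sketch).

    Predicts per-pixel affine parameters (gamma, beta) from shared
    conditions derived from segmentation probability maps, then applies
    SFT(F | gamma, beta) = gamma * F + beta element-wise.
    """
    def __init__(self, feature_channels=64, condition_channels=32):
        super().__init__()
        # Two small conv branches map the shared condition to gamma and beta.
        # 1x1 convs and channel widths here are assumptions for illustration.
        self.gamma_branch = nn.Sequential(
            nn.Conv2d(condition_channels, condition_channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(condition_channels, feature_channels, 1),
        )
        self.beta_branch = nn.Sequential(
            nn.Conv2d(condition_channels, condition_channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(condition_channels, feature_channels, 1),
        )

    def forward(self, features, conditions):
        gamma = self.gamma_branch(conditions)  # (N, C, H, W): spatial scale
        beta = self.beta_branch(conditions)    # (N, C, H, W): spatial shift
        return gamma * features + beta
```

Because $\gamma$ and $\beta$ are full spatial maps rather than per-channel scalars, the modulation can vary from pixel to pixel, which is what lets different semantic regions of the same image receive different texture treatment.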

Network Architecture

The overall framework consists of two main components: a condition network and an SR network. The condition network processes the semantic segmentation maps to generate intermediate conditions that are broadcast to all SFT layers within the SR network. The SR network itself is built with 16 residual blocks, each equipped with SFT layers to effectively utilize the segmentation priors. The network is trained using a combination of perceptual and adversarial losses, promoting the generation of visually appealing textures.
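The sketch below, continuing from the SFTLayer above, shows one way this wiring could look: a condition network turns segmentation probability maps into shared conditions, and a residual block applies SFT before each convolution. The class count, channel widths, and block-internal ordering are assumptions for illustration, not the paper's exact design.

```python
class ResBlockSFT(nn.Module):
    """Residual block with SFT modulation (illustrative sketch)."""
    def __init__(self, channels=64, condition_channels=32):
        super().__init__()
        self.sft0 = SFTLayer(channels, condition_channels)
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.sft1 = SFTLayer(channels, condition_channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, features, conditions):
        out = self.act(self.conv0(self.sft0(features, conditions)))
        out = self.conv1(self.sft1(out, conditions))
        return features + out  # identity skip; conditions are reused per block

class ConditionNetwork(nn.Module):
    """Maps segmentation probability maps to shared SFT conditions (sketch)."""
    def __init__(self, num_classes=8, condition_channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(num_classes, 64, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(64, condition_channels, 1),
        )

    def forward(self, prob_maps):
        return self.body(prob_maps)

# Hypothetical usage: conditions are computed once, then shared by all blocks.
cond_net = ConditionNetwork(num_classes=8)
block = ResBlockSFT()
seg_probs = torch.softmax(torch.randn(1, 8, 32, 32), dim=1)  # mock probabilities
features = torch.randn(1, 64, 32, 32)
out = block(features, cond_net(seg_probs))  # same shape as features
```

Since the conditions are computed once and shared across all 16 blocks, a single forward pass suffices regardless of how many semantic classes appear in the input image.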

Experimental Results

The authors performed extensive experiments to assess the qualitative performance of their method (SFT-GAN). Comparisons were made against various PSNR-oriented methods such as SRCNN, VDSR, LapSRN, DRRN, and MemNet, as well as GAN-based methods like SRGAN and EnhanceNet. The results showed that SFT-GAN generates more realistic and visually rich textures, particularly in challenging regions such as buildings, animals, and natural landscapes.

User Study

A user study involving 30 participants was conducted to quantitatively compare the perceptual quality of SFT-GAN against other methods. Participants consistently ranked SFT-GAN higher in terms of realism and visual appeal. Specifically, SFT-GAN surpassed traditional PSNR-oriented methods and showed significant improvements over SRGAN and EnhanceNet.

Implications and Future Work

The introduction of the SFT layer and its successful application in SFT-GAN demonstrates the potential of semantic segmentation priors in enhancing SR quality. This technique opens up new avenues for further improvements in generative modeling tasks, where semantic information can be leveraged to fine-tune output characteristics dynamically.

Conclusion

The paper makes a significant contribution to the field of image super-resolution by addressing the prevalent challenge of texture realism. The innovative use of Spatial Feature Transform layers, conditioned on semantic segmentation maps, enables the generation of high-quality textures that align with the semantic context of different image regions. Future work could explore extending this approach to finer-grained indoor scenes and integrating joint optimization of segmentation and SR networks for enhanced performance.