- The paper introduces TTSR, which transfers high-resolution textures from reference images via transformer-based attention to enhance image super-resolution.
- It employs a learnable texture extractor together with hard- and soft-attention modules to match and selectively integrate features from reference images.
- Experiments demonstrate that TTSR significantly outperforms state-of-the-art SISR and RefSR methods in PSNR and SSIM as well as in perceived visual quality.
Learning Texture Transformer Network for Image Super-Resolution
The paper "Learning Texture Transformer Network for Image Super-Resolution" introduces the Texture Transformer Network (TTSR), a novel architecture that advances image super-resolution (SR) by transferring high-resolution (HR) textures from a reference (Ref) image to a low-resolution (LR) input. The work sits within the broader SR literature, which seeks to reconstruct realistic textures that are lost in low-resolution images. Unlike prior single-image super-resolution (SISR) and reference-based super-resolution (RefSR) frameworks, TTSR adapts the transformer's attention mechanism to the RefSR setting, achieving superior texture recovery.
Key Contributions and Architecture
The TTSR model is composed of several complementary modules, each contributing to the overall efficacy of the network:
- Learnable Texture Extractor (LTE): Learns texture features end to end, so the extractor's parameters are updated jointly with the rest of the network. This improves on fixed, pretrained extractors such as VGG features by producing a shared embedding for the LR and Ref images, laying a robust foundation for the subsequent attention operations.
- Relevance Embedding Module: Treats features of the upsampled LR image as queries and features of the correspondingly down/up-sampled Ref image as keys, computing a relevance score between each query and key patch via a normalized inner product. These scores produce the hard- and soft-attention maps that guide texture transfer and synthesis.
- Hard-Attention Module: For each query position, selects and transfers only the single Ref texture feature with the highest relevance score, avoiding the blur introduced by the weighted feature averaging of conventional attention mechanisms.
- Soft-Attention Module: Integrates the transferred HR features with LR features using computed soft-attention maps, enhancing relevant textures while attenuating irrelevant ones.
- Cross-Scale Feature Integration (CSFI): Texture transformers are stacked at multiple scales, and the CSFI module exchanges features among these scales through up- and down-sampling, so texture information learned at one resolution strengthens feature learning at the others.
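The relevance embedding and hard/soft-attention steps above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: patches are assumed to be pre-extracted into row vectors, the convolutional fusion of concatenated features is reduced to a residual blend, and all names (`texture_transfer`, `q`, `k`, `v`, `lr_feat`) are placeholders.

```python
import numpy as np

def texture_transfer(q, k, v, lr_feat):
    """Sketch of TTSR-style hard/soft attention over texture patches.

    q:       (Nq, d) query patches from the upsampled LR image
    k:       (Nk, d) key patches from the down/up-sampled Ref image
    v:       (Nk, d) value patches from the original HR Ref image
    lr_feat: (Nq, d) LR features to fuse with the transferred textures
    """
    # Relevance embedding: normalized inner product between every
    # query patch and every key patch.
    qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    rel = qn @ kn.T                        # (Nq, Nk) relevance scores

    # Hard attention: transfer only the single most relevant Ref patch
    # per query, instead of a blur-prone weighted average.
    hard_idx = rel.argmax(axis=1)          # (Nq,)
    transferred = v[hard_idx]              # (Nq, d) transferred HR textures

    # Soft attention: the maximum relevance weights how strongly the
    # transferred texture is trusted at each position.
    soft = rel.max(axis=1, keepdims=True)  # (Nq, 1)

    # Fusion (simplified): residual-style blend weighted by soft attention.
    return lr_feat + soft * transferred
```

In the actual network, hard attention indexes unfolded feature patches and the soft-attention map multiplies a convolution over the concatenated LR and transferred features; the residual blend here only conveys the idea that low-relevance transfers are attenuated.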
Performance Evaluation
Extensive experiments validate TTSR against state-of-the-art SISR methods (e.g., RCAN, RDN) and RefSR methods (e.g., SRNTT, CrossNet) on the CUFED5, Sun80, Urban100, and Manga109 datasets. TTSR achieves significant gains in PSNR and SSIM, with the reconstruction-loss-only variant TTSR-rec excelling in particular on these objective measures. Subjective evaluations through user studies corroborate the improved visual quality, with TTSR's outputs preferred over those of competing models.
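For reference, PSNR, the main objective metric reported, is a standard function of the mean squared error against the ground-truth image. A small sketch of the usual computation (the function name and the choice to operate on raw arrays are illustrative; SR papers typically evaluate on the luminance channel):

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image
    and an estimate, both given as arrays of the same shape."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR depends only on pixel-wise error, a model trained purely on reconstruction loss (such as TTSR-rec) tends to score higher on it than variants trained with adversarial or perceptual losses, even when the latter look sharper.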
Implications and Future Directions
TTSR's integration of transformer mechanisms into image generation tasks, particularly for SR, underscores a promising direction for future research. The compelling results open avenues for exploring transformers in diverse domains of computer vision and image processing, including tasks that necessitate nuanced texture synthesis. Future work may delve into optimizing the computational efficiency of TTSR, exploring semi-supervised or unsupervised approaches to further harness texture transfer capabilities, and adapting the architecture for real-time applications.
In summary, by leveraging attention mechanisms and a cross-scale integration strategy, TTSR establishes a new benchmark in the field of super-resolution, offering both compelling theoretical insights and practical improvements in handling high-frequency texture details.