Accurate Image Restoration with Attention Retractable Transformer
This paper introduces the Attention Retractable Transformer (ART), a novel architecture tailored for image restoration. Image restoration, a long-standing problem in computer vision, involves recovering a high-quality image from a degraded observation, with applications such as super-resolution, denoising, and compression artifact reduction. Traditional deep learning approaches rely on convolutional neural networks (CNNs), whose local convolution operations inherently limit their ability to model long-range dependencies. The ART framework leverages the strengths of Transformer-based architectures to address this limitation, accommodating global interactions and enlarging the receptive field.
Core Contributions and Methodology
The primary innovation of ART lies in its hybrid attention mechanism that integrates both dense and sparse attention strategies. This dual mechanism balances computational efficiency with the ability to capture comprehensive global contexts.
- Sparse and Dense Attention Modules: Existing Transformer-based methods typically confine attention computation to non-overlapping local windows, restricting the receptive field to dense local regions. In contrast, ART introduces Sparse Attention Blocks (SABs) that let tokens interact across sparse, non-adjacent positions of the image, widening the attention span and improving the model's ability to capture distant dependencies (see the partitioning sketch after this list).
- Integration of Dense and Sparse Modules: ART's architecture alternates between Dense Attention Blocks (DABs) and SABs to exploit both local and global feature representations. This alternating pattern lets the model span different attention scopes without incurring significant computational overhead (a stage-construction sketch also follows the list).
- Efficient Use of Resources: The sparse-dense paradigm extends the effective field of view with only a modest increase in computation. This efficiency is pivotal for scaling ART to high-resolution images and diverse restoration tasks.
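To make the two attention scopes concrete, here is a minimal sketch (illustrative only, not the authors' implementation) of how feature-map tokens can be grouped for dense window attention versus interval-based sparse attention; the helper names and PyTorch shapes are assumptions:

```python
# Illustrative token grouping for dense vs. sparse attention; assumes an
# input of shape (B, H, W, C) with H and W divisible by the window size
# and interval. Function names are hypothetical.
import torch

def dense_partition(x, window_size):
    """Group tokens into non-overlapping local windows (DAB-style)."""
    B, H, W, C = x.shape
    ws = window_size
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # Each group holds ws*ws spatially contiguous tokens.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def sparse_partition(x, interval):
    """Group tokens sampled at a fixed stride (SAB-style)."""
    B, H, W, C = x.shape
    I = interval
    x = x.view(B, H // I, I, W // I, I, C)
    # Tokens sharing the same offset within each I-by-I cell form a group,
    # so every group spans the whole image at stride I.
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, (H // I) * (W // I), C)

# Example: a 16x16 feature map with window size 4 and interval 4 yields
# 16 groups of 16 tokens in both cases, but the sparse groups are global.
x = torch.randn(1, 16, 16, 32)
print(dense_partition(x, 4).shape)   # torch.Size([16, 16, 32])
print(sparse_partition(x, 4).shape)  # torch.Size([16, 16, 32])
```

Self-attention is then computed within each group, so for equal group sizes the two variants cost the same, while the sparse grouping sees the whole image.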
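A stage can then alternate the two block types. The sketch below uses hypothetical factory functions in place of the paper's actual DAB/SAB modules (which also contain feed-forward layers, normalization, and residual connections):

```python
import torch.nn as nn

def build_art_stage(make_dense_block, make_sparse_block, depth):
    """Stack blocks so dense and sparse attention scopes alternate."""
    return nn.Sequential(*[
        make_dense_block() if i % 2 == 0 else make_sparse_block()
        for i in range(depth)
    ])

# Placeholder example; real factories would return attention blocks.
stage = build_art_stage(nn.Identity, nn.Identity, depth=6)
```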
Performance Evaluation
The experimental validation of ART spans three core image restoration tasks: image super-resolution, denoising, and JPEG compression artifact reduction. Across multiple benchmark datasets (Set5, Set14, B100, Urban100, and Manga109), ART achieves higher PSNR and SSIM than existing CNN- and Transformer-based models such as EDSR, RCAN, SAN, and SwinIR.
ART is particularly effective at recovering high-frequency details, which matter most on texture-rich benchmarks such as Urban100 and Manga109. On these datasets it consistently outperforms its closest Transformer-based competitor, SwinIR, in both PSNR and SSIM, indicating the advantage of its retractable attention strategy.
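For reference, PSNR is a simple function of mean squared error. The snippet below is a generic sketch, not the paper's evaluation code (super-resolution benchmarks typically compute PSNR on the luminance channel after cropping image borders):

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # images are identical
    return 10.0 * np.log10(max_val ** 2 / mse)
```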
Implications and Future Directions
ART represents a substantial step forward for Transformer-based image restoration, broadening the effective receptive field while keeping computation tractable. The dual attention mechanism offers a versatile toolset that could be tailored or extended to other computer vision applications beyond those explored in this paper. Given ART's success, future work might extend the framework to additional degradations such as deblurring or dehazing, or explore its adaptability to video restoration tasks.
Moreover, future work might optimize the sparse-dense attention integration or adjust the sparse attention's interval size dynamically based on image content, refining the balance between performance and computational cost.
In summary, the paper offers a compelling advancement in image restoration, leveraging the Transformer architecture's potential to capture intricate and expansive contexts effectively, thus broadening the toolbox for tackling varied low-level vision problems.