Cross Aggregation Transformer for Image Restoration: An Expert Overview
The field of image restoration has seen significant advances with the adoption of deep learning, particularly architectures originally developed for natural language processing such as the Transformer. Convolutional neural networks (CNNs) have traditionally dominated this area, effectively addressing tasks like image super-resolution (SR), denoising, and compression artifact reduction. However, while CNNs are adept at capturing local features, their capacity to model long-range dependencies across an image is limited. The paper introduces a new Transformer-based model, the Cross Aggregation Transformer (CAT), designed to address these limitations with a self-attention mechanism tailored for image restoration.
Core Innovations in Cross Aggregation Transformer
- Rectangle-Window Self-Attention (Rwin-SA): The CAT model's pivotal component is the Rwin-SA mechanism. Unlike square-window attention, Rwin-SA employs rectangular windows, with horizontal and vertical rectangles processed in parallel by different attention heads. This parallel processing enhances the model's ability to capture directional dependencies and expands the receptive field across the image without a prohibitive increase in computational cost. The use of rectangular rather than square windows marks a distinct departure from methods like SwinIR, providing a more nuanced way of capturing diverse textural and structural features.
- Axial-Shift Operation: To further augment interaction across windows, CAT incorporates an axial-shift operation, an evolution of the shifted-window scheme used in Swin Transformer. It creates explicit interactions between horizontal-horizontal and vertical-vertical windows while implicitly connecting horizontal windows to vertical ones. This design yields more comprehensive cross-window feature integration, improving the model's ability to aggregate information over larger image regions.
- Locality Complementary Module (LCM): Recognizing the value of CNN inductive biases (such as translation invariance and locality), the authors introduce the LCM as a complement to the self-attention framework. By applying a convolution directly to the value (V) branch of the Transformer's attention blocks, the LCM bridges local feature extraction with global dependency modeling, tightening the coupling of these two modalities in image restoration tasks.
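The three components above can be illustrated together in a single block. The following is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names (`RwinBlockSketch`, the `(4, 16)` rectangle) are assumptions for illustration. It splits the channel dimension in half, attends within horizontal rectangles on one half and vertical rectangles on the other, models the axial shift as a `torch.roll` by half a rectangle, and stands in for the LCM with a depthwise 3x3 convolution on V.

```python
import torch
import torch.nn as nn


def window_partition(x, h_win, w_win):
    # (B, H, W, C) -> (B * num_windows, h_win * w_win, C)
    B, H, W, C = x.shape
    x = x.view(B, H // h_win, h_win, W // w_win, w_win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, h_win * w_win, C)


def window_reverse(win, h_win, w_win, H, W):
    # inverse of window_partition: tokens back to a (B, H, W, C) map
    B = win.shape[0] // ((H // h_win) * (W // w_win))
    x = win.view(B, H // h_win, W // w_win, h_win, w_win, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class RwinBlockSketch(nn.Module):
    """Illustrative Rwin-SA block (names/defaults are assumptions):
    half the channels attend in horizontal rectangles (sh x sw), the
    other half in vertical rectangles (sw x sh); a depthwise conv on V
    plays the role of the locality complementary module (LCM)."""

    def __init__(self, dim, rect=(4, 16), shift=False):
        super().__init__()
        assert dim % 2 == 0
        self.rect, self.shift = rect, shift
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.lcm = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    @staticmethod
    def _attend(q, k, v):
        scale = q.shape[-1] ** -0.5
        attn = ((q * scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        return attn @ v

    def forward(self, x):  # x: (B, H, W, C); H, W divisible by both sides
        B, H, W, C = x.shape
        sh, sw = self.rect
        if self.shift:  # axial shift: roll by half the rectangle on both axes
            x = torch.roll(x, shifts=(-sh // 2, -sw // 2), dims=(1, 2))
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # LCM stand-in: depthwise conv over V in the spatial domain
        local = self.lcm(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        outs = []
        for i, (hw, ww) in enumerate([(sh, sw), (sw, sh)]):  # H-Rwin, V-Rwin
            sl = slice(i * C // 2, (i + 1) * C // 2)
            qw = window_partition(q[..., sl], hw, ww)
            kw = window_partition(k[..., sl], hw, ww)
            vw = window_partition(v[..., sl], hw, ww)
            outs.append(window_reverse(self._attend(qw, kw, vw), hw, ww, H, W))
        x = self.proj(torch.cat(outs, dim=-1) + local)
        if self.shift:  # undo the axial shift
            x = torch.roll(x, shifts=(sh // 2, sw // 2), dims=(1, 2))
        return x
```

Stacking such blocks with `shift` alternating between `False` and `True` mirrors how shifted-window designs propagate information across window boundaries; the real model additionally uses multi-head attention and relative position bias inside each rectangle.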
Empirical Results and Implications
The proposed CAT model is evaluated through extensive experiments across multiple benchmarks for image SR, JPEG compression artifact reduction, and real image denoising. The results are compelling: CAT outperforms recent state-of-the-art models, including SwinIR and various CNN-based approaches, on several metrics, with particular strength on datasets containing complex directional textures such as Urban100.
By achieving superior performance at a computational cost comparable to prior models, the CAT approach demonstrates the potential of Transformer architectures tailored to low-level vision tasks. Its design reinforces the viability of pairing global feature extraction with robust local feature handling to capture spatial structure at multiple scales.
Future Directions and Considerations
Given the demonstrated effectiveness of CAT in image restoration, several future research directions emerge. Improving the scalability of Rwin-SA for ultra-high-resolution images could further extend its applicability. Hybrid approaches that integrate CNNs with Transformers at a finer granularity might yield even stronger restoration performance, especially in domains with intricate texture patterns.
On the theoretical front, systematic exploration of attention-window design (shapes and sizes) and its implications for computational efficiency remains open. Evaluating the impact of these designs on hardware acceleration for real-time applications is also of significant practical importance.
In conclusion, CAT offers a design that efficiently balances the Transformer's global perspective with the CNN's focused local processing, marking a noteworthy advance toward versatile, high-performance image restoration frameworks.