LoFormer: Local Frequency Transformer for Image Deblurring (2407.16993v1)

Published 24 Jul 2024 in cs.CV

Abstract: Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur

Authors (5)

Xintian Mao
Jiansheng Wang
Xingran Xie
Qingli Li
Yan Wang

Summary

The paper introduces LoFormer, an architecture for image deblurring that performs self-attention in the frequency domain. Self-attention (SA) is computationally expensive, particularly on high-resolution images, so existing deblurring methods either adopt localized SA, which limits their capacity to capture long-range dependencies, or resort to coarse-grained global SA, which lacks fine-grained correlation. LoFormer, the Local Frequency Transformer, is designed to overcome both limitations by modeling long-range dependencies without sacrificing fine-grained detail.
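To make the cost argument concrete, the following sketch counts the entries of the attention map for full spatial SA (one token per pixel) and for channel-wise SA (one token per channel). The feature-map size and channel count are illustrative assumptions, not values from the paper, and `attention_map_entries` is a hypothetical helper.

```python
def attention_map_entries(height, width, channels):
    """Count attention-map entries for two attention layouts.

    Spatial SA attends between pixels, so its map is (H*W) x (H*W).
    Channel-wise SA attends between channels, so its map is C x C.
    """
    tokens = height * width
    return {"spatial": tokens * tokens, "channel_wise": channels * channels}

# Illustrative sizes only (assumed, not taken from the paper).
sizes = attention_map_entries(height=256, width=256, channels=48)
print(sizes)
```

For a feature map of this size, the spatial attention map has about 4.3 billion entries while the channel-wise map has a few thousand, which is why full spatial SA is usually avoided at high resolution.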

Key Innovations and Methodology

Local Channel-wise Self-Attention in the Frequency Domain (Freq-LC): LoFormer transforms image features into frequency tokens with the Discrete Cosine Transform (DCT) and partitions those tokens into non-overlapping local windows. Channel-wise self-attention is then computed inside each window, capturing cross-covariance within low- and high-frequency windows simultaneously. This gives coarse-grained structures and fine-grained details equitable learning opportunities and explores a broader range of representational properties than coarse-grained global self-attention.
MLP Gating Mechanism (MGate): Complementary to Freq-LC, an MLP gating mechanism filters out irrelevant features while enhancing global learning capabilities, refining the representation produced by the frequency-domain self-attention.
Architecture and Complexity Considerations: LoFormer follows a UNet-inspired architecture, in line with existing efficient image restoration frameworks, but performs its attention in the frequency domain rather than the spatial domain. The DCT adds only modest overhead: computed with a fast transform, its cost is log-linear, O(N log N), in the number of pixels N.
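The Freq-LC steps described above (DCT, non-overlapping windows, channel-wise attention per window, inverse DCT) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the projection matrices `wq`, `wk`, `wv` stand in for learned layers, the placement of the projections relative to the DCT is assumed, and the DCT is computed with a dense basis matrix for readability rather than with a fast transform.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: rows index frequency, columns index position."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def dct2(x):
    """2D DCT over the last two axes of a (C, H, W) array."""
    return dct_matrix(x.shape[-2]) @ x @ dct_matrix(x.shape[-1]).T

def idct2(x):
    """Inverse of dct2 (the basis is orthonormal, so the inverse is the transpose)."""
    return dct_matrix(x.shape[-2]).T @ x @ dct_matrix(x.shape[-1])

def to_windows(x, w):
    """Split (C, H, W) into non-overlapping w x w windows: (num_windows, C, w*w)."""
    c, h, wd = x.shape
    x = x.reshape(c, h // w, w, wd // w, w)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, c, w * w)

def from_windows(t, w, shape):
    """Inverse of to_windows."""
    c, h, wd = shape
    t = t.reshape(h // w, wd // w, c, w, w)
    return t.transpose(2, 0, 3, 1, 4).reshape(c, h, wd)

def channel_attention(q, k, v):
    """Channel-wise attention per window: the attention map is C x C."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-8)
    attn = q @ k.transpose(0, 2, 1)                      # (num_windows, C, C)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)             # softmax over channels
    return attn @ v

def freq_lc(x, wq, wk, wv, window=8):
    """Sketch of local channel-wise attention in the DCT domain (assumed layout)."""
    freq = dct2(x)
    # Channel-mixing projections; stand-ins for learned layers (assumption).
    q, k, v = (to_windows(np.einsum("oc,chw->ohw", w_, freq), window)
               for w_ in (wq, wk, wv))
    out = channel_attention(q, k, v)
    return idct2(from_windows(out, window, x.shape))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))                      # toy (C, H, W) feature map
wq, wk, wv = (rng.standard_normal((4, 4)) for _ in range(3))
y = freq_lc(x, wq, wk, wv, window=8)
```

In this layout the window at the top-left of the DCT plane holds the lowest frequencies and windows further right and down hold progressively higher ones, so each window's attention map is computed from one frequency band only.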

Experimental Results

LoFormer improves image deblurring performance across several datasets. Notably, it achieves a Peak Signal-to-Noise Ratio (PSNR) of 34.09 dB on the GoPro dataset at a computational cost of 126G FLOPs. This surpasses existing approaches such as Restormer, which applies channel-wise attention globally in the spatial domain.
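For readers unfamiliar with the metric, PSNR is computed from the mean squared error between a reference image and a restored image. The sketch below uses the standard definition; the function name and the assumption that images are scaled to [0, 1] are illustrative, and the evaluation protocol actually used in the paper (colour handling, averaging across images) is not reproduced here.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a constant error of 0.1 on a [0, 1] image gives MSE = 0.01.
reference = np.zeros((8, 8))
restored = np.full((8, 8), 0.1)
value = psnr(reference, restored)
```

Because the scale is logarithmic, every additional 10 dB corresponds to a tenfold reduction in mean squared error, so differences of a few tenths of a dB between methods reflect small but measurable changes in error.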

By analyzing local frequency content, LoFormer retains detail better than other state-of-the-art deblurring networks and produces clearer deblurred images. Ablation studies confirm the contribution of each LoFormer component, with improvements over baselines that do not use frequency-domain techniques.

Implications and Future Directions

The use of frequency-based attention mechanisms marks a significant step forward in leveraging signal processing methodologies in deep learning architectures for image restoration tasks. LoFormer's architecture suggests that frequency domain transformations can provide crucial advantages in balancing computational efficiency with representational richness, enabling high-quality reconstructions.

Future research can build on these insights in other vision tasks such as super-resolution and denoising, and can investigate other frequency transforms, such as the Fourier transform, within similar architectures. Extending the approach to video processing and other modalities is a further avenue for exploration.

LoFormer stands as a robust, efficient method paving a pathway for more comprehensive applications of frequency domain methodologies in advanced image processing tasks using neural networks.

GitHub

GitHub - DeepMed-Lab-ECNU/Single-Image-Deblur (23 stars)