DeblurDiNAT: A Generalizable Transformer for Perceptual Image Deblurring (2403.13163v4)

Published 19 Mar 2024 in cs.CV

Abstract: Although prior state-of-the-art (SOTA) deblurring networks achieve high metric scores on synthetic datasets, there are two challenges which prevent them from perceptual image deblurring. First, a deblurring model overtrained on synthetic datasets may collapse in a broad range of unseen real-world scenarios. Second, the conventional metrics PSNR and SSIM may not correctly reflect the perceptual quality observed by human eyes. To this end, we propose DeblurDiNAT, a generalizable and efficient encoder-decoder Transformer which restores clean images visually close to the ground truth. We adopt an alternating dilation factor structure to capture local and global blur patterns. We propose a local cross-channel learner to assist self-attention layers to learn short-range cross-channel relationships. In addition, we present a linear feed-forward network and a non-linear dual-stage feature fusion module for faster feature propagation across the network. Compared to nearest competitors, our model demonstrates the strongest generalization ability and achieves the best perceptual quality on mainstream image deblurring datasets with 3%-68% fewer parameters.

DeblurDiNAT: A Comprehensive Approach to Transformer-Based Image Deblurring

The field of image deblurring has experienced a transformative shift with the advent of deep learning architectures, particularly Convolutional Neural Networks (CNNs) and Transformers. Within the Transformer line of work, this paper introduces DeblurDiNAT, a lightweight architecture designed for efficient and effective image deblurring. The work addresses two pervasive issues of Transformer-based models: large model sizes and prolonged inference times.

Overview of DeblurDiNAT Architecture

DeblurDiNAT stands out by employing a compact encoder-decoder structure that centers around an innovative approach to self-attention. The architecture utilizes an alternating dilation factor strategy within the attention mechanism to capture both local and global features from blurry images. This is crucial for scenarios displaying diverse blur artifacts that demand both fine and coarse feature extraction capabilities.
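To make the alternating dilation idea concrete, the sketch below implements a simplified, single-head dilated neighborhood attention in PyTorch and stacks blocks whose dilation factors alternate between 1 (local) and a larger value (global). This is an illustrative reconstruction under stated assumptions, not the authors' code: the class names, the kernel size of 7, and the global dilation factor of 3 are assumptions, and real DiNAT-style blocks would also carry residual connections, normalization, and feed-forward sub-layers.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DilatedNeighborhoodAttention(nn.Module):
    """Simplified single-head dilated neighborhood attention (illustrative only).
    Each pixel attends to a k x k window sampled with the given dilation, so
    dilation=1 captures local blur patterns and larger dilations capture global ones."""
    def __init__(self, dim, kernel_size=7, dilation=1):
        super().__init__()
        self.k, self.d = kernel_size, dilation
        self.scale = dim ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        pad = self.d * (self.k - 1) // 2                    # keep spatial size
        # Gather each pixel's k*k dilated neighborhood: (B, C*k*k, H*W).
        # Zero padding at the borders is attended to, which is fine for a sketch.
        k = F.unfold(k, self.k, dilation=self.d, padding=pad).view(B, C, self.k**2, H*W)
        v = F.unfold(v, self.k, dilation=self.d, padding=pad).view(B, C, self.k**2, H*W)
        q = q.view(B, C, 1, H * W) * self.scale
        attn = (q * k).sum(1, keepdim=True).softmax(dim=2)  # dot products over channels
        out = (attn * v).sum(2).view(B, C, H, W)            # weighted sum of neighbors
        return self.proj(out)

def make_stage(dim, depth, kernel_size=7, global_dilation=3):
    """Stack blocks with alternating dilation factors (1, g, 1, g, ...), mirroring
    the paper's alternating dilation structure; the factor g=3 is an assumption."""
    blocks = [DilatedNeighborhoodAttention(dim, kernel_size,
                                           1 if i % 2 == 0 else global_dilation)
              for i in range(depth)]
    return nn.Sequential(*blocks)
```

Alternating the dilation factor block by block lets successive layers specialize: odd blocks resolve fine, local blur while even blocks aggregate context across a wider receptive field at the same cost per layer.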

Key innovations include the Channel Modulation Self-Attention (CMSA) block, which integrates a Cross-Channel Learner (CCL) to efficiently model interactions between feature channels. This tackles a common shortcoming of self-attention mechanisms, which often fail to adequately model cross-channel relationships in image data. The architecture also incorporates a Divide and Multiply Feed-Forward Network (DMFN), which replaces activation-heavy feed-forward layers with a streamlined design built on element-wise multiplications for swift feature propagation. Additionally, a Lightweight Gated Feature Fusion (LGFF) module enables effective multi-scale feature integration without the high computational cost typically associated with elaborate fusion procedures.
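The module names suggest natural minimal forms, sketched below (reusing the imports from the previous sketch). These are hedged reconstructions, not the authors' implementations: the CCL is rendered as an ECA-style 1-D convolution over pooled channel descriptors, the DMFN as two linear branches fused by element-wise multiplication, and the LGFF as a gated 1x1-convolution fusion; all class names and hyperparameters here are hypothetical.

```python
class CrossChannelLearner(nn.Module):
    """Hypothetical CCL sketch: models short-range cross-channel relationships by
    sliding a small 1-D convolution across pooled per-channel descriptors, so each
    channel's gate depends on its neighboring channels."""
    def __init__(self, span=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, span, padding=span // 2)

    def forward(self, x):                                   # x: (B, C, H, W)
        w = x.mean(dim=(2, 3)).unsqueeze(1)                 # (B, 1, C) descriptors
        w = torch.sigmoid(self.conv(w)).squeeze(1)          # mix neighboring channels
        return x * w[:, :, None, None]

class DivideMultiplyFFN(nn.Module):
    """DMFN sketch: 'divide' features into two linear branches and 'multiply' them
    element-wise, standing in for a heavier activation-laden feed-forward block."""
    def __init__(self, dim, expand=2):
        super().__init__()
        self.inner = nn.Conv2d(dim, dim * expand * 2, 1)
        self.outer = nn.Conv2d(dim * expand, dim, 1)

    def forward(self, x):
        a, b = self.inner(x).chunk(2, dim=1)                # divide
        return self.outer(a * b)                            # multiply

class LightweightGatedFusion(nn.Module):
    """LGFF sketch: fuse two same-resolution feature maps with a learned sigmoid
    gate instead of an elaborate multi-branch fusion block."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim * 2, dim, 1)
        self.gate = nn.Conv2d(dim, dim, 1)

    def forward(self, a, b):
        f = self.reduce(torch.cat([a, b], dim=1))
        return f * torch.sigmoid(self.gate(f))
```

The common thread is replacing costly non-linear machinery with cheap multiplicative gating, which is consistent with the paper's emphasis on faster feature propagation at low parameter counts.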

Quantitative and Qualitative Performance

The experimental results underscore the balance DeblurDiNAT strikes between efficiency and performance. On standard benchmarks such as GoPro, HIDE, RealBlur-R, and RealBlur-J, DeblurDiNAT matches, and in some cases surpasses, state-of-the-art (SOTA) performance with a notably lower computational footprint. In particular, DeblurDiNAT-L outperforms models such as FFTformer in efficiency, with up to 68% fewer parameters and faster inference, while remaining competitive on metrics like PSNR and SSIM. Beyond raw efficiency, the model generalizes robustly across both synthetic and real-world datasets, a testament to its architectural focus on balanced global-local feature learning.

Implications and Future Directions

The implications of DeblurDiNAT extend to a variety of applications where image quality is paramount, from autonomous vehicles to medical imaging. The lightweight nature of this model allows for deployment in resource-constrained environments, a notable advancement over previous architectures whose extensive resource requirements limited practical applicability. The introduction of CMSA and DMFN specifically could inspire further exploration into adaptive attention mechanisms and efficient feed-forward processes across other vision tasks, broadening the scope of efficient Transformer applications beyond deblurring.

Looking ahead, the concepts within DeblurDiNAT may inform the development of more generalized frameworks for handling diverse image restoration tasks. Future research could explore the integration of additional context, such as temporal information in video, or self-supervised approaches that leverage blurring-deblurring cycles to achieve yet more refined image quality. The lightweight fusion strategies also open a dialogue on enhanced multi-scale processing suitable for a range of computer vision challenges.

In summary, DeblurDiNAT offers a strategic combination of novel techniques and proven methodologies that substantiate its place as an effective solution for contemporary deblurring challenges, opening an avenue for further innovation within the Transformer paradigm.

Authors (4)
  1. Hanzhou Liu (4 papers)
  2. Binghan Li (5 papers)
  3. Chengkai Liu (10 papers)
  4. Mi Lu (8 papers)