- The paper introduces a Taylor expansion-based approximation of softmax attention, reducing its complexity from quadratic to linear while retaining long-range dependency modeling.
- It integrates a multi-scale attention refinement module that mitigates approximation errors with localized corrections for improved image dehazing.
- The multi-branch design with deformable convolutions enhances multi-level semantic extraction, outperforming state-of-the-art methods on key benchmarks.
Overview of MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing
The paper introduces MB-TaylorFormer, a Transformer-based model tailored to image dehazing. The approach harnesses a Taylor expansion to approximate the softmax attention mechanism, reducing the computational complexity of self-attention from quadratic to linear in the number of tokens. This directly addresses the computational inefficiency that typically limits Transformer models in high-resolution, pixel-intensive tasks such as image dehazing.
Key Contributions
- Taylor Expansion for Efficient Attention: The paper applies a first-order Taylor series expansion to approximate the softmax function in the attention mechanism, yielding linear computational complexity. The linear cost comes from exploiting the associativity of matrix multiplication, so the model retains its ability to capture long-range dependencies without restricting the receptive field to local windows (a minimal sketch of this linearized attention follows this list).
- Multi-scale Attention Refinement: To counteract the error introduced by truncating the Taylor series, a multi-scale attention refinement module is added. It refines the self-attention output using local image correlations, acting as a learned correction on top of the approximation (see the refinement sketch after this list).
- Multi-branch Architecture: MB-TaylorFormer employs a multi-branch design built on multi-scale patch embedding. Overlapping deformable convolutions at different scales give the branches diverse receptive fields and levels of semantic information. The design rests on three principles: variable receptive-field sizes, multi-level semantics, and flexible receptive-field shapes, which let the model process features from coarse to fine granularity (a patch-embedding sketch also follows).
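Below is a minimal sketch of first-order Taylor-expansion attention in PyTorch, assuming a generic multi-head layout; the query/key normalization and the exact form of the expansion are implementation assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def taylor_linear_attention(q, k, v, eps=1e-6):
    """First-order Taylor approximation of softmax attention.

    exp(q . k) is replaced by 1 + q . k, so each query's output becomes
    (sum_j v_j + q (K^T V)) / (N + q sum_j k_j). Computing K^T V first
    (associativity of matrix multiplication) makes the cost linear in N.
    q, k, v: tensors of shape (batch, heads, N, head_dim).
    """
    # L2-normalize q and k so q . k stays bounded and the truncated series
    # behaves well (an implementation assumption, not necessarily the paper's recipe).
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    n = q.shape[-2]

    kv = torch.einsum('bhnd,bhne->bhde', k, v)      # K^T V: O(N * d^2)
    k_sum = k.sum(dim=-2)                           # sum_j k_j: (batch, heads, d)

    numerator = v.sum(dim=-2, keepdim=True) + torch.einsum('bhnd,bhde->bhne', q, kv)
    denominator = n + torch.einsum('bhnd,bhd->bhn', q, k_sum)
    return numerator / (denominator.unsqueeze(-1) + eps)
```

For a feature map of H x W pixels (N = HW), this replaces the O(N^2) attention matrix with two O(N d^2) contractions, which is what makes full-resolution global attention affordable.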
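As a rough illustration of how a local correction could be attached to the approximated attention, the following gate modulates the attention output with depthwise-convolved local features; this is a hypothetical stand-in, since the paper's refinement module is not reproduced here.

```python
import torch.nn as nn

class LocalRefinement(nn.Module):
    """Hypothetical local correction applied after the approximated attention.

    A depthwise 3x3 convolution gathers local context and produces a gate that
    modulates the globally attended features; the paper's actual module may differ.
    """
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # local context
            nn.Sigmoid(),
        )

    def forward(self, attn_out, feat):
        # attn_out: output of the linearized attention, feat: local features,
        # both shaped (batch, dim, H, W).
        return attn_out * self.gate(feat)
```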
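The multi-scale patch embedding could be sketched as parallel deformable convolutions with different kernel sizes (built on torchvision's DeformConv2d); the kernel sizes, channel widths, and offset predictors below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiScalePatchEmbed(nn.Module):
    """Illustrative multi-scale patch embedding with deformable convolutions.

    Each branch uses a deformable convolution of a different kernel size, so the
    resulting token maps carry receptive fields of different sizes and shapes.
    Kernel sizes and channel widths are placeholder values.
    """
    def __init__(self, in_ch=3, embed_dim=24, kernel_sizes=(3, 7)):
        super().__init__()
        # one offset predictor per branch: 2 offsets (x, y) per kernel position
        self.offsets = nn.ModuleList([
            nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])
        self.embeds = nn.ModuleList([
            DeformConv2d(in_ch, embed_dim, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        # returns one token map per branch, all at the input resolution
        return [embed(x, offset(x)) for embed, offset in zip(self.embeds, self.offsets)]
```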
Experimental Validation
MB-TaylorFormer is compared against multiple state-of-the-art methods on several image dehazing benchmarks. It achieves higher PSNR and SSIM than other leading models, and does so with a lower computational load and fewer parameters.
- On the SOTS-Indoor benchmark, MB-TaylorFormer-B achieves a PSNR of 40.71 dB, outperforming its counterparts with much lower parameter count and MACs.
- The model's robustness is further validated on both O-HAZE and Dense-Haze datasets, where it consistently delivers superior dehazing performance.
Implications and Future Directions
The developments presented in MB-TaylorFormer have significant implications for both the theoretical and applied domains of computer vision. The transition to linear complexity in attention mechanisms paves the way for their application in other resolution-sensitive vision tasks, potentially extending to real-time implementations in resource-constrained environments.
In future research, it would be valuable to explore integrating the Taylor-expansion strategy into other Transformer architectures and domains. Investigating the theoretical underpinnings and limits of the approximation could also yield further optimizations.
In summary, MB-TaylorFormer represents a significant advance in applying Transformer models to low-level vision tasks, particularly in its efficient use of computational resources while maintaining strong performance. This work lays a foundation for further innovations in efficient Transformer designs and their applications across diverse fields of machine learning and artificial intelligence.