- The paper introduces DehazeFormer, a novel Vision Transformer variant that enhances image dehazing using innovations like RescaleNorm and SoftReLU.
- Its large variant exceeds 40 dB PSNR on the SOTS indoor set while requiring fewer parameters and less computation than traditional CNN approaches.
- The study also presents the RS-Haze dataset for remote sensing, establishing a new benchmark for evaluating non-homogeneous haze scenarios.
Vision Transformers for Single Image Dehazing
The paper, titled "Vision Transformers for Single Image Dehazing", introduces DehazeFormer, a novel Transformer-based architecture tailored to single-image dehazing. Traditional image dehazing methods have relied heavily on Convolutional Neural Networks (CNNs); this paper instead explores Vision Transformers (ViTs) for the task. The authors identify key shortcomings of the Swin Transformer when applied to low-level vision tasks such as image dehazing and propose modifications to address them.
Key Contributions
- DehazeFormer Design:
- The paper introduces DehazeFormer, proposing improvements such as modified normalization layers, activation functions, and spatial information aggregation schemes.
- Rescale Layer Normalization (RescaleNorm) is introduced to preserve inter-patch relativity and reintroduce the statistics of the feature map post-normalization.
- A new non-linear activation function, SoftReLU, is presented to replace GELU, which the authors argue is less suitable for image dehazing.
- Shifted window partitioning with reflection padding is proposed to maintain constant window sizes along image edges, addressing limitations of the cyclic shift in Swin Transformers.
- High Performance:
- DehazeFormer achieves significant improvements over traditional CNN-based methods. On the SOTS indoor set, the large DehazeFormer model exceeds 40 dB PSNR, placing it among the strongest image dehazing methods.
- Introduction of a Remote Sensing Dataset:
- The authors introduce a synthetic dataset, RS-Haze, to evaluate performance in the context of highly non-homogeneous haze, which is prevalent in remote sensing images.
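The normalization idea behind RescaleNorm can be illustrated with a minimal sketch: normalize a feature map but retain its statistics, so they can be reintroduced after the transformer block rather than discarded. This is an illustrative simplification, not the paper's implementation; the function names and the flat-list stand-in for a feature map are ours.

```python
import math

def rescale_norm(feature, eps=1e-5):
    """Normalize a feature map (here a flat list of floats) to zero mean
    and unit variance, returning the statistics so they can be
    reintroduced later instead of being lost."""
    n = len(feature)
    mean = sum(feature) / n
    var = sum((x - mean) ** 2 for x in feature) / n
    std = math.sqrt(var + eps)
    normalized = [(x - mean) / std for x in feature]
    return normalized, mean, std

def reintroduce_stats(block_output, mean, std):
    """Rescale the block's output with the saved statistics, so the
    feature map's overall magnitude is preserved across the block."""
    return [y * std + mean for y in block_output]

# Usage: with an identity "block", reintroducing the statistics
# recovers the original feature map.
feat = [1.0, 2.0, 3.0, 4.0]
normed, mu, sigma = rescale_norm(feat)
restored = reintroduce_stats(normed, mu, sigma)
```

Because the whole map shares one mean and one standard deviation, the normalization does not alter the relative ordering of patch activations, which is the "inter-patch relativity" the paper aims to preserve.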
Results and Evaluation
The paper provides a detailed experimental setup for both training and testing across several datasets, including RESIDE and RS-Haze. The quantitative results show that DehazeFormer not only outperforms CNN-based methods but also requires fewer parameters and less computation.
- On RESIDE datasets, DehazeFormer consistently shows superior results in PSNR and SSIM when compared to well-established methods like FFA-Net and AECR-Net.
- Qualitative comparisons show that DehazeFormer produces clearer images with fewer artifacts.
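For context on the PSNR figures quoted above, the metric itself is simple to compute: it is the log-scaled ratio of the peak signal power to the mean squared error between a restored image and its reference. A minimal sketch over flat pixel lists (the sample values here are illustrative, not from the paper's experiments):

```python
import math

def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images, given as
    flat lists of pixel values in [0, max_val]."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# A per-pixel error of about 0.01 on a [0, 1] scale already corresponds
# to roughly 40 dB, the regime reported for DehazeFormer's large model.
value = psnr([0.0, 0.5, 1.0], [0.01, 0.5, 0.99])  # ~41.8 dB
```

Higher is better: every 20 dB corresponds to a 10x reduction in root-mean-square error, so differences of even 1 dB between methods are meaningful.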
Practical and Theoretical Implications
The introduction of DehazeFormer offers an alternative to CNN-dominated methods, leveraging the strengths of Vision Transformers. This approach can potentially reshape dehazing methodologies by improving accuracy and efficiency. The architectural innovations suggest broader applicability beyond dehazing, possibly influencing other low-level vision tasks.
Future Directions
The research opens new avenues for extending Transformer architectures to low-level vision tasks. Future investigations could focus on:
- Streamlining the model for real-time dehazing applications.
- Extending the proposed architectural paradigms to other vision applications beyond dehazing.
- Exploring the potential of the introduced RS-Haze dataset for training models specific to remote sensing applications.
The paper underscores the transformative potential of Vision Transformers in image dehazing, presenting a compelling case for their adoption in future research and applications.