DehazeFormer: Transformer-Based Image Dehazing
- DehazeFormer is a transformer-based dehazing network that leverages a modified Swin-Transformer and a physical haze model for precise image restoration.
- It introduces key innovations like RescaleNorm, selective skip fusion, and combined window-based attention with local convolution to boost quantitative performance.
- The model achieves state-of-the-art PSNR/SSIM on standard benchmarks while significantly reducing compute and parameter requirements compared to CNN-based methods.
DehazeFormer refers to a family of single-image dehazing networks that unify vision transformer architectures with tailored normalization, attention, and aggregation mechanisms for robust haze removal on both synthetic and real-world datasets. Developed to address the limitations of CNN-centric approaches, DehazeFormer advances the quantitative state of the art while dramatically reducing compute and parameter count compared to prior methods such as FFA-Net and AECR-Net. The following article systematically details DehazeFormer's principles, architecture, algorithmic innovations, evaluation protocols, and influence on subsequent dehazing frameworks (Song et al., 2022).
1. Physical Degradation Model Underpinning DehazeFormer
DehazeFormer, as with most physical-model-inspired image restoration networks, grounds its predictive process in the atmospheric scattering (haze) model:
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)$$
Here, $I(x)$ is the observed hazy pixel, $J(x)$ the latent clean image, $A$ a global 3D atmospheric light vector, and $t(x)$ is the transmission map, typically parameterized as
$$t(x) = e^{-\beta d(x)}$$
where $\beta$ is the scattering coefficient and $d(x)$ is the scene depth at pixel $x$.
DehazeFormer is not a direct inversion of this formula; instead, it incorporates inductive biases (e.g., explicit skip connections and spatial context modules) that allow its transformer backbone to efficiently encode local and global information relevant to this physical process (Song et al., 2022).
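To make the role of the scattering model concrete, the following is a minimal NumPy sketch of the forward haze synthesis and its closed-form inversion when the transmission and atmospheric light are known. The function names, the toy image, the depth range, and the clamp `t_min` are illustrative assumptions, not part of DehazeFormer itself, which learns the mapping rather than inverting the formula.

```python
import numpy as np

def synthesize_haze(J, depth, A, beta=1.0):
    """Apply I = J * t + A * (1 - t) with t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)[..., None]  # (H, W, 1), broadcast over RGB
    return J * t + A * (1.0 - t), t

def invert_haze(I, t, A, t_min=0.05):
    """Recover J when t and A are known; clamp t to keep the division stable."""
    return (I - A * (1.0 - t)) / np.maximum(t, t_min)

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))    # toy clean image
depth = rng.uniform(0.1, 2.0, size=(4, 4))   # toy depth map
A = np.array([0.9, 0.9, 0.95])               # assumed atmospheric light

I, t = synthesize_haze(J, depth, A, beta=1.0)
J_rec = invert_haze(I, t, A)
print(np.allclose(J, J_rec))
```

In practice neither the transmission nor the atmospheric light is observed, which is why a learned network is used instead of this direct inversion.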
2. Architectural Overview
The architecture of DehazeFormer is defined by a five-stage, U-Net–style encoder–decoder employing customized Swin-Transformer-derived blocks with the following key specifications:
- Input preprocessing: A strided convolution (patch embedding) lowers the spatial resolution by 1/4, generating a tokenized feature map.
- Network stages: Five consecutive encoding and decoding stages operate at successively reduced and then restored spatial resolutions, with skip connections preserving spatial detail.
- Transformation block: At each stage, the Swin-Transformer block is replaced by a DehazeFormer block which combines window-based multi-head self-attention (MHSA) and parallel depthwise convolutional aggregation.
- Skip fusion: Instead of naive concatenation, skip connections are merged via a Selective-Kernel (SK) fusion module, dynamically blending local and skip features.
- Soft reconstruction: The output is computed through network-predicted per-pixel maps $K$ (scale) and $B$ (bias), reconstructing the dehazed image from the hazy input $I$ by:
$$\hat{J}(x) = K(x)\,I(x) + B(x) + I(x)$$
- Model variants: Depth, embedding dimension, and attention head count are specified per stage for Tiny (T), Small (S), Basic (B), Middle (M), and Large (L) variants. For example, the “S” model uses depth [8,8,8,4,4], dims [24,48,96,48,24], and heads [2,4,6,1,1].
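The soft reconstruction step can be related back to the haze model with a short sketch. It assumes the reconstruction takes the residual affine form "scale times input, plus bias, plus input"; the names `K`, `B`, and `soft_reconstruct` are chosen here for illustration. Under that assumption, setting the scale to `1/t - 1` and the bias to `-(1/t - 1) * A` reproduces the exact inversion of the scattering model, whereas a trained network would predict both maps freely.

```python
import numpy as np

def soft_reconstruct(I, K, B):
    """Residual affine reconstruction: J_hat = K * I + B + I."""
    return K * I + B + I

rng = np.random.default_rng(1)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))    # toy clean image
t = rng.uniform(0.2, 0.9, size=(4, 4, 1))    # toy transmission map
A = np.array([0.9, 0.9, 0.95])               # assumed atmospheric light
I = J * t + A * (1.0 - t)                    # hazy observation

# Maps implied by the haze model; a network would output K and B instead
K = 1.0 / t - 1.0             # per-pixel scale, shape (4, 4, 1)
B = -(1.0 / t - 1.0) * A      # per-pixel bias, shape (4, 4, 3)

J_hat = soft_reconstruct(I, K, B)
print(np.allclose(J_hat, J))
```

Because the maps are unconstrained, this form can represent the physical inversion as a special case without forcing the network to estimate the transmission and atmospheric light explicitly.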
3. Core Algorithmic Innovations
DehazeFormer diverges from prior transformer and CNN dehazing methods by introducing several optimizations shown through ablation studies to be critical for performance and efficiency:
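- RescaleNorm: Unlike per-token LayerNorm, RescaleNorm normalizes the full feature map and re-introduces its mean $\mu$ and standard deviation $\sigma$ after the transformer block through learnable affine transforms:
$$y = F\!\left(\frac{x - \mu}{\sigma}\,\gamma + \beta\right)\left(\sigma W_\gamma + B_\gamma\right) + \left(\mu W_\beta + B_\beta\right)$$
where $F$ denotes the main branch of the block, $\gamma$ and $\beta$ are the usual learnable scale and shift of the normalization, and $W_\gamma$, $B_\gamma$, $W_\beta$, $B_\beta$ are the learnable parameters of the affine transforms applied to $\sigma$ and $\mu$.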
- Activation Function Selection: Unlike standard vision transformers, which use GELU, DehazeFormer employs ReLU or LeakyReLU. Both are monotonic and piecewise linear, hence easier to invert, and they are empirically superior for dehazing.
- Spatial Information Aggregation: Each attention block fuses conventional windowed MHSA with a parallel local convolution branch over all windows, explicitly aggregating neighborhood information inaccessible via pure attention—improving the restoration of fine-scale textures.
- Window Partitioning: Instead of Swin’s cyclic shift and mask (which induces edge token shrinking), DehazeFormer applies reflection padding before shifting, ensuring all windows are full-sized and spatial information is preserved even at boundaries.
- SK Fusion of Skip Connections: The SK module adaptively weights features from encoder and decoder streams, maintaining a balance between low-level detail and high-level semantics in the reconstructed output.
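Two of the ideas above, normalization with re-introduced statistics and reflection-padded window partitioning, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than the reference implementation: statistics are taken per sample over the whole feature map, the affine parameters (`w_g`, `b_g`, `w_b`, `b_b`) are plain per-channel arrays, the shift is applied by padding the top and left edges, and all function names are hypothetical.

```python
import numpy as np

def rescale_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample of x (B, C, H, W) over its whole feature map.

    Returns the normalized tensor plus the statistics needed later.
    """
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=(1, 2, 3), keepdims=True) + eps)
    return (x - mu) / sigma * gamma + beta, mu, sigma

def reintroduce_stats(y, mu, sigma, w_g, b_g, w_b, b_b):
    """Rescale and rebias a block output with affine functions of sigma and mu."""
    return y * (sigma * w_g + b_g) + (mu * w_b + b_b)

def reflect_pad_windows(x, window, shift=0):
    """Reflection-pad x (B, C, H, W) so that it splits into full windows only."""
    _, _, H, W = x.shape
    pad_h = (window - (H + shift) % window) % window
    pad_w = (window - (W + shift) % window) % window
    x = np.pad(x, ((0, 0), (0, 0), (shift, pad_h), (shift, pad_w)), mode="reflect")
    B, C, Hp, Wp = x.shape
    x = x.reshape(B, C, Hp // window, window, Wp // window, window)
    return x.transpose(0, 2, 4, 1, 3, 5).reshape(-1, C, window, window)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 3, 6, 6))
ones = np.ones((1, 3, 1, 1))
zeros = np.zeros((1, 3, 1, 1))

# With an identity block and identity affine parameters, the input is recovered
x_hat, mu, sigma = rescale_norm(x, gamma=ones, beta=zeros)
y = reintroduce_stats(x_hat, mu, sigma, w_g=ones, b_g=zeros, w_b=ones, b_b=zeros)

# 6x6 map, window 4, shift 2: padded to 8x8, giving 2 x 2 windows per sample
windows = reflect_pad_windows(x, window=4, shift=2)
print(np.allclose(y, x), windows.shape)
```

The round trip shows why the statistics matter: a plain normalization discards the overall brightness and contrast of the feature map, while re-introducing them lets the block preserve that information. The padding step guarantees that every window has the full `window x window` size, so no masking of partial windows is needed.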
4. Training Protocols and Loss Functions
DehazeFormer employs a minimalistic yet effective training regime:
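- Objective: A single pixel-wise $L_1$ loss
$$\mathcal{L} = \lVert \hat{J} - J \rVert_1$$
where $\hat{J}$ is the predicted image and $J$ the ground-truth clean image.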
- Datasets: Comprehensive evaluation on RESIDE-Full (ITS, OTS), RESIDE-6K (SOTS-mix), RS-Haze (remote sensing), and newly collected non-homogeneous haze benchmarks.
- Augmentation and optimization: Random cropping to fixed-size patches, horizontal flipping (for outdoor data), and the AdamW optimizer, with batch size and learning rate set per variant.
- No auxiliary losses: Unlike some contemporaries, DehazeFormer relies solely on the $L_1$ loss, eschewing perceptual or SSIM-based objectives.
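The training recipe above can be sketched in a few lines of NumPy: a mean absolute error objective and a paired augmentation that applies the same random crop and horizontal flip to the hazy input and its ground truth. The function names, the toy image size, and the crop size of 16 are illustrative assumptions and do not reflect the actual training configuration.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between prediction and ground truth."""
    return np.abs(pred - target).mean()

def paired_augment(hazy, clear, crop, rng, flip=True):
    """Apply the same random crop (and optional horizontal flip) to both images."""
    H, W = hazy.shape[:2]
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    hazy = hazy[top:top + crop, left:left + crop]
    clear = clear[top:top + crop, left:left + crop]
    if flip and rng.random() < 0.5:
        hazy, clear = hazy[:, ::-1], clear[:, ::-1]
    return hazy, clear

rng = np.random.default_rng(0)
clear = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # toy ground-truth image
hazy = 0.5 * clear + 0.4                          # toy uniform haze

hazy_patch, clear_patch = paired_augment(hazy, clear, crop=16, rng=rng)
loss = l1_loss(hazy_patch, clear_patch)           # loss of an "identity" dehazer
print(hazy_patch.shape, float(loss))
```

Sharing the crop offsets and flip decision between the two images is the essential point: any misalignment between input and target would make a pixel-wise loss meaningless.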
5. Quantitative and Qualitative Performance
Experimental results on standard benchmarks demonstrate both the efficiency and raw restoration capability of DehazeFormer:
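- SOTS Indoor (RESIDE ITS): DehazeFormer-L achieves 40.05 dB / 0.996 SSIM, the first model reported with PSNR > 40 dB; DehazeFormer-S yields 36.82 dB / 0.992, outperforming FFA-Net (36.39 dB / 0.989) with roughly a quarter of its parameters and 5% of its MACs. Against AECR-Net, DehazeFormer-S is higher in SSIM (0.992 vs. 0.990) but slightly lower in PSNR (36.82 vs. 37.17 dB), at about half the parameters and a quarter of the MACs.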
- SOTS-mix and RS-Haze: Small/Basic models surpass FFA-Net and AECR-Net, establishing robustness on more challenging, mixed-source or remote sensing haze.
- Ablation findings: Each technical refinement—RescaleNorm, reflection-based shifting, ReLU activations, window+conv fusion—individually raises quantitative performance by 0.1–1.8 dB, demonstrating necessity and synergy.
- Qualitative restoration: DehazeFormer-S restores natural color and edge contrast and clears haze on distant objects more effectively than AOD-Net (color bias), GCANet (over-brightening), and PFDN/FFA-Net (edge artifacts).
| Model | Params (M) | MACs (G) | PSNR (dB) | SSIM |
|---|---|---|---|---|
| FFA-Net | 4.46 | 287.8 | 36.39 | 0.989 |
| AECR-Net | 2.61 | 52.2 | 37.17 | 0.990 |
| DehazeFormer-S | 1.28 | 13.1 | 36.82 | 0.992 |
| DehazeFormer-B | 2.51 | 25.8 | 37.84 | 0.994 |
| DehazeFormer-L | 25.44 | 279.7 | 40.05 | 0.996 |
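The efficiency claims can be checked directly from the table with a little arithmetic. The sketch below computes the parameter and MAC ratios of DehazeFormer-S relative to the two CNN baselines, and includes a standard PSNR definition to show what the 40 dB threshold means in terms of mean squared error. The helper name `psnr_from_mse` and the assumption of intensities scaled to [0, 1] are illustrative choices.

```python
import math

# (params in millions, MACs in billions), copied from the table above
table = {
    "FFA-Net": (4.46, 287.8),
    "AECR-Net": (2.61, 52.2),
    "DehazeFormer-S": (1.28, 13.1),
}

params_s, macs_s = table["DehazeFormer-S"]
ratios = {}
for name in ("FFA-Net", "AECR-Net"):
    params, macs = table[name]
    ratios[name] = (params_s / params, macs_s / macs)
    print(f"{name}: params {ratios[name][0]:.1%}, MACs {ratios[name][1]:.1%}")

def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB for a given mean squared error and peak intensity."""
    return 10.0 * math.log10(peak ** 2 / mse)

print(psnr_from_mse(1e-4))
```

Relative to FFA-Net, the small model comes out at about 28.7% of the parameters and 4.6% of the MACs, close to the rounded figures quoted in the text. With a peak intensity of 1, a PSNR of 40 dB corresponds to a mean squared error of 1e-4, that is, a root-mean-square error of about 1% of the intensity range.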
6. Comparison to Related Transformers and Implications
DehazeFormer is foundational for subsequent transformer-based dehazing systems and variants such as Semi-UFormer (Tong et al., 2022), which utilize its backbone as the dehazing module and extend its design for semi-supervised learning and uncertainty modeling. The DehazeFormer block (window-MHSA plus parallel convolution) is reused and adapted for physical-model-aware, high-resolution, and real-time transformers operating on diverse haze types. Its principles (window partitioning, fusion strategies, normalization) have informed subsequent advances in transformer-based low-level vision.
A plausible implication is that further relaxing transformer inductive biases, e.g., via purely global attention or by hybridizing with more advanced physical models, may yield even more general dehazing solutions. Yet DehazeFormer's deliberate alteration of high-level transformer structures underscores the necessity for domain adaptation and task-specific customization in vision transformer architectures.
7. Impact, Limitations, and Outlook
DehazeFormer represents the first transformer-based single-image dehazing network to set a new state of the art in PSNR/SSIM on both indoor and outdoor benchmarks while operating at a fraction of the compute and parameter cost. It demonstrates that vision transformers, suitably modified, achieve superior detail recovery and robustness in image restoration tasks historically dominated by CNNs.
Limitations include the reliance on supervised training with synthetic datasets, which may pose a domain gap on real-world images—a challenge that later frameworks such as Semi-UFormer (Tong et al., 2022) address via uncertainty-guided knowledge distillation and semi-supervised protocols. Additionally, the explicit dehazing formulation remains specialized; transferability to broader degradations (e.g., underwater, fog) or joint restoration/segmentation tasks is an active area of research.
DehazeFormer’s modular design and efficient implementation continue to underpin recent dehazing innovations, serving as a versatile template for transformer-based low-level vision architectures (Song et al., 2022, Tong et al., 2022).