UDPNet: Depth-Guided Image Dehazing

Updated 18 January 2026
  • The paper introduces UDPNet, which injects fixed, pretrained depth priors via a U-shaped network to achieve notable PSNR improvements.
  • Methodology centers on a Depth-Guided Attention Module (DGAM) and a Depth Prior Fusion Module (DPFM) that fuse RGB and depth features using efficient cross-attention.
  • Empirical results show state-of-the-art dehazing across diverse benchmarks with minimal overhead (~0.3M additional parameters) and real-time inference (~30 fps).

UDPNet is a general framework for robust image dehazing that leverages high-quality depth-based priors to enhance standard deep learning dehazing models. The method systematically integrates depth information, predicted by the large-scale pretrained DepthAnything V2 model, into a U-shaped dehazing backbone through two lightweight modules: the Depth-Guided Attention Module (DGAM) and the Depth Prior Fusion Module (DPFM). UDPNet achieves state-of-the-art results on multiple benchmarks while maintaining computational efficiency and real-time performance (Zuo et al., 11 Jan 2026).

1. Motivation and Core Innovations

Conventional image dehazing architectures predominantly rely on single-modal RGB information, often neglecting the direct correlation between haze distribution and scene depth, as articulated by the Atmospheric Scattering Model's transmission term t(x,y) = e^{-\beta d(x,y)}. Prior efforts to jointly estimate depth and recover clean images typically employ noisy online depth learning or simplistic feature concatenation, resulting in only marginal performance gains.
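As a quick illustration, the transmission term above can be computed directly from a depth map. The NumPy sketch below uses an arbitrary scattering coefficient β and a toy depth map (both hypothetical values, not from the paper) to show how transmission decays with depth:

```python
import numpy as np

def transmission_map(depth: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Transmission t(x, y) = exp(-beta * d(x, y)) from the
    Atmospheric Scattering Model: larger depth -> heavier haze."""
    return np.exp(-beta * depth)

# Toy 2x2 depth map (arbitrary units).
d = np.array([[0.0, 1.0],
              [2.0, 3.0]])
t = transmission_map(d, beta=0.5)
# The nearest pixel transmits fully; the farthest transmits least.
assert t[0, 0] == 1.0 and t.argmin() == 3
```

This monotone link between depth and haze density is exactly what motivates injecting depth priors into the dehazing network.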

UDPNet addresses these limitations by injecting accurate, robust geometric priors from DepthAnything V2, a multi-domain large-scale pretrained depth estimation model, without the noise introduced by extra online depth learning. The proposed architecture incorporates these priors using:

  1. A frozen DepthAnything V2 module that provides depth prior maps for any input while remaining fixed during both training and inference.
  2. The Depth-Guided Attention Module (DGAM), facilitating computationally efficient channel-wise RGB-depth fusion at the input level.
  3. The Depth Prior Fusion Module (DPFM), which implements hierarchical, cross-modal feature fusion via a dual sliding-window multi-head cross-attention mechanism at every encoder stage.

These innovations enable UDPNet to generalize across synthetic and real data, daytime and nighttime settings, and diverse haze densities, setting new performance benchmarks.

2. Architecture: Modules and Integration

UDPNet integrates depth-based priors through explicit architectural augmentations:

2.1 Depth-Guided Attention Module (DGAM)

DGAM is positioned at the network's head to produce depth-aware shallow features with minimal additional computational cost (~0.1 M parameters). Its operations include:

  • Depth refinement: The single-channel raw depth map D \in \mathbb{R}^{1 \times H \times W} predicted by DepthAnything V2 is processed by a 3-layer Conv-ReLU network to yield a refined depth map \hat{D}.
  • RGB-D projection: The input image I (3 channels) is concatenated with \hat{D} and passed through a 3 \times 3 convolution followed by InstanceNorm and GELU activation to produce initial features F_{conv}.
  • Depth-guided channel attention: Global average pooling yields a channel descriptor, transformed by two fully-connected layers with GELU and sigmoid activations to produce per-channel weights. These weights modulate the concatenated features:

F_{DGAM} = [I, \hat{D}] \odot \sigma(\mathrm{FC}_2(\mathrm{GELU}(\mathrm{FC}_1(\mathrm{GAP}(F_{conv})))))

DGAM surpasses naive depth concatenation by ~0.2 dB PSNR on Haze4K.
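A minimal NumPy sketch of this channel-attention path follows. The shapes, random weights, and tanh-approximated GELU are illustrative stand-ins for the learned layers, and the convolutional front end is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dgam(concat_feats, f_conv, w1, w2):
    """Depth-guided channel attention (illustrative sketch).
    concat_feats: (C, H, W) concatenation [I, D_hat];
    f_conv:       (C, H, W) features from the initial 3x3 conv."""
    gap = f_conv.mean(axis=(1, 2))          # global average pooling -> (C,)
    gates = sigmoid(w2 @ gelu(w1 @ gap))    # two FC layers -> per-channel weights
    return concat_feats * gates[:, None, None]  # channel-wise modulation

C, H, W = 4, 8, 8                    # 4 channels = 3 RGB + 1 depth
x = rng.standard_normal((C, H, W))   # stands in for [I, D_hat]
f = rng.standard_normal((C, H, W))   # stands in for F_conv
w1 = rng.standard_normal((C // 2, C))  # squeeze FC
w2 = rng.standard_normal((C, C // 2))  # excite FC back to C channels
out = dgam(x, f, w1, w2)
assert out.shape == (C, H, W)
```

The sigmoid keeps every channel gate in (0, 1), so depth-informed statistics rescale, rather than replace, the shallow RGB-D features.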

2.2 Depth Prior Fusion Module (DPFM)

DPFM hierarchically fuses intermediate image and depth features at each encoder scale. Key components:

  • Overlapping window attention: Image features X \in \mathbb{R}^{H \times W \times C} and corresponding depth features F_D are partitioned into overlapping windows, with non-overlapping queries and overlapping keys/values of size M_{ov} = (1 + r) M.
  • Dual cross-attention branches: Within each window, two attention mechanisms operate:
    • Depth-guided: queries from depth, keys/values from image.
    • Image-guided: queries from image, keys/values from depth.
  • Aggregation and output: Aggregated attention outputs are merged, followed by feedforward and normalization layers with residual connections:

Y_w = \mathrm{Attn}_{D \to X} + \mathrm{Attn}_{X \to D} + X_w

Each DPFM block adds roughly 0.05 M parameters (see the breakdown in Section 4) and contributes ~0.2–0.3 dB PSNR gain per branch. Using two attention heads is optimal.
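The dual-branch computation can be sketched for a single window in NumPy. This is a simplified single-head version with random features and no learned Q/K/V projections, so all shapes and names here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q_tokens, kv_tokens, scale):
    """Single-head cross-attention: queries attend to keys/values
    drawn from the other modality's (overlapping) window."""
    attn = softmax(q_tokens @ kv_tokens.T * scale)
    return attn @ kv_tokens

# Non-overlapping query windows of M=4 tokens per side; overlapping
# key/value windows of M_ov = (1 + r) * M with overlap ratio r = 0.5.
M, r, C = 4, 0.5, 8
M_ov = int((1 + r) * M)
x_w  = rng.standard_normal((M * M, C))        # image-feature window X_w
d_w  = rng.standard_normal((M * M, C))        # depth-feature window
x_kv = rng.standard_normal((M_ov * M_ov, C))  # overlapping image K/V window
d_kv = rng.standard_normal((M_ov * M_ov, C))  # overlapping depth K/V window

scale = 1 / np.sqrt(C)
attn_d_to_x = cross_attn(d_w, x_kv, scale)  # depth-guided branch
attn_x_to_d = cross_attn(x_w, d_kv, scale)  # image-guided branch
y_w = attn_d_to_x + attn_x_to_d + x_w       # merge with residual (Y_w)
assert y_w.shape == (M * M, C)
```

Because each query window only attends within its enlarged local window, the cost stays linear in the number of windows rather than quadratic in the full token count.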

3. DepthAnything V2 Priors: Extraction and Deployment

DepthAnything V2 serves as a fixed, general-purpose depth prior extractor. For each input image I, it predicts a single-channel depth map D. During both training and inference, DepthAnything V2 is not fine-tuned; instead, its frozen predictions are leveraged as follows:

  • D is refined and projected via DGAM at the network head.
  • In each encoder stage, the raw D augments image features via DPFM, ensuring depth priors inform feature extraction from coarse to fine granularity.

UDPNet explores small, base, and large student variants of DepthAnything V2, but maintains all model weights frozen, limiting overfitting and enhancing cross-domain robustness.
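One practical question is how a single frozen prior map serves every encoder scale. A plausible sketch is shown below; the 2x average-pooling scheme is an assumption for illustration, not the paper's stated resizing strategy:

```python
import numpy as np

def multiscale_depth(d: np.ndarray, num_stages: int = 3) -> list:
    """Downsample a frozen depth prior by 2x average pooling per
    encoder stage, so each DPFM receives a resolution-matched map.
    (Illustrative assumption; the paper's exact scheme may differ.)"""
    maps = [d]
    for _ in range(num_stages - 1):
        h, w = maps[-1].shape
        maps.append(maps[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return maps

# The prior itself stays fixed: it is computed once per image and
# reused at every scale, with no gradients flowing back into it.
d = np.arange(64, dtype=float).reshape(8, 8)
pyramid = multiscale_depth(d, num_stages=3)
assert [m.shape for m in pyramid] == [(8, 8), (4, 4), (2, 2)]
```

Keeping the extractor frozen means the prior is deterministic per image, which is what lets UDPNet avoid the noise of jointly learned depth.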

4. Computational Complexity and Resource Analysis

UDPNet introduces only \approx 0.3 M additional parameters over typical 13 M baselines (FSNet/ConvIR) and preserves linear computational complexity in the number of pixels (O(N), where N = HW). Specifically:

  • DGAM: +0.1 M parameters (FC layers, 3 \times 3 conv).
  • Each DPFM: ~0.05 M parameters for Q/K/V projections and bias.
  • Three DPFM stages: ~0.15 M total.
  • Overall delta: \approx 0.3 M parameters.

For a 256 \times 256 input, end-to-end dehazing achieves \sim 30 fps on an RTX 3090, indicating minimal runtime overhead compared to baselines.
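These figures can be sanity-checked with a few lines of arithmetic (all values are the approximate counts quoted in this section):

```python
# Approximate parameter counts, in millions, from the breakdown above.
dgam_params = 0.10           # DGAM: FC layers + 3x3 conv
dpfm_params_each = 0.05      # per-DPFM Q/K/V projections + bias
num_dpfm_stages = 3

delta = dgam_params + num_dpfm_stages * dpfm_params_each  # 0.25 M, reported as ~0.3 M
baseline_params = 13.0       # FSNet/ConvIR-scale backbone

overhead = delta / baseline_params
assert abs(delta - 0.25) < 1e-9
assert overhead < 0.02       # under 2% relative parameter overhead
```

The sum (0.25 M) is slightly below the rounded ~0.3 M headline figure, and in either case the relative overhead over a 13 M backbone is about 2%.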

5. Empirical Validation and State-of-the-Art Results

UDPNet demonstrates consistent performance improvements across multiple benchmark datasets, using PSNR, SSIM, and LPIPS as evaluation metrics on both synthetic and real-world images. Key quantitative gains include:

| Dataset | Baseline Method | Baseline PSNR (dB) | UDPNet PSNR (dB) | PSNR Gain (dB) |
|---|---|---|---|---|
| SOTS-Indoor | ConvIR | 42.72 | 43.12 | +0.40 |
| SOTS-Indoor | FSNet | 42.45 | 43.30 | +0.85 |
| Haze4K | FSNet | 34.12 | 35.31 | +1.19 |
| NHR (nighttime) | FSNet | 26.30 | 29.54 | +3.24 |
| SateHaze1k-Thick | ConvIR-S | 22.65 | 22.95 | +0.30 |
| SateHaze1k-Thick | PoolNet-S | 22.73 | 23.13 | +0.40 |

UDPNet models establish a new performance baseline across indoor, outdoor, nighttime, and remote sensing scenarios, with PSNR gains ranging from +0.30 dB to +3.24 dB and consistent improvements in SSIM/LPIPS. Notably, nighttime dehazing sees the most pronounced benefit, reflecting the combined effect of DGAM and DPFM under low-light conditions.

6. Ablation Studies and Module-wise Impact

Extensive ablation studies on Haze4K (FSNet backbone, 1000 epochs) reveal the following incremental effects:

| Configuration | PSNR (dB) | Relative Gain (dB) |
|---|---|---|
| Baseline (no depth) | 34.12 | 0.00 |
| + Naive depth concatenation | 34.78 | +0.66 |
| + DGAM only | 34.96 | +0.84 |
| + DPFM (single branch, depth→RGB CCA) | 34.93 | +0.81 |
| + DPFM (single branch, RGB→depth CCA) | 35.06 | +0.94 |
| + DPFM (depth→RGB spatial SCA) | 35.10 | +0.98 |
| + DPFM (RGB→depth spatial SCA) | 35.13 | +1.01 |
| + Combined both spatial branches (encoder) | 35.22 | +1.10 |

CCA: Channel Cross-Attention, SCA: Spatial Cross-Attention

DGAM alone confers a gain of ~0.8 dB; each DPFM branch adds ~0.2–0.3 dB, and combining both spatial branches brings the total improvement to +1.10 dB. Two attention heads per DPFM yield optimal results.

7. Significance, Robustness, and Practical Implications

UDPNet demonstrates that fixed, pretrained large-scale geometric priors can be seamlessly and efficiently injected into deep dehazing pipelines through lightweight, plug-and-play modules. Depth-guided channel attention (DGAM) sharpens feature selection at the input, and dual sliding-window cross-attention (DPFM) ensures geometric priors direct the feature extraction process throughout the encoder hierarchy.

This architecture enables robust, real-time image dehazing that generalizes across domains, illumination regimes, and haze densities. It achieves higher PSNR and perceptual consistency than prior approaches at negligible computational overhead, and establishes new benchmarks on standard datasets such as SOTS, Haze4K, and NHR (Zuo et al., 11 Jan 2026).
