WaveDM: Wavelet Diffusion Models
- WaveDM is a family of models that use wavelet-domain representations combined with diffusion processes to enable efficient image restoration and signal processing.
- The approach decomposes images into low- and high-frequency bands via a multi-level 2D wavelet transform, which reduces computational load and allows for frequency-specific processing.
- It integrates a conditional diffusion process for low-frequency recovery with a high-frequency refinement module, achieving state-of-the-art performance and significant inference speed improvements.
WaveDM refers to a family of models and methodologies that leverage wavelet-domain representations in combination with diffusion processes or related signal processing architectures for efficient, high-fidelity data modeling, restoration, and communication. Across applications including image restoration, signal multiplexing, physical modeling, and dynamic system identification, WaveDM exploits the multiresolution structure inherent to the wavelet transform to accelerate computation, specialize processing by frequency band, or enhance statistical efficiency. Here, the primary focus is the wavelet-based diffusion model for image restoration as exemplified in "WaveDM: Wavelet-Based Diffusion Models for Image Restoration" (Huang et al., 2023).
1. Core Principles and Model Formulation
WaveDM frames image restoration as conditional generation of clean images in the wavelet domain, given the degraded image’s wavelet spectrum. The method applies a multi-level 2D discrete wavelet packet transform (FWPT) to the input, decomposing images into low- and high-frequency spectral components:
- Low-frequency bands ($x^L$): Carry coarse structural and color information.
- High-frequency bands ($x^H$): Encode texture, edges, and fine details.
The model trains:
- A conditional diffusion process over the clean low-frequency coefficients $x_0^L$, effectively modeling $p(x_0^L \mid x_d)$, where $x_d$ is the degraded spectrum, using a denoising diffusion probabilistic model (DDPM).
- A high-frequency refinement module (HFRM) to estimate high-frequency components in a single forward pass, leveraging the redundancy and sparsity of these bands for computational efficiency.
The full image is reconstructed by concatenating the generated low-frequency bands with the refined high-frequency bands and applying the inverse FWPT.
2. Wavelet Domain Representation and Processing
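Adopting a multi-level 2D wavelet transform, WaveDM converts an RGB image $x \in \mathbb{R}^{H \times W \times 3}$ at level $k$ into $4^k$ subbands, each of size $\frac{H}{2^k} \times \frac{W}{2^k} \times 3$, typically using the Haar basis:
- Low-pass filter $\ell = \frac{1}{\sqrt{2}}[1,\ 1]$, high-pass filter $h = \frac{1}{\sqrt{2}}[1,\ -1]$.
The spectrum is partitioned:
- $x^L \in \mathbb{R}^{\frac{H}{2^k} \times \frac{W}{2^k} \times 3}$, $x^H \in \mathbb{R}^{\frac{H}{2^k} \times \frac{W}{2^k} \times 3(4^k - 1)}$, with 3 and 45 channels, respectively, for $k = 2$.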
This decomposition confers several advantages:
- Reduced computational cost: The diffusion model operates on $1/16$ of the original spatial area (for $k = 2$), which translates into a corresponding reduction in per-step inference cost.
- Separation of structure and detail: Enables tailored neural architectures for low-vs-high frequency processing.
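The decomposition described above can be illustrated with a minimal NumPy sketch of a full (packet-style) two-level Haar transform. The function names (`haar_step`, `wavelet_packet`, `split_bands`), the orthonormal scaling, and the channel ordering (lowest-frequency subband first) are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def haar_step(x):
    """One 2D Haar analysis step: (H, W, C) -> (H/2, W/2, 4C).

    Output channel blocks are ordered [LL, LH, HL, HH] (an assumed convention).
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return np.concatenate([ll, lh, hl, hh], axis=-1)

def haar_step_inverse(y):
    """Exact inverse of haar_step: (H, W, 4C) -> (2H, 2W, C)."""
    h, w, c4 = y.shape
    c = c4 // 4
    ll, lh, hl, hh = (y[..., i * c:(i + 1) * c] for i in range(4))
    x = np.empty((2 * h, 2 * w, c), dtype=y.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_packet(x, k=2):
    """Apply haar_step to all subbands k times: 3 channels -> 3 * 4**k channels."""
    for _ in range(k):
        x = haar_step(x)
    return x

def wavelet_packet_inverse(y, k=2):
    for _ in range(k):
        y = haar_step_inverse(y)
    return y

def split_bands(spec, n_low=3):
    """Split a spectrum into low-frequency (first n_low channels) and high-frequency bands."""
    return spec[..., :n_low], spec[..., n_low:]

# Example: a 480x720 RGB image becomes a 120x180 spectrum with 48 channels.
image = np.random.default_rng(0).random((480, 720, 3))
spectrum = wavelet_packet(image, k=2)
low, high = split_bands(spectrum)
print(spectrum.shape, low.shape[-1], high.shape[-1])
```

With $k = 2$ this yields the 3 low-frequency and 45 high-frequency channels mentioned above, and the inverse reproduces the input up to floating-point error.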
3. Conditional Diffusion and Frequency-Specific Modules
3.1. Diffusion Process in the Wavelet Domain
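Noising is applied only to the low-frequency (structural) bands:
$$x_t^L = \sqrt{\bar{\alpha}_t}\, x_0^L + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$
for $t = 1, \dots, T$, where the full noise schedule is defined via $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$.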
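As a rough illustration of the forward (noising) step on the low-frequency bands, the sketch below uses the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The linear schedule with 1000 steps and $\beta \in [10^{-4}, 0.02]$ is the common DDPM default and is assumed here only for demonstration; `make_alpha_bar` and `q_sample` are hypothetical helper names.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def q_sample(x0_low, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) for a 1-indexed timestep t.

    Returns the noisy low-frequency bands and the noise that was added,
    which is the regression target of a noise-prediction network.
    """
    eps = rng.standard_normal(x0_low.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0_low + np.sqrt(1.0 - a) * eps
    return x_t, eps

# Assumed schedule (common DDPM default, not taken from the source).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = make_alpha_bar(betas)

rng = np.random.default_rng(0)
x0_low = rng.random((120, 180, 3))  # clean low-frequency bands (3 channels)
x_t, eps = q_sample(x0_low, t=500, alpha_bar=alpha_bar, rng=rng)
print(x_t.shape)
```

Only the 3-channel low-frequency tensor is noised; the high-frequency bands never enter the diffusion chain.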
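Reverse denoising, conditioned on the degraded spectrum $x_d$ and the refined high-frequency estimate $\hat{x}^H$, is performed by a U-Net noise estimation network (NEN) $\epsilon_\theta$ acting on the noisy low-frequency bands $x_t^L$ at step $t$:
$$\hat{\epsilon} = \epsilon_\theta(x_t^L, x_d, \hat{x}^H, t).$$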
3.2. High-Frequency Refinement
The HFRM (High-Frequency Refinement Module) is a compact U-Net that maps the degraded spectrum $x_d$ directly to a refined estimate $\hat{x}^H$ of the high-frequency bands in a single pass:
$$\hat{x}^H = \mathrm{HFRM}(x_d).$$
3.3. Training Objective
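The diffusion and refinement modules are optimized jointly: the noise estimation network is trained with a noise-prediction objective on the low-frequency bands, while the HFRM is supervised by the clean high-frequency bands.
Training runs for 2 million iterations with the Adam optimizer and a predefined noise schedule $\beta_t$.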
4. Efficient Conditional Sampling (ECS) and Inference Acceleration
ECS is an inference strategy that leverages the rapid convergence of low-frequency restoration in the wavelet domain:
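- After a moderate number of DDIM (Denoising Diffusion Implicit Models) steps, ending at an intermediate timestep $M$, the denoised low-frequency bands are already sufficiently accurate when conditioned on both the degraded wavelet spectrum $x_d$ and the HFRM output $\hat{x}^H$.
- The algorithm then skips the remaining steps and computes the final clean low-frequency coefficients directly with a closed-form denoising formula:
  $$\hat{x}_0^L = \frac{x_M^L - \sqrt{1 - \bar{\alpha}_M}\, \epsilon_\theta(x_M^L, x_d, \hat{x}^H, M)}{\sqrt{\bar{\alpha}_M}},$$
  where $\bar{\alpha}_M = \prod_{s=1}^{M} (1 - \beta_s)$.
- The output is reconstructed by inverse FWPT: $\hat{x} = \mathrm{IFWPT}([\hat{x}_0^L, \hat{x}^H])$.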
Empirically, ECS achieves comparable or superior results to full 25-step DDIM chains, with as few as 4–8 steps in high-resolution restoration.
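The ECS idea (a few DDIM steps followed by a one-shot closed-form estimate of the clean low-frequency bands) can be sketched as below. This is a simplified illustration under stated assumptions: deterministic DDIM updates, a noise-prediction model passed in as a plain callable `eps_model(x, cond, t)`, and an arbitrary example timestep list. None of these names or values come from the source.

```python
import numpy as np

def predict_x0(x_t, eps_hat, a_t):
    """Closed-form clean estimate from a noisy sample and predicted noise."""
    return (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)

def ecs_sample(eps_model, cond, shape, alpha_bar, timesteps, rng):
    """Run DDIM over `timesteps` (decreasing, 1-indexed), then jump to x0.

    The last entry of `timesteps` plays the role of the intermediate step M:
    no further iterations are run after it; the clean low-frequency bands
    are computed directly from the noise prediction at M.
    """
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_prev = alpha_bar[t - 1], alpha_bar[t_prev - 1]
        eps_hat = eps_model(x, cond, t)
        x0_hat = predict_x0(x, eps_hat, a_t)
        # Deterministic DDIM update (eta = 0, an assumption of this sketch).
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    m = timesteps[-1]
    return predict_x0(x, eps_model(x, cond, m), alpha_bar[m - 1])

# Toy usage with an oracle noise predictor that knows the clean target.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
target = np.random.default_rng(1).random((4, 4, 3))

def oracle(x, cond, t):
    a = alpha_bar[t - 1]
    return (x - np.sqrt(a) * target) / np.sqrt(1.0 - a)

out = ecs_sample(oracle, None, target.shape, alpha_bar,
                 [1000, 900, 800, 700], np.random.default_rng(0))
print(np.allclose(out, target))
```

In a real pipeline `cond` would carry the degraded spectrum and the HFRM output, and the returned low-frequency bands would be concatenated with the refined high-frequency bands before the inverse transform.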
5. Architecture and Complexity Analysis
| Module | Base Channels | Channel Multipliers | Key Features |
|---|---|---|---|
| HFRM | 32 | 1,2,4,8,16 | 1 residual block/scale; 48-in, 45-out channels |
| NEN | 128 | 1,1,2,2,4,4 | 2 residual blocks/scale, attention, 512-D t-embed, inputs: noisy low-frequency bands $x_t^L$, degraded spectrum $x_d$, HFRM output $\hat{x}^H$ (96 ch) |
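- Operating in the wavelet domain shrinks the spatial area processed by the U-Net by a factor of $4^k$ (16 for $k = 2$), with a corresponding reduction in FLOPs per step compared to pixel-wise DDPM on the full image.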
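- Empirical timings on a 720×480 image follow the same pattern as the results table in Section 6: patch-wise PatchDM (25 steps) is orders of magnitude slower than either a one-pass state-of-the-art CNN or WaveDM with 8 ECS steps.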
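The channel counts in the table can be cross-checked with a few lines of arithmetic. The helper below (`wavedm_shapes`, a hypothetical name) assumes that the NEN input is the channel-wise concatenation of the noisy low-frequency bands, the degraded spectrum, and the HFRM output, which is consistent with the 96 input channels listed above.

```python
def wavedm_shapes(height, width, k=2, rgb=3):
    """Tensor bookkeeping for a level-k full wavelet decomposition of an RGB image."""
    n_subbands = 4 ** k
    total = rgb * n_subbands          # channels of the full spectrum
    low = rgb                         # lowest-frequency subband
    high = total - low                # all remaining subbands
    return {
        "spatial": (height // 2 ** k, width // 2 ** k),
        "spectrum_channels": total,
        "low_channels": low,
        "high_channels": high,
        # Assumed concatenation: noisy low bands + degraded spectrum + HFRM output.
        "nen_input_channels": low + total + high,
        "area_fraction": 1 / n_subbands,
    }

info = wavedm_shapes(480, 720, k=2)
for key, value in info.items():
    print(f"{key}: {value}")
```

For $k = 2$ this gives a 48-channel spectrum (HFRM input), 45 high-frequency channels (HFRM output), 96 NEN input channels, and a spatial area of $1/16$ of the original.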
6. Quantitative and Qualitative Results
WaveDM attains state-of-the-art or competitive performance on multiple restoration benchmarks, covering raindrop removal, rain-streak removal, dehazing, defocus deblurring, demoiréing, and denoising. Representative results, reported as PSNR (dB) / SSIM / inference time and selected from Table 1 of Huang et al. (2023):
| Task | One-pass best | PatchDM (25 steps) | WaveDM (ECS) |
|---|---|---|---|
| Raindrop | 31.87/0.931/0.39 s | 32.31/0.946/301 s | 32.25/0.948/0.30 s |
| Dehazing | 34.95/0.984/0.14 s | 35.52/0.989/19.3 s | 37.00/0.994/0.15 s |
| Real denoise | 40.02/0.960/0.114 s | 39.86/0.959/9.33 s | 40.38/0.962/0.062 s |
For Gaussian denoising (e.g., on the Set12 dataset), WaveDM achieves 28.44 dB PSNR, outperforming MWCNN, SwinIR, and Restormer.
Qualitatively, WaveDM produces sharper textures and more accurate edges across all tasks. Figures in the source paper show details and textures that competing methods fail to preserve.
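The speed gap implied by the time columns of the table above can be made explicit with a short calculation. The dictionary below simply transcribes the reported inference times in seconds; the variable and function names are illustrative.

```python
# Inference times in seconds, transcribed from the results table above.
timings = {
    "Raindrop":     {"one_pass": 0.39,  "patchdm": 301.0, "wavedm": 0.30},
    "Dehazing":     {"one_pass": 0.14,  "patchdm": 19.3,  "wavedm": 0.15},
    "Real denoise": {"one_pass": 0.114, "patchdm": 9.33,  "wavedm": 0.062},
}

def speedup_over_patchdm(row):
    """How many times faster WaveDM (ECS) is than PatchDM for one task."""
    return row["patchdm"] / row["wavedm"]

def ratio_to_one_pass(row):
    """WaveDM time relative to the best one-pass method (values below 1 mean faster)."""
    return row["wavedm"] / row["one_pass"]

for task, row in timings.items():
    print(f"{task}: {speedup_over_patchdm(row):.0f}x faster than PatchDM, "
          f"{ratio_to_one_pass(row):.2f}x the one-pass time")
```

On these three rows WaveDM is roughly 1000, 129, and 150 times faster than PatchDM, while staying within about 7% of the one-pass time on dehazing and undercutting it on the other two tasks.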
7. Limitations, Open Challenges, and Future Directions
- Training cost: Full training requires several days on multi-GPU clusters for 2 million iterations.
- Potential improvements:
  - Knowledge distillation or preconditioned architectures to accelerate convergence.
  - Exploring alternative wavelet bases (e.g., Daubechies, biorthogonal) for domain-specific tailoring.
  - Extending the framework to video or more complex inverse problems.
- Wavelet selection and resolution: The trade-off between compression (a higher decomposition level $k$) and preservation of spatial detail remains a hyperparameter to optimize.
- Current scope: The architecture is primarily optimized for image restoration; direct extensions to segmentation or generation are prospective.
WaveDM demonstrates that shifting conditional denoising diffusion to the wavelet domain, and pairing a dual-module design with efficient conditional sampling, can achieve both high restoration fidelity and fast inference, with a speedup of more than $100\times$ over vanilla, patch-based diffusion models (Huang et al., 2023).