MADNet: Multi-scale Adaptive Dual-domain Network
- MADNet is a multi-scale and dual-domain network for image denoising that leverages adaptive frequency and spatial processing to restore details from noisy images.
- It integrates pyramid inputs, dual-domain modulation layers, and global feature fusion to systematically capture and enhance features at various resolutions.
- Empirical results demonstrate that MADNet outperforms state-of-the-art methods on synthetic and real-image benchmarks, achieving improved PSNR and SSIM metrics.
MADNet denotes a set of architectures and methodologies introduced under the acronym "Multi-scale Adaptive Dual-domain Network" in the context of image denoising. The principal MADNet design (Zhao et al., 19 Jun 2025) targets the limitations of conventional single-scale, single-domain denoisers by leveraging multi-scale representations and adaptive frequency-based feature manipulation to achieve end-to-end restoration of noisy images with state-of-the-art fidelity on synthetic and real datasets.
1. Motivation and Problem Formulation
MADNet was developed to address two fundamental limitations observed in traditional U-Net–style denoising networks. First, standard architectures operate with single-scale (full-resolution) image input/output mappings, thereby underutilizing the inherent decrease in high-frequency noise observed in down-sampled image versions. Second, previous frameworks treat the frequency domain uniformly, neglecting the divergent characteristics and perceptual implications of high-frequency (texture/grain) versus low-frequency (smooth/bias) degradations. MADNet explicitly incorporates a pyramid of multi-scale inputs and decomposes image features to adaptively exploit both spatial and frequency distinctions, enabling targeted restoration of diverse noise patterns (Zhao et al., 19 Jun 2025).
Empirical observations confirm that lower-resolution versions of a noisy image have measurably less high-frequency corruption. For example, mean squared error decreases and SSIM increases at coarser scales for the same noise strength, indicating the value of multi-scale contexts (Table 1).
| Scale Factor | MSE | PSNR | SSIM |
|---|---|---|---|
| 1.0 | 2.9e-4 | 35.23 | 0.83 |
| 0.5 | 1.7e-4 | 37.61 | 0.90 |
| 0.25 | 1.6e-4 | 37.74 | 0.92 |
| 0.125 | 1.5e-4 | 38.06 | 0.94 |
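The effect behind Table 1 can be reproduced with a toy numpy experiment: average-pooling a noisy image attenuates i.i.d. Gaussian noise at each coarser scale, so the MSE against the (identically downsampled) clean image shrinks. This is an illustrative sketch, not the paper's measurement protocol; the image, noise level, and pooling operator are all assumptions.

```python
import numpy as np

def downsample2x(img):
    """2x average-pooling downsample (assumes even H and W)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
clean = rng.random((64, 64))                        # stand-in "clean" image in [0, 1]
noisy = clean + rng.normal(0.0, 0.1, clean.shape)   # AWGN, sigma = 0.1

errors = []
c, n = clean, noisy
for _ in range(4):            # scale factors 1.0, 0.5, 0.25, 0.125
    errors.append(mse(c, n))
    c, n = downsample2x(c), downsample2x(n)

# Averaging 2x2 blocks reduces i.i.d. noise variance by ~4x per level,
# so the MSE sequence decreases monotonically, mirroring Table 1.
print(errors)
```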
2. Network Architecture and Key Components
MADNet adopts a multi-input, multi-output U-Net backbone operating at four pyramid scales (downsampling factors 1×, 2×, 4×, 8×). The architecture consists of:
- Multi-scale Pyramid Inputs: The noisy input image is down-sampled to create {x₀, x₁, x₂, x₃}, each at a different resolution. Shallow convolutional layers extract features Fᵢ, with channel counts scaled according to the pyramid level.
- Encoder-Decoder with Dual-Domain Modulation: Each scale passes through a sequence of Dual-Domain Modulation Layers (DDMLs), which combine spatial and frequency processing. Downsampling and upsampling operations allow feature exchange between scales, and skip connections—enhanced through global context fusion—preserve high-resolution structural information.
- Residual Prediction at Each Scale: For each scale, a convolutional layer regresses a residual sᵢ, yielding the denoised estimate ŷᵢ = xᵢ − sᵢ.
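The pyramid construction and per-scale residual prediction above can be sketched as follows. This is a minimal numpy skeleton under stated assumptions: the downsampling operator is average pooling (the paper's exact operator is not specified here), and the residual head is a zero-output placeholder standing in for the learned convolutions.

```python
import numpy as np

def downsample2x(x):
    """2x average-pooling downsample (hypothetical stand-in operator)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(x, levels=4):
    """Multi-scale pyramid inputs {x0, x1, x2, x3}."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def predict_residuals(pyramid):
    """Placeholder per-scale residual head: a trained model would regress
    each s_i with convolutions; here we return zeros of matching shape."""
    return [np.zeros_like(x_i) for x_i in pyramid]

x = np.random.default_rng(1).random((32, 32))
pyramid = build_pyramid(x)
residuals = predict_residuals(pyramid)
# Denoised estimate at each scale: y_i = x_i - s_i
outputs = [x_i - s_i for x_i, s_i in zip(pyramid, residuals)]
```

With zero residuals each output equals its input; a trained network would instead predict the noise component at every scale.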
3. Dual-Domain Modulation Layer & Adaptive Spatial-Frequency Learning Unit
The DDML is the operative block for spatial-frequency feature extraction. Its core innovation is the Adaptive Spatial-Frequency Learning Unit (ASFU), which directly addresses the adaptive disentangling and enhancement of high- and low-frequency content.
- Adaptive Frequency Enhancement Block (AFEB): Employs a learnable channel-wise mask to separate the Fourier-transformed features into high- and low-frequency components. Each branch is independently enhanced using inverse FFT, multilayer perceptron (MLP), and a transposed self-attention (TSA) mechanism before recombination.
- Adaptive Spatial Enhancement Block (ASEB)/TSA: Performs channel-mixed self-attention in spatial coordinates, further refining the extracted features especially after frequency domain processing.
Mathematically, the mask M is generated from the input features via a learned 1×1 convolution followed by a sigmoid, and frequency splitting is achieved as F_high = M ⊙ Fₓ and F_low = (1 − M) ⊙ Fₓ; each branch then passes through its own enhancement route before the two are summed back together.
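The mask-based split is easy to sanity-check numerically: since M + (1 − M) = 1, the decomposition is lossless before any per-branch enhancement is applied. In this sketch the learned 1×1 convolution is replaced by random logits (an assumption for illustration; a real AFEB would learn these weights).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 16, 16))                       # C x H x W feature map
F_x = np.fft.fft2(x, axes=(-2, -1))               # per-channel spectrum

# Hypothetical learned mask: a 1x1 conv + sigmoid reduces to a
# per-channel gate; random logits stand in for the learned weights.
logits = rng.normal(size=(8, 1, 1))
M = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid, values in (0, 1)

F_high = M * F_x                                  # "high-frequency" branch
F_low = (1.0 - M) * F_x                           # "low-frequency" branch

# Because M + (1 - M) = 1, summing the branches recovers the spectrum,
# and the inverse FFT recovers the original (real) features.
recon = np.fft.ifft2(F_high + F_low, axes=(-2, -1))
```

In MADNet each branch is enhanced independently before this recombination, so the sum is no longer an identity but a selectively amplified reconstruction.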
4. Global Feature Fusion Block and Multi-Scale Integration
The Global Feature Fusion Block (GFFB) is integrated into the skip connections to aggregate deep representations across all scales before propagating to the decoder at each level. Features from all scales are concatenated, and an ASEB merges the global context into each scale's pathway, improving cross-scale consistency and enabling precise information routing between fine and coarse representations.
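A minimal sketch of this concatenate-and-mix fusion, with assumptions labeled: alignment uses average-pool/nearest-neighbor resizing, and the ASEB-based merge is replaced by a random, untrained 1×1 projection purely to show the tensor plumbing.

```python
import numpy as np

rng = np.random.default_rng(0)

def down2(f):
    """2x average-pool a C x H x W feature map (hypothetical operator)."""
    c, h, w = f.shape
    return f.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up2(f):
    """2x nearest-neighbor upsample."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def global_feature_fusion(features, target, out_channels):
    """Align every scale's features to the target resolution, concatenate
    along channels, and mix with a random (untrained) 1x1 projection --
    a stand-in for the ASEB-based merge described in the text."""
    th = features[target].shape[1]
    aligned = []
    for f in features:
        while f.shape[1] > th:
            f = down2(f)
        while f.shape[1] < th:
            f = up2(f)
        aligned.append(f)
    cat = np.concatenate(aligned, axis=0)              # channel-wise concat
    W = rng.normal(size=(out_channels, cat.shape[0]))  # 1x1 "conv" weights
    return np.einsum('oc,chw->ohw', W, cat)

# Four scales with doubling channels (C=8 base, for the sketch only).
features = [rng.random((8 * 2**s, 32 // 2**s, 32 // 2**s)) for s in range(4)]
fused = global_feature_fusion(features, target=0, out_channels=8)
```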
The architecture supports comprehensive feature propagation, resulting in robust spatial-frequency recovery. Channel widths scale systematically with pyramid depth (e.g., C=64 base channels, doubled at each coarser scale), and the network uses conventional downsampling and upsampling operators.
5. Training Procedures and Loss Functions
Training for MADNet utilizes a composite loss that combines spatial-domain and frequency-domain fidelity. For scale s, the spatial loss is a Charbonnier penalty, ℒ_spat = √(‖ŷₛ − yₛ‖² + ε²), while the frequency loss is an ℓ₁ norm over the FFTs of outputs and targets, ℒ_freq = ‖FFT(ŷₛ) − FFT(yₛ)‖₁. The total loss sums both terms over all scales, ℒ = Σₛ (ℒ_spat + λ·ℒ_freq), with λ weighting the frequency term. Training employs Adam with heavy data augmentation and a step or cosine-annealed learning-rate schedule depending on the dataset; both synthetic and real denoising scenarios are covered.
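The composite loss can be sketched directly from the formulas above. The weight λ and ε are hypothetical values chosen for illustration; the paper's settings may differ.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier spatial penalty: sqrt(||pred - target||^2 + eps^2)."""
    return float(np.sqrt(np.sum((pred - target) ** 2) + eps ** 2))

def freq_l1(pred, target):
    """l1 distance between the FFT spectra of prediction and target."""
    return float(np.sum(np.abs(np.fft.fft2(pred) - np.fft.fft2(target))))

def total_loss(preds, targets, lam=0.1):
    """Sum of spatial + weighted frequency terms over all scales
    (lam is an assumed weight, not the paper's value)."""
    return sum(charbonnier(p, t) + lam * freq_l1(p, t)
               for p, t in zip(preds, targets))

rng = np.random.default_rng(0)
# Four pyramid scales of targets; use them as their own predictions.
y = [rng.random((16 // 2**s, 16 // 2**s)) for s in range(4)]
zero_loss = total_loss(y, y)   # identical pred/target -> only eps survives
```

With identical predictions and targets, each Charbonnier term reduces to ε and every frequency term vanishes, so the total is 4ε, confirming the loss floor of this formulation.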
6. Empirical Performance and Ablation Analysis
On standard benchmarks with synthetic additive white Gaussian noise (e.g., CBSD68, Set12 at σ=50), MADNet achieves up to 0.18 dB PSNR gains over the nearest competitor, preserving textures and suppressing artifacts more effectively than DRANet or MSANet.
| Method | PSNR (CBSD68, σ=50) | SSIM |
|---|---|---|
| MSANet | 28.32 | 0.804 |
| DRANet | 28.37 | 0.806 |
| MADNet | 28.41 | 0.807 |
On real-sensor noise datasets (SIDD, DND), MADNet attains superior metrics (e.g., PSNR 39.98 and SSIM 0.956 on DND), outperforming DRANet and all prior SOTA at a moderate computational cost (28.4M params, 48.2ms/image at 256×256).
Ablation studies demonstrate that both multi-scale inputs and global feature fusion provide ~0.06 dB gains each, while separating/enhancing frequency bands via the mask-based AFEB yields ~0.1 dB gain. The dual-domain modules (AFEB + ASEB) synergistically outperform either alone, confirming the necessity of the adaptive decomposition.
7. Model Complexity, Limitations, and Prospective Directions
Although MADNet's FLOP count is higher than some alternatives (91.35G vs. 35.35G for MSANet), its parameter count and runtime remain practical for high-quality denoising. The multi-scale, dual-domain structure buys its performance gains at the cost of increased computation per image; the authors suggest future work on reducing parameter overhead without sacrificing denoising quality.
In summary, MADNet collectively integrates multi-scale pyramidal processing, adaptive high/low-frequency splitting and enhancement, and global context fusion to achieve state-of-the-art denoising across synthetic and real imaging scenarios, providing significant improvements in perceptual and quantitative restoration accuracy (Zhao et al., 19 Jun 2025).