
MAFNet: Multi-frequency Adaptive Fusion Network

Updated 6 December 2025
  • The paper introduces a novel adaptive frequency decomposition that separates high- and low-frequency components using FFT and learnable filters.
  • MAFNet employs low-rank attention and efficient convolutional backbones to fuse spatial, spectral, and representational features in a lightweight manner.
  • The network achieves state-of-the-art results in stereo matching (KITTI) and deblurring (GoPro) while reducing computational demands.

A Multi-frequency Adaptive Fusion Network (MAFNet) is a neural architecture paradigm that decomposes inputs into frequency-aware components, aggregates them via adaptive attention or gating, and fuses information to enhance spatial, spectral, or representational fidelity. Instances of MAFNet address high-speed stereo matching and image deblurring by leveraging frequency-domain filtering, low-rank attention mechanisms, and efficient convolutional backbones. Core to these variants is the explicit handling and fusion of high- and low-frequency representations, improving both accuracy and efficiency compared to spatial-only or conventional attention networks (Gao et al., 20 Feb 2025, Xu et al., 4 Dec 2025).

1. Core Network Principles and Frequency Decomposition

All MAFNet variants adopt the principle of adaptive fusion of feature subbands—usually dividing information into high-frequency (edges, details) and low-frequency (smooth, coarse features) bands. This separation is performed in the frequency domain, using either learnable low-pass filters, discrete Fourier transforms (FFT/RFFT), or similar operators. The explicit two-band splitting allows the network to address image regions with different statistical properties, such as sharp transitions versus homogeneous textures.

For example, in real-time stereo matching, the Adaptive Frequency-Domain Filtering Attention (AFFA) module performs a real-valued FFT on input feature maps. Radially parameterized soft masks, defined as

$$M_{\mathrm{low}}(u,v) = \sigma\bigl(\tau\,(T_\ell - r(u,v))\bigr), \qquad M_{\mathrm{high}}(u,v) = \sigma\bigl(\tau\,(r(u,v) - T_h)\bigr),$$

(where $r(u,v)$ is the normalized radial frequency and $\tau, T_\ell, T_h$ are learnable) are applied in the frequency plane. The resulting masked features undergo an inverse FFT, yielding spatial-domain representations for the low- and high-frequency bands (Xu et al., 4 Dec 2025).
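The following is a minimal PyTorch sketch of this two-band split. It mirrors the mask definition above, but the module name, initial threshold values, and the use of `rfft2`/`irfft2` are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class RadialBandSplit(nn.Module):
    """Two-band frequency split with learnable radial soft masks
    (an AFFA-style sketch, not the paper's code)."""

    def __init__(self):
        super().__init__()
        self.t_low = nn.Parameter(torch.tensor(0.3))   # T_ell, low cutoff
        self.t_high = nn.Parameter(torch.tensor(0.5))  # T_h, high cutoff
        self.tau = nn.Parameter(torch.tensor(10.0))    # mask sharpness

    def forward(self, x):  # x: (B, C, H, W) real-valued features
        _, _, H, W = x.shape
        X = torch.fft.rfft2(x, norm="ortho")           # (B, C, H, W//2 + 1)

        # Radial frequency r(u, v), scaled so the per-axis Nyquist
        # frequency maps to 1.
        fy = torch.fft.fftfreq(H, device=x.device)
        fx = torch.fft.rfftfreq(W, device=x.device)
        r = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2) / 0.5

        # Soft masks M_low, M_high as defined above.
        m_low = torch.sigmoid(self.tau * (self.t_low - r))
        m_high = torch.sigmoid(self.tau * (r - self.t_high))

        # Filter in the frequency plane, then return to the spatial domain.
        x_low = torch.fft.irfft2(X * m_low, s=(H, W), norm="ortho")
        x_high = torch.fft.irfft2(X * m_high, s=(H, W), norm="ortho")
        return x_low, x_high
```

Because $\tau$, $T_\ell$, and $T_h$ are ordinary parameters, the cutoffs and mask sharpness are tuned by backpropagation along with the rest of the network.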

Similarly, for image deblurring, the Frequency Domain Information Dynamic Generation Module (FDGM) applies learnable spatially-varying low-pass and high-pass filters via depth-wise convolution, producing adaptive per-row frequency subbands (Gao et al., 20 Feb 2025).
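A simplified sketch of this idea is shown below, assuming the high-frequency band is taken as the residual of a learnable depth-wise low-pass filter. The real FDGM generates spatially-varying filters dynamically; this static version only illustrates the depth-wise filtering principle.

```python
import torch
import torch.nn as nn

class LearnableBandFilters(nn.Module):
    """Depth-wise learnable low-pass filter with a residual high band
    (a simplified stand-in for the FDGM, not the paper's module)."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.low_pass = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False,
        )
        # Start from a box (averaging) kernel so the branch is a genuine
        # low-pass filter at initialization.
        nn.init.constant_(self.low_pass.weight, 1.0 / kernel_size ** 2)

    def forward(self, x):  # x: (B, C, H, W)
        low = self.low_pass(x)   # smooth, coarse structure
        high = x - low           # residual: edges and fine detail
        return low, high
```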

2. Architecture Design and Feature Fusion

In stereo matching, MAFNet employs a lightweight 2D convolutional encoder (MobileViT), constructs a 4D cost volume at $\frac{1}{4}$ resolution via feature concatenation, and applies AFFA at multiple feature scales. The main computational innovation is the Linformer-based low-rank attention fusion (AFHF), which concatenates the band-filtered cost volumes and applies low-rank attention:

$$\mathrm{LinAtt}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}'^{\top}}{\sqrt{d}}\right)\mathbf{V}'$$

with $\mathbf{K}' = \mathbf{E}\mathbf{K}$ and $\mathbf{V}' = \mathbf{F}\mathbf{V}$, reducing attention complexity from $O(N^2)$ to $O(Nk)$ (Xu et al., 4 Dec 2025). This fusion adaptively integrates contextual information across high- and low-frequency domains while enabling real-time inference.
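The projection itself is compact enough to sketch directly. The module below implements the Linformer-style attention in the equation above; the layer names, fixed sequence length, and default rank `k` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LowRankAttention(nn.Module):
    """Linformer-style attention: learned matrices E and F compress keys
    and values from sequence length N down to rank k, so the score matrix
    is N x k rather than N x N."""

    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.E = nn.Linear(seq_len, k, bias=False)  # K' = E K
        self.F = nn.Linear(seq_len, k, bias=False)  # V' = F V

    def forward(self, x):  # x: (B, N, dim), N must equal seq_len
        q, k_, v = self.to_qkv(x).chunk(3, dim=-1)
        # Compress along the sequence axis: (B, N, d) -> (B, k, d).
        k_ = self.E(k_.transpose(1, 2)).transpose(1, 2)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)
        attn = torch.softmax(q @ k_.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v  # (B, N, dim), computed in O(N k)
```

For a flattened cost volume, $N$ is the number of spatial positions, so the drop from $O(N^2)$ to $O(Nk)$ is what makes global fusion affordable in real time.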

For deblurring, MAFNet adopts a U-shaped encoder–decoder with multi-scale shallow and deep feature paths. Each stage contains cascaded MAFBlocks that jointly process spatial and frequency information. The Gated Fusion Module (GFM) internally re-weights spatial, low-, and high-frequency features via a gating mechanism involving global average and standard deviation pooling, then fuses them through a cross-attention mechanism with output weighting determined by learned softmax coefficients (Gao et al., 20 Feb 2025).
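A rough sketch of such statistics-driven gating follows, assuming global average and standard deviation pooling feed a small MLP that emits softmax weights over the three branches; the GFM's cross-attention stage is omitted for brevity, and the layer shapes are illustrative.

```python
import torch
import torch.nn as nn

class GatedBandFusion(nn.Module):
    """Softmax-gated fusion of spatial, low-, and high-frequency branches,
    driven by per-branch mean and standard deviation statistics
    (a simplified GFM-style sketch, not the paper's module)."""

    def __init__(self, channels):
        super().__init__()
        # 2 statistics (mean, std) x 3 branches -> 3 branch weights.
        self.gate = nn.Sequential(
            nn.Linear(channels * 6, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 3),
        )

    def forward(self, spatial, low, high):  # each: (B, C, H, W)
        stats = []
        for feat in (spatial, low, high):
            stats.append(feat.mean(dim=(2, 3)))  # global average pooling
            stats.append(feat.std(dim=(2, 3)))   # standard deviation pooling
        w = torch.softmax(self.gate(torch.cat(stats, dim=1)), dim=1)
        w = w[:, :, None, None, None]                       # (B, 3, 1, 1, 1)
        stacked = torch.stack((spatial, low, high), dim=1)  # (B, 3, C, H, W)
        return (w * stacked).sum(dim=1)                     # (B, C, H, W)
```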

3. Loss Functions and Optimization Strategies

MAFNet designs for both deblurring and stereo matching employ multi-component loss objectives to ensure robust feature learning:

  • In stereo matching, supervision is applied via smooth $L_1$ loss on predicted disparities at both full and reduced resolutions:

$$\mathcal{L} = \lambda_0\,\mathcal{L}^{\mathrm{L1}}_{\mathrm{smooth}}\bigl(\hat{D}^0 - D^{gt}\bigr) + \lambda_1\,\mathcal{L}^{\mathrm{L1}}_{\mathrm{smooth}}\bigl(\hat{D}^1 - D^{gt}\bigr).$$

Training uses the AdamW optimizer with a one-cycle schedule, batch size 16, and data augmentation via random crops (Xu et al., 4 Dec 2025); a sketch of both loss objectives follows this list.

  • In deblurring, the loss per scale aggregates pixel-wise $L_2$ error, Laplacian error, and frequency-domain difference in the FFT domain:

$$L = \sum_{i=1}^{4} \Bigl( L_c(\hat{I}_i, \bar{I}_i) + \delta\,L_e(\hat{I}_i, \bar{I}_i) + \lambda\,L_f(\hat{I}_i, \bar{I}_i) \Bigr)$$

where $L_f$ compares FFTs of the prediction and ground truth, $L_e$ captures edge differences via a Laplacian operator, and $L_c$ is the $L_2$-based content loss. Training uses the Adam optimizer, cosine annealing of the learning rate, batch size 32, and spatial augmentations (Gao et al., 20 Feb 2025).
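The sketch below illustrates both objectives, assuming a single scale for the deblurring loss (the paper sums it over four scales) and placeholder weights throughout; the exact forms of $L_e$ and $L_f$ are one plausible reading of the description above.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Laplacian kernel for the edge term L_e.
_LAPLACIAN = torch.tensor(
    [[0.0,  1.0, 0.0],
     [1.0, -4.0, 1.0],
     [0.0,  1.0, 0.0]]
).view(1, 1, 3, 3)

def stereo_loss(d0, d1, d_gt, lam0=1.0, lam1=0.3):
    """Two-scale smooth-L1 disparity supervision (weights illustrative)."""
    return lam0 * F.smooth_l1_loss(d0, d_gt) + lam1 * F.smooth_l1_loss(d1, d_gt)

def deblur_loss(pred, target, delta=0.1, lam=0.1):
    """Single-scale composite deblurring loss: content + edge + frequency."""
    # L_c: pixel-wise L2 content error.
    l_c = F.mse_loss(pred, target)

    # L_e: edge error via a per-channel (depth-wise) Laplacian.
    c = pred.shape[1]
    kernel = _LAPLACIAN.to(pred.device, pred.dtype).repeat(c, 1, 1, 1)
    edge_pred = F.conv2d(pred, kernel, padding=1, groups=c)
    edge_tgt = F.conv2d(target, kernel, padding=1, groups=c)
    l_e = F.l1_loss(edge_pred, edge_tgt)

    # L_f: difference of FFT magnitudes.
    l_f = F.l1_loss(torch.fft.rfft2(pred).abs(),
                    torch.fft.rfft2(target).abs())

    return l_c + delta * l_e + lam * l_f
```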

4. Quantitative Performance and Efficiency

MAFNet achieves competitive or superior results with lower computational demands due to its frequency-aware architectural design. In stereo matching, MAFNet reaches D1-all = 1.82% on KITTI 2015 with only 39.4 GFLOPs, outperforming prior 2D-convolution methods such as HITNet and AANet+ in both accuracy and efficiency (Xu et al., 4 Dec 2025). The inclusion of both AFFA and AFHF modules provides further improvements, as shown in ablation results.

In deblurring, MAFNet and its larger variant MAFNet-B set new SOTA on GoPro, HIDE, RealBlur-R, and RealBlur-J datasets. For instance, MAFNet-B attains 34.25 dB PSNR and 0.971 SSIM on GoPro, and 31.92 dB/0.949 on HIDE, exceeding comparable models including MPRNet, Restormer, and MR-LPFNet (Gao et al., 20 Feb 2025).

| Method   | Dataset    | PSNR (dB) | SSIM  | FLOPs (G) | Params (M) |
|----------|------------|-----------|-------|-----------|------------|
| MAFNet   | KITTI 2015 | —         | —     | 39.4      | 10.36      |
| MAFNet-B | GoPro      | 34.25     | 0.971 | —         | —          |
| HITNet   | KITTI 2015 | —         | —     | 50.23     | 0.42       |
| MR-VNet  | GoPro      | 34.04     | 0.969 | —         | —          |

Performance gains are consistently attributed to the explicit handling and fusion of frequency-specific information.

5. Advantages, Limitations, and Future Directions

The primary advantage of the MAFNet approach is its ability to decouple, adaptively weight, and efficiently fuse frequency-domain information using computationally lightweight filters and low-rank attention, obviating the need for 3D convolutions. This results in models suitable for deployment on resource-constrained (e.g., mobile or embedded) platforms without substantial loss of accuracy.

However, the two-band decomposition is coarse; finer or dynamically learned frequency partitions may offer further improvements in detail preservation and selective enhancement (Xu et al., 4 Dec 2025). Quantization and pruning could further lower runtime cost. Extending adaptive fusion to true multi-scale spatial–frequency representations remains an active area for exploration.

A plausible implication is that the MAFNet paradigm is generalizable: networks incorporating adaptive frequency decomposition and fusion outperform spatial-only and frequency-blind models in vision tasks characterized by simultaneous smooth region and edge detail processing requirements.

6. Relationship to Broader Frequency-Aware and Attention Models

MAFNet differentiates itself from conventional channel or spatial attention mechanisms by explicitly modeling frequency composition via learnable or analytic transforms. The Linformer-based fusion used in MAFNet allows global contextual integration with sub-quadratic computational cost, addressing the limitations of traditional self-attention in high-dimensional cost volumes (Xu et al., 4 Dec 2025). In image deblurring, joint spatial-frequency gating and cross-attention improve the learning of complementary features versus approaches that fuse domains post-hoc or with static filters (Gao et al., 20 Feb 2025).

This approach is related in spirit to networks employing wavelet attention (Huang et al., 7 Feb 2025), where discrete wavelet transforms and frequency-specific attention mechanisms yield improvements for pansharpening and remote sensing tasks. The consistency in adopting explicit frequency band splitting and adaptive band fusion across domains suggests that Multi-frequency Adaptive Fusion Networks are a robust unifying motif in recent vision architectures.
