
SFANet: Window-Based Frequency/Channel Attention

Updated 29 January 2026
  • Window-based Frequency/Channel Attention is a neural architecture that partitions images into windows and applies frequency or channel focus to capture both local details and long-range dependencies.
  • It reduces computational costs of global self-attention while enhancing texture and structural fidelity using techniques like FFT-based attention and cyclic window shifts.
  • Implementations such as CPWin-SA, WFCA, and SCH demonstrate measurable performance improvements in PSNR and BD-rate across tasks like single image super-resolution, denoising, and learned image compression.

Window-based Frequency/Channel Attention (SFANet) refers to a class of neural architectures and modules that integrate windowed partitioning strategies with frequency and/or channel attention mechanisms to more effectively model both local and global dependencies in visual representation learning. This approach is now prevalent in single image super-resolution (SISR), image denoising, and learned image compression (LIC), where both spatial context and frequency information are crucial. Key techniques include window-based frequency-channel attention (WFCA), channel-partitioned windowed self-attention (CPWin-SA), and spatial-frequency interaction modules (SFIM), often combined with multi-branch and hybrid attention schemes leveraging both the spatial and frequency domains (Tran et al., 2024, Guo et al., 2023, Xu et al., 2024).

1. Theoretical Motivation and Fundamental Principles

Window-based attention mechanisms have been widely adopted as a means to balance computational efficiency and receptive field size. Traditional global self-attention exhibits quadratic complexity and is infeasible for high-resolution images. Windowing localizes computation, but can curtail the model's capacity to capture long-range dependencies and frequency relationships—a critical limitation for visual restoration and compression.

By integrating frequency- or channel-focused attention within each window, models can:

  • Leverage local windowing for efficient spatial modeling,
  • Embed frequency-domain or cross-channel priors for texture and structural fidelity,
  • Correct spatial-frequency mismatches that arise in naïve windowed pipelines,
  • Capture both short- and long-range dependencies through frequency mixing or channel-wise context aggregation.

These design choices exploit the regularities of natural images, where both localized features (edges, textures) and global or frequency-based properties (repetitions, harmonics) are prominent and informative.

2. Core Architectures and Module Design

Several instantiations of window-based frequency/channel attention architectures have been proposed, with details diverging by application and module composition.

CPWin-SA and SFIM for Super-Resolution

The Channel-Partitioned Windowed Self-Attention (CPWin-SA) block, as described in CPAT (Tran et al., 2024), divides input features along the channel dimension and applies three varieties of windowed self-attention:

  • Vertically-Enhanced Window Attention (V-EWin): Windows extend along spatial height (H×ws).
  • Horizontally-Enhanced Window Attention (H-EWin): Windows extend along width (ws×W).
  • Squared Window Multi-Head Self-Attention (W-MSA): Standard square windows (ws×ws).

A one-directional cyclic shift precedes partitioning, allowing information to flow between adjacent windows. Outputs from all three attention varieties are concatenated and combined with depthwise convolutions and multi-layer perceptrons, injecting both non-local and local context.
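The channel split and the three window shapes can be sketched in a few lines of numpy. This is a minimal illustration under our own naming (`partition_windows` is a hypothetical helper, not from the paper), showing only the partitioning geometry, not the attention itself:

```python
import numpy as np

def partition_windows(x, wh, ww):
    """Split a (C, H, W) feature map into non-overlapping (wh, ww) windows.

    Returns an array of shape (num_windows, C, wh, ww). H must be divisible
    by wh and W by ww (implementations typically pad to guarantee this).
    """
    c, h, w = x.shape
    x = x.reshape(c, h // wh, wh, w // ww, ww)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, c, wh, ww)

C, H, W, ws = 12, 32, 32, 8
feat = np.random.randn(C, H, W)
# Channel partition: one third of the channels per attention variety.
fv, fh, fs = np.split(feat, 3, axis=0)

v_windows = partition_windows(fv, H, ws)   # V-EWin: full-height H x ws strips
h_windows = partition_windows(fh, ws, W)   # H-EWin: full-width ws x W strips
s_windows = partition_windows(fs, ws, ws)  # W-MSA: square ws x ws windows

print(v_windows.shape, h_windows.shape, s_windows.shape)
# (4, 4, 32, 8) (4, 4, 8, 32) (16, 4, 8, 8)
```

Note how the strip-shaped windows give each token a full-height (or full-width) receptive field at the cost of fewer, larger windows, while square windows stay cheapest per window.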

Complementing this, the Spatial-Frequency Interaction Module (SFIM) processes features through parallel spatial and frequency branches. The frequency branch applies an FFT, transforms back via iFFT after local convolution and nonlinearity, and fuses with the original spatial features through additive and channel-reducing convolutions (Tran et al., 2024).
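The frequency branch of SFIM can be sketched as an FFT, a learned transform in the frequency domain, an inverse FFT, and additive fusion. The sketch below is an assumption-laden simplification: a single linear channel mix (`w_freq`) stands in for the paper's convolution-plus-nonlinearity stage, and the fusion is plain addition:

```python
import numpy as np

def sfim_frequency_branch(x, w_freq):
    """Sketch of SFIM's frequency path: FFT -> pointwise transform -> iFFT.

    x: (C, H, W) spatial features; w_freq: (C, C) pointwise (1x1-conv-like)
    mixing weights applied in the frequency domain. The actual module uses
    learned convolutions and a nonlinearity; a linear mix stands in here.
    """
    xf = np.fft.fft2(x, axes=(-2, -1))           # to frequency domain
    xf = np.einsum('oc,chw->ohw', w_freq, xf)    # channel mixing (1x1-conv-like)
    y = np.fft.ifft2(xf, axes=(-2, -1)).real     # back to spatial domain
    return x + y                                 # additive fusion with spatial branch

C, H, W = 8, 16, 16
feat = np.random.randn(C, H, W)
w = np.eye(C) * 0.1                              # toy weights for a sanity check
out = sfim_frequency_branch(feat, w)
print(out.shape)  # (8, 16, 16)
```

Because the FFT is linear, the toy identity-scaled weights simply return `1.1 * feat`, which makes the round trip easy to verify.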

WFCA for Image Denoising

The Window-based Frequency Channel Attention (WFCA) block in SFANet for denoising (Guo et al., 2023) splits feature maps into non-overlapping windows and applies a 2D FFT to each window. Separate channel attention mechanisms are applied to the real and imaginary parts of the frequency-domain representation, realized via shared bottleneck MLPs. The reweighted frequency features are then transformed back to the spatial domain using iFFT and reassembled.

This block enables the explicit modeling of local frequency content and channel dependencies, yielding consistent PSNR improvements in texture-rich and benchmark denoising settings.
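A per-window sketch of this idea, under stated assumptions: `wfca_window` is our own name, the bottleneck MLP is reduced to two weight matrices with ReLU and sigmoid, and the exact layer composition in the paper may differ:

```python
import numpy as np

def wfca_window(p, w1, w2):
    """WFCA sketch on a single (C, N, N) window.

    FFT the window, reweight real and imaginary parts separately with a
    channel-attention vector from a shared bottleneck MLP (w1: C x C/r,
    w2: C/r x C), then iFFT back to the spatial domain.
    """
    pf = np.fft.fft2(p, axes=(-2, -1))
    def channel_attn(part):                      # part: (C, N, N), real-valued
        g = part.mean(axis=(-2, -1))             # global average pooling -> (C,)
        a = 1.0 / (1.0 + np.exp(-(np.maximum(g @ w1, 0) @ w2)))  # MLP + sigmoid
        return part * a[:, None, None]
    pf = channel_attn(pf.real) + 1j * channel_attn(pf.imag)
    return np.fft.ifft2(pf, axes=(-2, -1)).real

C, N, r = 8, 16, 4
window = np.random.randn(C, N, N)
w1 = np.random.randn(C, C // r) * 0.1
w2 = np.random.randn(C // r, C) * 0.1
out = wfca_window(window, w1, w2)
print(out.shape)  # (8, 16, 16)
```

Sharing the bottleneck MLP across the real and imaginary parts keeps the parameter count low while still letting magnitude and phase information be reweighted independently.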

Window-based Channel Attention for Compression

In wavelet-enhanced LIC (Xu et al., 2024), the Space-Channel Hybrid (SCH) block alternates between local spatial attention (windowed self-attention) and window-based channel-attention. The latter transposes the standard attention mechanism so that channels serve as tokens and spatial dimensions as features; attention is performed independently within each spatial window. A discrete wavelet transform (DWT) at the input provides frequency-dependent down-sampling and further increases the effective receptive field by representing the input in LL, HL, LH, and HH subbands.
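The transposition at the heart of this design can be sketched as follows. This is a single-window, single-head illustration with illustrative names; the SCH block additionally uses channel positional encoding and an MLP, omitted here:

```python
import numpy as np

def window_channel_attention(x, wq, wk, wv):
    """Sketch of window-based channel attention (channels as tokens).

    x: (C, M, M) window. Spatial positions are flattened into the feature
    axis, so the attention-score matrix is C x C — cost grows with C^2
    rather than with (M*M)^2 as in spatial self-attention.
    """
    c, m, _ = x.shape
    t = x.reshape(c, m * m)                      # tokens: C x (M*M)
    q, k, v = t @ wq, t @ wk, t @ wv             # projections along the spatial axis
    scores = q @ k.T / np.sqrt(q.shape[-1])      # C x C channel-affinity matrix
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over channels
    return (attn @ v).reshape(c, m, m)

C, M = 8, 8
win = np.random.randn(C, M, M)
d = M * M
wq, wk, wv = (np.random.randn(d, d) * 0.05 for _ in range(3))
out = window_channel_attention(win, wq, wk, wv)
print(out.shape)  # (8, 8, 8)
```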

3. Mathematical Formulation and Workflow

A. Window-based Frequency Attention (WFCA):

Given $x \in \mathbb{R}^{B \times C \times H \times W}$, partition into non-overlapping $N \times N$ windows and apply a per-window FFT:

$$x_f(b,c,u,v) = \sum_{h=0}^{N-1} \sum_{w=0}^{N-1} p_n(b,c,h,w)\, e^{-j 2\pi \left( \frac{uh}{N} + \frac{vw}{N} \right)}$$

Complex-valued features are processed via $1 \times 1$ convolutions, separated into real and imaginary parts, and reweighted through channel attention generated by global average pooling and an MLP. Reweighted features are returned to the spatial domain via iFFT and reassembled (Guo et al., 2023).
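The per-window transform here is the standard 2D DFT, so `np.fft.fft2` applied to each window reproduces it exactly. A quick numeric check (single window, single channel for brevity):

```python
import numpy as np

# Numeric check: the per-window DFT in the WFCA formulation is the standard
# 2D DFT, so np.fft.fft2 on an N x N window reproduces it term for term.
N = 4
p = np.random.randn(N, N)  # one window; batch/channel indices dropped

xf = np.zeros((N, N), dtype=complex)
for u in range(N):
    for v in range(N):
        for h in range(N):
            for w in range(N):
                xf[u, v] += p[h, w] * np.exp(-2j * np.pi * (u * h / N + v * w / N))

assert np.allclose(xf, np.fft.fft2(p))
print("per-window DFT matches np.fft.fft2")
```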

B. Channel-Partitioned Windowed Self-Attention (CPWin-SA):

For each partition $X_k$, windowed self-attention is performed using learned projections $(Q, K, V)$, followed by cyclic shifts, head-wise assembly, and linear projections. The full block computation includes residual connections, depthwise convolution, and an MLP (with GELU) post-aggregation (Tran et al., 2024).
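The attention core inside each window, contrasted with the channel-as-token variant above, treats pixels as tokens. A single-head sketch without shifts or the surrounding convolution/MLP (names are illustrative, not from the paper):

```python
import numpy as np

def window_self_attention(x, wq, wk, wv):
    """Sketch of windowed spatial self-attention on one (C, wh, ww) window.

    Pixels inside the window are tokens; C is the feature dimension.
    Single head for brevity — the full block adds multiple heads, cyclic
    shifts, residuals, depthwise convolution, and an MLP.
    """
    c, wh, ww = x.shape
    t = x.reshape(c, wh * ww).T                  # tokens: (wh*ww) x C
    q, k, v = t @ wq, t @ wk, t @ wv             # learned projections
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (wh*ww) x (wh*ww)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over pixel tokens
    return (attn @ v).T.reshape(c, wh, ww)

C, ws = 16, 8
win = np.random.randn(C, ws, ws)
wq, wk, wv = (np.random.randn(C, C) * 0.05 for _ in range(3))
out = window_self_attention(win, wq, wk, wv)
print(out.shape)  # (16, 8, 8)
```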

C. Window-based Channel Attention Mechanism:

Divide features into $M \times M$ windows, transpose the spatial and channel axes, and apply attention over channels using query/key/value projections. Reconstruction involves channel positional encoding and an MLP. This formulation applies especially to image compression with frequency-preconditioned (DWT) input (Xu et al., 2024).

4. Empirical Performance, Complexity, and Ablation

Window-based frequency/channel attention consistently improves performance across image restoration and compression tasks. Representative results include:

| Method/Module | PSNR / SSIM | Dataset / Task | Reference |
|---|---|---|---|
| CPAT (with SFIM, ×2 SR) | 34.76 dB / – | Urban100 / SISR | (Tran et al., 2024) |
| HAT (prior SOTA) | 34.45 dB / – | Urban100 / SISR | (Tran et al., 2024) |
| SFANet (with WFCA) | 31.67 dB / 0.9146 | Urban100 / Denoising | (Guo et al., 2023) |
| Restormer | 31.46 dB / 0.9121 | Urban100 / Denoising | (Guo et al., 2023) |
| SCH (LIC, BD-rate) | −18.54% to −24.71% | Multiple / Compression | (Xu et al., 2024) |

Selected ablations demonstrate that window-based methods outperform their full-image, plain convolutional, or non-windowed attention counterparts by 0.2–0.3 dB PSNR in textures and yield BD-rate improvements of over 20% in compression scenarios. Notably, the use of windowed frequency transforms mitigates full-image FFT mismatch and channel attention on both real and imaginary parts provides measurable benefit (Guo et al., 2023, Tran et al., 2024, Xu et al., 2024).

Windowing reduces the $O(H^4)$ cost of global attention to $O(H^3 w_s)$ in vertical windowed schemes, or $O(P \cdot C^2)$ for WFCA, where $P$ is the number of windows and $C$ the number of channels (Guo et al., 2023, Tran et al., 2024).
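These scalings follow from counting attention-score entries, and can be checked with quick arithmetic on a square $H \times H$ feature map (the constants and values below are illustrative, not taken from any paper):

```python
# Back-of-the-envelope scaling of attention-score counts on a square H x H
# feature map with window size ws: global attention compares all H^2 tokens
# with each other (~H^4 entries); windowing compares only tokens sharing a
# window.
H, ws = 256, 16

global_scores = (H * H) ** 2                          # H^4
num_strips = H // ws
vertical_win_scores = num_strips * (H * ws) ** 2      # (H/ws) * (H*ws)^2 = H^3 * ws
square_win_scores = (H // ws) ** 2 * (ws * ws) ** 2   # (H/ws)^2 * ws^4 = H^2 * ws^2

print(f"global:   {global_scores:.2e}")
print(f"vertical: {vertical_win_scores:.2e}")
print(f"square:   {square_win_scores:.2e}")
```

For H = 256 and ws = 16, vertical strip windows cut the score count by a factor of H/ws = 16 versus global attention, and square windows by a further factor of (H/ws)^2.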

5. Applications and Domain-Specific Variations

Single Image Super-Resolution (SISR)

CPAT, employing CPWin-SA and SFIM, achieves superior performance over the prior state of the art on Urban100 at scale ×2. Enhanced window structures (combining vertical, horizontal, and square) outperform solely square windows, and the inclusion of the frequency branch in SFIM is critical, contributing up to +0.23 dB PSNR (Tran et al., 2024).

Image Denoising

Windowed WFCA in combination with UNet-scale channel width adaptation and alternating spatial–frequency modules yields improvements on noise benchmarks (Urban100, DTD, etc.), with the largest gains in texture-rich settings (Guo et al., 2023).

Learned Image Compression

Incorporation of window-based channel attention within wavelet-enhanced SCH blocks offers substantial BD-rate reductions and PSNR lifts over classical (VTM-23.1) and learned baselines (e.g., SENet, CBAM, DaViT) (Xu et al., 2024).

Generalization and Multi-Scale Strategies

Adopting different window sizes across UNet scales (e.g., $N \in \{64, 32, 16, 8\}$ for WFCA) and hybridizing spatial and channel attention yields robust multi-scale receptive-field modeling, enhancing both local structure and global correlation capture.

6. Design Trade-offs, Hyperparameters, and Implementation

Critical design parameters include window size (trading off context vs. compute), number of heads, dimension splitting (for channel partitioning), and frequency-branch configuration (FFT window size, MLP bottleneck ratio). In practice, defaults such as $w_s = 16$, $N_{\text{heads}} = 6$ (SR), and $r = 4$ (MLP bottleneck) have been shown effective (Tran et al., 2024, Guo et al., 2023).

Training pipelines typically employ large-scale data (e.g., DF2K for SR), $L_1$ or Charbonnier loss on relevant channels, Adam optimization, and test-time self-ensembling for extra gain. For image compression, rate–distortion objectives are standard, with $\lambda$ values spanning the RD curve (Tran et al., 2024, Guo et al., 2023, Xu et al., 2024).

Window-based frequency/channel attention modules generalize and outperform earlier attention constructs, such as global self-attention and spatial-only window attention, by accommodating frequency selectivity and channel interrelations—both of which are essential for high-fidelity image restoration and efficient compression. Alternative designs, such as full-image FFT or classical channel attention (SENet/CBAM), suffer from domain mismatch or limited receptive field, as evidenced by both ablation and quantitative results (Guo et al., 2023, Xu et al., 2024).

Discrete wavelet transforms further strengthen frequency separation by providing explicit multi-scale frequency bands and broad receptive fields at low computational cost, expanding applicability to artifact-sensitive domains such as learned image coding (Xu et al., 2024).


References:

  • "Channel-Partitioned Windowed Attention And Frequency Learning for Single Image Super-Resolution" (Tran et al., 2024)
  • "Spatial-Frequency Attention for Image Denoising" (Guo et al., 2023)
  • "Window-based Channel Attention for Wavelet-enhanced Learned Image Compression" (Xu et al., 2024)
