
FMix: Frequency-Domain Augmentation & Fusion

Updated 21 January 2026
  • FMix is a frequency-domain approach that uses FFT to generate smooth, sample-specific binary masks and learnable feature fusion blocks for data augmentation and denoising.
  • It employs thresholded Fourier-space noise and adaptive frequency weighting to preserve local details while enhancing mask diversity and signal filtering.
  • Empirical results across image, audio, and 3D tasks demonstrate FMix's superior performance and efficiency compared to spatial-domain methods like MixUp and CutMix.

FMix is an umbrella term for two distinct modules in deep learning, both based on operating in the frequency domain via the Fast Fourier Transform (FFT): (1) a Mixed Sample Data Augmentation (MSDA) method that produces sample-specific, contiguous binary masks by thresholding low-frequency images formed from Fourier-space noise, and (2) a learnable frequency-domain feature fusion block for image restoration and denoising. FMix, in both senses, leverages the ability of the Fourier domain to modulate information at different spatial scales, enabling mechanisms for either data mixing or feature gating that outperform prior spatial-domain approaches on several vision, audio, and 3D tasks.

1. Mask-Based Data Augmentation: Formal Definition and Construction

FMix as an MSDA method generates random binary masks by thresholding smooth images sampled from the Fourier domain. Given input samples x_i, x_j \in \mathbb{R}^{C \times H \times W} and a mixing coefficient \lambda \in [0,1], FMix defines the augmented instance as

\tilde{x} = M \odot x_i + (1 - M) \odot x_j,

where M \in \{0,1\}^{H \times W} is a binary mask with mean \lambda, and \odot denotes elementwise multiplication broadcast across channels. The mask is constructed as follows:

  1. Fourier Noise Sampling: A complex white-noise field in the frequency domain, Z \sim \mathcal{N}(0,I) + i\,\mathcal{N}(0,I), is amplitude-modulated by a decay factor over frequencies:

\widehat{Z}[u,v] = \frac{Z[u,v]}{(\mathrm{freq}[u,v])^{\delta}},

where \mathrm{freq}[u,v] is the \ell_2 frequency magnitude at bin (u,v) and \delta > 0 attenuates higher spatial frequencies. The inverse FFT and real-part extraction yield a spatially smooth mask candidate:

G(x,y) = \Re\left(\mathcal{F}^{-1}\{\widehat{Z}\}(x,y)\right).

  2. Thresholding: The final binary mask M is obtained by selecting the top \lambda HW pixels of G:

M(x,y) = \begin{cases} 1, & \text{if } G(x,y) \in \mathrm{top}_{\lambda HW}(G) \\ 0, & \text{otherwise} \end{cases}

  3. Target Mixing: Targets are mixed using the same \lambda:

\tilde{y} = \lambda y_i + (1 - \lambda) y_j.

Pseudocode is as follows:

import numpy as np

def fmix_mask(shape, λ, δ, ε=1e-8):
    # Complex white noise in the frequency domain
    Z = np.random.normal(size=shape) + 1j * np.random.normal(size=shape)
    # Radial (ℓ2) frequency magnitude per bin; ε guards the zero-frequency bin
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    freq = np.sqrt(fy**2 + fx**2)
    # Attenuate high frequencies, then return to the spatial domain
    G = np.real(np.fft.ifft2(Z / (freq**δ + ε)))
    # Binarize: the top λ·H·W pixels of G become 1
    K = round(λ * np.prod(shape))
    τ = np.partition(G.ravel(), -K)[-K]
    M = (G >= τ).astype(float)
    return M

(Harris et al., 2020)
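Applying the mask to an input pair and its targets completes the augmentation. The following is a minimal, self-contained NumPy sketch of the full step; the function name fmix_pair and the default hyperparameters are illustrative, not from the reference implementation:

```python
import numpy as np

def fmix_pair(x_i, x_j, y_i, y_j, lam=0.5, delta=3.0, eps=1e-8):
    """Mix two samples of shape (C, H, W) with a Fourier-noise mask of mean lam."""
    H, W = x_i.shape[-2:]
    # Smooth mask candidate from decayed Fourier noise (steps 1-2 above)
    Z = np.random.normal(size=(H, W)) + 1j * np.random.normal(size=(H, W))
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    G = np.real(np.fft.ifft2(Z / (np.sqrt(fy**2 + fx**2) ** delta + eps)))
    # Threshold at the K-th largest value so the mask has mean ~= lam
    K = round(lam * H * W)
    tau = np.partition(G.ravel(), -K)[-K]
    M = (G >= tau).astype(x_i.dtype)
    # Mix inputs spatially (mask broadcasts over channels) and targets linearly
    x_mix = M * x_i + (1 - M) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return x_mix, y_mix

x_i, x_j = np.random.rand(3, 32, 32), np.random.rand(3, 32, 32)
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = fmix_pair(x_i, x_j, y_i, y_j, lam=0.7)
```

In practice this is applied per batch with a fresh mask and a fresh \lambda per sample, so no two augmented instances share the same blob pattern.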

2. Theoretical Properties and Comparison with Prior MSDAs

FMix is positioned in contrast with MixUp and CutMix, two other MSDA paradigms:

  • MixUp linearly interpolates all pixels (\tilde{x} = \lambda x_i + (1-\lambda) x_j), reducing mutual information between learned features and the raw data. This compression can increase adversarial robustness (e.g., to DeepFool and uniform noise) but prevents encoding of spatially local features.
  • CutMix employs axis-aligned rectangular masks, preserving local sample-specific content and mutual information, but is limited in the spatial variety of augmentations.
  • FMix generalizes spatial masking to arbitrary contiguous blobs, vastly increasing mask diversity while preserving high local mutual information, thereby affording both regularization and retention of discriminative features. Unlike MixUp, FMix does not induce systematic compression of latent representations.

Empirical studies using mutual information via VAE-based analysis confirm these theoretical distinctions, attributing superior generalization and stability—particularly under spatial-domain corruption and distribution shift—to FMix-augmented models (Harris et al., 2020).
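For concreteness, the two baselines FMix is contrasted with reduce to a few lines each. A minimal NumPy sketch follows; the rectangle-size rule uses CutMix's \sqrt{1-\lambda} convention, and the helper names are illustrative:

```python
import numpy as np

def mixup(x_i, x_j, lam):
    # Every pixel is a convex combination: no spatially local content survives intact
    return lam * x_i + (1 - lam) * x_j

def cutmix_mask(H, W, lam, rng):
    # Axis-aligned rectangle covering ~(1 - lam) of the area is taken from x_j;
    # side lengths scale with sqrt(1 - lam) so the area ratio is (1 - lam)
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    M = np.ones((H, W))
    M[max(0, cy - rh // 2): cy + rh // 2, max(0, cx - rw // 2): cx + rw // 2] = 0
    return M  # 1 where x_i is kept, 0 where x_j is pasted in

rng = np.random.default_rng(0)
M = cutmix_mask(32, 32, 0.5, rng)
```

The contrast with FMix is visible directly in the mask-generation code: CutMix can only produce one axis-aligned rectangle per sample, while FMix's thresholded Fourier noise yields arbitrary contiguous blobs.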

3. Empirical Performance and Modal Diversity

FMix achieves strong accuracy gains across vision, audio, and point cloud tasks without additional training cost relative to MixUp or CutMix. Notable empirical results:

  • CIFAR-10 (PreAct-ResNet18): Baseline: 94.63 ± 0.21, MixUp: 95.66 ± 0.11, CutMix: 96.00 ± 0.07, FMix: 96.14 ± 0.10.
  • ImageNet (ResNet-101, 90 epochs): Baseline: 77.28%, MixUp: 75.89%, CutMix: 76.92%, FMix: 77.70%.
  • Bengali Graphemes: Accuracy improves from 87.60% (baseline) to 91.87% (FMix).
  • Audio (Google Commands): FMix attains 98.59%.
  • 3D/PointNet: FMix increases ModelNet10 accuracy from 89.10% to 89.57%.
  • Hybrid policies alternating FMix with MixUp further improve generalization in most regimes, e.g., PreAct-ResNet18 on CIFAR-10 reaches 96.30% with FMix+MixUp.

FMix demonstrates applicability in 1D (spectrograms, sentiment), 2D (image), and 3D (voxel) input spaces (Harris et al., 2020).

4. Core Hyperparameters, Ablation, and Implementation

Performance is sensitive to the Fourier decay parameter \delta, which controls mask smoothness (blob size). On CIFAR-10, \delta \gtrsim 2 is optimal; \delta < 2 produces noisy masks and degrades performance, with a stable peak near \delta \approx 3. The mixing-ratio prior \alpha in \mathrm{Beta}(\alpha, \alpha) for \lambda is robust, defaulting to \alpha = 1. Thresholding enforces the exact mask mean \lambda for every sample, rather than matching it only in expectation.
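The per-sample hyperparameter flow can be sketched as follows (NumPy, with \delta = 3 and \alpha = 1 per the ablation above); the exact-mean property of thresholding is visible in the final check:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, alpha, delta, eps = 64, 64, 1.0, 3.0, 1e-8

# Draw the mixing coefficient from the Beta(alpha, alpha) prior
lam = rng.beta(alpha, alpha)

# Smooth candidate image from decayed Fourier noise
Z = rng.normal(size=(H, W)) + 1j * rng.normal(size=(H, W))
fy, fx = np.fft.fftfreq(H)[:, None], np.fft.fftfreq(W)[None, :]
G = np.real(np.fft.ifft2(Z / (np.sqrt(fy**2 + fx**2) ** delta + eps)))

# Top-K thresholding: the realized mask mean is K / (H * W), i.e. lam up to rounding
K = round(lam * H * W)
tau = np.partition(G.ravel(), -K)[-K]
M = (G >= tau).astype(float)
```

Lowering delta toward 0 makes freq**delta flat, so G approaches white noise and the mask fragments into speckle, which is the failure mode the ablation identifies for \delta < 2.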

FMix incurs no extra wall-clock overhead compared to CutMix and MixUp. Code for standard deep learning frameworks is published at https://github.com/ecs-vlc/FMix (Harris et al., 2020).

5. Frequency-Domain Feature Filtering: FMix in Multi-View Denoising

A distinct FMix module appears in multi-view denoising networks, notably within Context Receptance Blocks (CRB) for image restoration (Chen et al., 5 May 2025). Here, FMix is a learnable frequency-selective block that processes an input feature tensor x \in \mathbb{R}^{h \times w \times c} via:

  1. 2D FFT per Channel:

x^{F}_{u,v,c} = \sum_{m=0}^{h-1} \sum_{n=0}^{w-1} x_{m,n,c} \exp\left(-2\pi i \left(\frac{um}{h} + \frac{vn}{w}\right)\right)

  2. Learned Frequency Weighting + LeakyReLU:

z_{u,v,c} = \mathrm{LeakyReLU}(W_c \cdot x^{F}_{u,v,c} + b_c)

with W \in \mathbb{C}^{c \times c} (typically diagonal) and b \in \mathbb{C}^{c}.

  3. Inverse 2D FFT to Spatial Domain:

\hat{y}_{m,n,c} = \frac{1}{hw} \sum_{u=0}^{h-1} \sum_{v=0}^{w-1} z_{u,v,c} \exp\left(2\pi i \left(\frac{um}{h} + \frac{vn}{w}\right)\right)

  4. Spectral–Spatial Mixing, Normalization, and Residual:

\mathrm{FMix}(x) = \mathrm{Norm}(\hat{y} \odot x)

Within each CRB, \mathrm{FMix}(x) is combined with x via a residual skip connection: z_1 = \mathrm{Norm}(\mathrm{FMix}(x)) + \alpha_1 x.

The design ensures that (i) frequency-domain weights act as learnable band-pass filters (especially accentuating high/mid frequencies to suppress noise), (ii) Fourier-space selectivity is achieved without additional gating (e.g., no sigmoid/softmax filters), and (iii) spectral–spatial Hadamard product reduces the influence of structurally incoherent features.
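The steps above can be sketched in NumPy under two stated assumptions not fixed by the source: LeakyReLU is applied to the real and imaginary parts of the complex spectrum separately, and per-channel standardization stands in for Norm. The diagonal weights W_diag and bias b would be learned in a real network:

```python
import numpy as np

def leaky_relu(t, slope=0.01):
    return np.where(t > 0, t, slope * t)

def fmix_filter(x, W_diag, b, slope=0.01, eps=1e-6):
    """x: (h, w, c) real features; W_diag, b: (c,) complex parameters."""
    Xf = np.fft.fft2(x, axes=(0, 1))               # per-channel 2D FFT
    Zf = W_diag * Xf + b                           # diagonal frequency weighting
    # Assumption: the nonlinearity acts on real and imaginary parts independently
    Zf = leaky_relu(Zf.real, slope) + 1j * leaky_relu(Zf.imag, slope)
    y_hat = np.real(np.fft.ifft2(Zf, axes=(0, 1))) # back to the spatial domain
    mixed = y_hat * x                              # spectral-spatial Hadamard product
    # Per-channel standardization as a stand-in for Norm
    mu = mixed.mean(axis=(0, 1), keepdims=True)
    sd = mixed.std(axis=(0, 1), keepdims=True)
    return (mixed - mu) / (sd + eps)

h, w, c = 16, 16, 4
x = np.random.default_rng(1).normal(size=(h, w, c))
W_diag = np.ones(c, dtype=complex)   # identity filter as an initialization
b = np.zeros(c, dtype=complex)
out = fmix_filter(x, W_diag, b)
```

Because W_diag acts independently on each frequency bin and channel, training can shape it into the learnable band-pass behavior described above without any explicit gating function.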

Pseudocode:

def FMix(x):
    Xf = FFT2D(x)                    # per-channel 2D FFT
    Zf = LeakyReLU(LinearFreq(Xf))   # learned complex weights W_c, b_c per channel
    y_hat = iFFT2D(Zf)               # inverse FFT back to the spatial domain
    output = Norm(y_hat * x)         # spectral-spatial Hadamard product, then Norm
    return output

(Chen et al., 5 May 2025)

6. Applications, Empirical Impact, and Limitations

FMix-augmentation improves generalization and robustness across diverse settings, achieving state-of-the-art results on vision and sequential benchmarks at no additional computational cost. In image denoising, frequency-domain FMix modules enable explicit attenuation of noise-dominated spectral bands, leading to quantitative gains and inference-time reductions (by up to 40%) in real-world, high-noise scenarios (Chen et al., 5 May 2025).

FMix is not designed to enhance worst-case adversarial robustness, since its focus is on preserving the empirical data distribution, not perturbing it. Applications in self-supervised and contrastive learning are plausible areas for further research. Extending FMix via learned Fourier decay schedules and data-dependent frequency priors is an open direction.

7. Integration, Extensions, and Future Prospects

The FMix family can be integrated with policy-driven augmentation frameworks (e.g., AutoAugment, RandAugment, AugMix) and hybridized with MixUp, capitalizing on their orthogonal impacts on model invariance and representation compression. Open questions include optimal scheduling for mask frequency content, adaptive augmentation for small data regimes, and harnessing FMix in unsupervised/latent variable models.

In summary, FMix provides principled, frequency-domain augmentation and feature-filtering tools that systematically exploit spectral information for both training-time regularization and inference-time noise suppression, with demonstrated efficacy across multiple data modalities and architectures (Harris et al., 2020, Chen et al., 5 May 2025).
