A Fourier Space Perspective on Diffusion Models (2505.11278v1)

Published 16 May 2025 in stat.ML, cs.CV, cs.LG, and stat.ME

Abstract: Diffusion models are state-of-the-art generative models on data modalities such as images, audio, proteins and materials. These modalities share the property of exponentially decaying variance and magnitude in the Fourier domain. Under the standard Denoising Diffusion Probabilistic Models (DDPM) forward process of additive white noise, this property results in high-frequency components being corrupted faster and earlier in terms of their Signal-to-Noise Ratio (SNR) than low-frequency ones. The reverse process then generates low-frequency information before high-frequency details. In this work, we study the inductive bias of the forward process of diffusion models in Fourier space. We theoretically analyse and empirically demonstrate that the faster noising of high-frequency components in DDPM results in violations of the normality assumption in the reverse process. Our experiments show that this leads to degraded generation quality of high-frequency components. We then study an alternate forward process in Fourier space which corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation, and demonstrate marked performance improvements on datasets where high frequencies are primary, while performing on par with DDPM on standard imaging benchmarks.

Summary

- The paper shows that the standard DDPM forward process, which adds white noise of equal variance at every frequency, corrupts high-frequency components faster than low-frequency ones, so the reverse process generates low frequencies before high-frequency details.
- It introduces EqualSNR, a forward process that modifies the noise covariance in Fourier space so that all frequency components share the same signal-to-noise ratio at every timestep.
- Experiments show that EqualSNR improves high-frequency fidelity while performing on par with DDPM on standard imaging benchmarks.

This paper, "A Fourier Space Perspective on Diffusion Models" (2505.11278), investigates the inductive bias of the standard Denoising Diffusion Probabilistic Models (DDPM) forward process from the perspective of Fourier space. It highlights how the spectral properties of common data modalities such as images, audio, and proteins shape the diffusion process and, in turn, generation quality, particularly for high-frequency components.

The core observation is that data modalities where diffusion models excel typically follow a Fourier power law: the variance and magnitude of frequency components decay rapidly with increasing frequency (see Figure 1 [left]). The standard DDPM forward process adds white Gaussian noise, which has equal variance across all frequency components. Applied to data with this spectral decay, the uniform noise corrupts high-frequency components much faster and earlier, in terms of Signal-to-Noise Ratio (SNR), than low-frequency components (\Cref{sec:ddpm}, \Cref{fig:fig1} [center], \Cref{fig:snr_heatmaps} forward). The SNR of the $i$-th frequency component at time $t$ in DDPM is given by:

$s_t^{\text{DDPM}}(i) = \frac{\overline{\alpha}_t \mathbf{C}_{i}}{1 - \overline{\alpha}_t}$

where $\mathbf{C}_i$ is the signal variance of frequency $i$ in Fourier space, and $\overline{\alpha}_t$ is the cumulative signal scaling at time $t$. Since $\mathbf{C}_i$ is much smaller for high frequencies, their SNR is proportionally lower at every timestep, so they are drowned in noise much earlier than low frequencies.
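To make the formula concrete, the following minimal sketch evaluates the per-frequency DDPM SNR for a hypothetical spectrum. The variances `C` (a $1/f^2$ decay over eight frequencies) and the value of `alpha_bar` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def ddpm_snr(alpha_bar, C):
    """Per-frequency SNR under white-noise DDPM: alpha_bar * C_i / (1 - alpha_bar)."""
    return alpha_bar * C / (1.0 - alpha_bar)

# Hypothetical signal variances: frequency index 1..8 with a 1/f^2 decay (assumed).
freqs = np.arange(1, 9)
C = 1.0 / freqs**2

snr = ddpm_snr(alpha_bar=0.9, C=C)

# The gap between the lowest and highest frequency equals C[0] / C[-1],
# independent of alpha_bar: high frequencies always sit at a lower SNR.
ratio = snr[0] / snr[-1]
```

Because the ratio between any two frequencies' SNR is fixed by their variances, a high-frequency component crosses any given SNR threshold earlier in the forward process than a low-frequency one.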

This faster noising of high frequencies in the forward process imposes a generation hierarchy in the reverse process: the model learns to generate low-frequency information first, followed by high-frequency details conditionally (\Cref{fig:fwd-bwd-high-low}).

The paper argues that this aggressive noising of high frequencies leads to practical limitations. Specifically, under finite discretization steps (as is the case in practice), the Gaussian assumption for the reverse process distribution $q(x_{t-1} | x_t)$ is violated for high-frequency components. This occurs because the noise variance added relative to the signal variance of high frequencies becomes too large, causing the posterior $q(x_{t-1} | x_t)$ to deviate significantly from a Gaussian distribution (\Cref{sec:technical_explanation}, \Cref{fig:gauss_assumption}). The authors provide a theoretical proposition (\Cref{thm:backward_step}) illustrating this with a simple counterexample where adding noise to a mixture of Gaussians results in a non-Gaussian posterior.
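The effect can be illustrated numerically in one dimension. The sketch below is not the paper's counterexample; it assumes a two-component Gaussian mixture for $x_{t-1}$ (means $\pm 1$, standard deviation 0.3, chosen for illustration) and a single forward step $x_t = \sqrt{\alpha}\, x_{t-1} + \sqrt{1-\alpha}\, \epsilon$, then counts the modes of the posterior $q(x_{t-1} | x_t)$ on a grid for a small and a large noise step.

```python
import numpy as np

def posterior_on_grid(x_t, alpha, grid, means=(-1.0, 1.0), std=0.3):
    """Normalised q(x_{t-1} | x_t) on a grid for a 1D two-component mixture prior.

    Assumed forward step: x_t = sqrt(alpha) * x_{t-1} + sqrt(1 - alpha) * eps.
    """
    prior = sum(np.exp(-0.5 * ((grid - m) / std) ** 2) for m in means)
    likelihood = np.exp(-0.5 * (x_t - np.sqrt(alpha) * grid) ** 2 / (1.0 - alpha))
    post = prior * likelihood
    return post / (post.sum() * (grid[1] - grid[0]))

def count_modes(density):
    """Count strict local maxima, ignoring numerically negligible tails."""
    inner = density[1:-1]
    is_peak = (
        (inner > density[:-2])
        & (inner > density[2:])
        & (inner > 1e-9 * density.max())
    )
    return int(is_peak.sum())

grid = np.linspace(-3.0, 3.0, 2001)

# Small noise step: the likelihood dominates and the posterior has one mode.
modes_small_step = count_modes(posterior_on_grid(0.0, alpha=0.999, grid=grid))

# Large noise step: the bimodal prior shows through, so no single Gaussian fits.
modes_large_step = count_modes(posterior_on_grid(0.0, alpha=0.5, grid=grid))
```

In this toy setting the posterior is unimodal for the small step and bimodal for the large one, which is the kind of deviation from Gaussianity the argument above describes for heavily noised components.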

To address this, the paper proposes a framework for alternate forward processes based on frequency-specific SNR (\Cref{prop:noise_variance}). The key idea is to modify the covariance structure of the noise added in Fourier space. They introduce the EqualSNR process, which adds noise whose variance for each frequency component is proportional to the signal variance of that component ( $\Sigma_{ii} = c C_i$ ). This ensures that the SNR is equal across all frequency components at every timestep:

$s_t^{\text{EqualSNR}}(i) = \frac{\overline \alpha_t C_i}{(1-\overline \alpha_t) c C_i} = \frac{1}{c} \frac{\overline \alpha_t}{(1-\overline \alpha_t)}$

This property implies that information is corrupted at the same rate for all frequencies, removing the inherent generation hierarchy of DDPM (\Cref{fig:fig1} [right], \Cref{fig:snr_heatmaps} forward). The authors also mention FlippedSNR, which attempts to invert the hierarchy, noising low frequencies faster, though this proved difficult to train successfully in experiments (\Cref{app:Further experimental illustrations and results}).
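The cancellation of $C_i$ in the EqualSNR formula can be checked directly. This is a minimal sketch using the same kind of hypothetical $1/f^2$ spectrum as an assumption; `equal_snr` is an illustrative helper name, not an identifier from the paper.

```python
import numpy as np

def equal_snr(alpha_bar, C, c=1.0):
    """Per-frequency SNR when the noise variance at frequency i is c * C_i."""
    signal_var = alpha_bar * C
    noise_var = (1.0 - alpha_bar) * c * C
    return signal_var / noise_var

# Hypothetical spectrum (assumed): variance decays as 1/f^2 over 8 frequencies.
freqs = np.arange(1, 9)
C = 1.0 / freqs**2

# Every entry equals alpha_bar / ((1 - alpha_bar) * c), regardless of C_i.
snr = equal_snr(alpha_bar=0.9, C=C)
```

Whatever spectrum is plugged in, the result is constant across frequencies, so no frequency band is noised ahead of another.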

Implementing these alternate forward processes requires modifying the standard DDPM training and sampling procedures:

Training Algorithm (\Cref{algo:training}):
- The forward process in Fourier space uses noise scaled by the square root of the frequency-wise signal variance, $C^{1/2}$, as required by the desired SNR property.
- The loss function is a mean squared error between the predicted clean sample ($\hat{y}_0$) and the true clean sample ($y_0$) in Fourier space, weighted by the inverse square root of the frequency-wise signal variance ($C^{-1/2}$): $\mathcal{L}_t = \|C^{-1/2} (y_0 - \hat{y}_0)\|^2$. This loss is shown to be an Evidence Lower Bound (ELBO) for the modified process (\Cref{thm:loss_is_ELBO_complex}).
- Crucially, the neural network (typically a U-Net) still operates on the pixel-space representation of the noisy sample: $f_\theta(F^{-1}(y_t), t)$, where $F$ is the Fourier Transform operator. The U-Net output in pixel space is then transformed back to Fourier space for the loss calculation: $\hat{y}_0 = F \circ f_\theta(F^{-1}(y_t), t)$. This preserves the strong inductive bias of the U-Net architecture, which is well suited to pixel-space image data.
- The frequency-wise signal variance $C_i$ is computed empirically from the training data.
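The training steps above can be sketched for a single image as follows. This is an illustration under stated assumptions rather than the paper's implementation: it uses NumPy's orthonormal FFT, takes $c = 1$, draws the Fourier-space noise by transforming white pixel noise and scaling it by $C^{1/2}$ (which assumes `C` is symmetric under frequency negation, as it is for real-valued data), and stands in for the U-Net with an arbitrary callable `denoiser` that receives `alpha_bar` in place of a timestep.

```python
import numpy as np

def fft2(x):
    return np.fft.fft2(x, norm="ortho")

def ifft2(y):
    # Real part only: exact when y is Hermitian-symmetric (assumed here).
    return np.fft.ifft2(y, norm="ortho").real

def equal_snr_training_loss(x0, C, alpha_bar, denoiser, rng):
    """One EqualSNR loss evaluation for a real image x0 of shape (H, W)."""
    y0 = fft2(x0)
    # White pixel noise has unit variance per frequency under the orthonormal
    # FFT, so scaling by sqrt(C) gives noise with per-frequency variance C.
    eps_C = np.sqrt(C) * fft2(rng.standard_normal(x0.shape))
    y_t = np.sqrt(alpha_bar) * y0 + np.sqrt(1.0 - alpha_bar) * eps_C
    # The network sees pixels and predicts the clean image in pixel space.
    x0_hat = denoiser(ifft2(y_t), alpha_bar)
    y0_hat = fft2(x0_hat)
    # Loss: || C^{-1/2} (y0 - y0_hat) ||^2 in Fourier space.
    return np.sum(np.abs((y0 - y0_hat) / np.sqrt(C)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
C = np.ones((8, 8))  # placeholder variances; real ones come from the data

oracle_loss = equal_snr_training_loss(x0, C, 0.5, lambda x_t, t: x0, rng)
zero_pred_loss = equal_snr_training_loss(
    x0, C, 0.5, lambda x_t, t: np.zeros_like(x_t), rng
)
```

With a perfect denoiser the loss is zero, and with `C` set to ones the loss reduces to an ordinary pixel-space squared error by Parseval's theorem, which is a quick sanity check on the transforms.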
Sampling Algorithm (\Cref{algo:sampling}):
- Adapts the DDIM sampling steps to operate in Fourier space.
- Starts with noise sampled based on the frequency-wise signal variance ( $\boldsymbol{\epsilon}_C$ ).
- In each step, the model predicts the clean sample $\hat{y}_0^{(t)}$ from the noisy Fourier-transformed input $y_t$ using the pixel-space U-Net ( $f_\theta({F}^{-1}({y}_t), t)$ ).
- The next noisy sample $y_{t-1}$ is computed based on $y_t$ , $\hat{y}_0^{(t)}$ , and the schedule coefficients $\overline{\alpha}_t$ , ensuring the correct SNR trajectory is followed.
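A minimal sketch of such a sampler is given below. It assumes the standard deterministic DDIM update carries over unchanged to Fourier space, that the final $\overline{\alpha}_T$ is close to zero so sampling can start from pure noise $\boldsymbol{\epsilon}_C$, and that $\overline{\alpha}_0 = 1$. The `denoiser` callable is a stand-in for the trained pixel-space network and receives `alpha_bar` in place of a timestep.

```python
import numpy as np

def fft2(x):
    return np.fft.fft2(x, norm="ortho")

def ifft2(y):
    # Real part only: exact when y is Hermitian-symmetric (assumed here).
    return np.fft.ifft2(y, norm="ortho").real

def equal_snr_ddim_sample(denoiser, C, alpha_bars, rng):
    """Deterministic DDIM-style sampling in Fourier space.

    alpha_bars[k] is the cumulative signal scaling at step k + 1 (decreasing
    in k); the scaling at step 0 is taken to be 1, i.e. clean data.
    """
    # Initial noise eps_C with per-frequency variance C.
    y_t = np.sqrt(C) * fft2(rng.standard_normal(C.shape))
    for k in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[k]
        a_prev = alpha_bars[k - 1] if k > 0 else 1.0
        # Predict the clean sample with the pixel-space network.
        y0_hat = fft2(denoiser(ifft2(y_t), a_t))
        # Implied noise direction, then re-noise to the previous level.
        noise_hat = (y_t - np.sqrt(a_t) * y0_hat) / np.sqrt(1.0 - a_t)
        y_t = np.sqrt(a_prev) * y0_hat + np.sqrt(1.0 - a_prev) * noise_hat
    return ifft2(y_t)

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))
alpha_bars = np.linspace(0.99, 0.01, 10)

# With an oracle denoiser that always returns `target`, sampling recovers it.
sample = equal_snr_ddim_sample(
    lambda x_t, a: target, np.ones((8, 8)), alpha_bars, rng
)
```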
Calibration (\Cref{subsec:calibration}): To ensure fair comparison, different forward processes are calibrated to have the same average SNR across frequencies at each timestep by adjusting their respective mixing coefficients $\overline{\alpha}_t$ .
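One way such a calibration could look is sketched below. The summary does not specify which average is used or which process is held fixed, so this example assumes an arithmetic mean over frequencies and keeps the DDPM schedule fixed while solving for the EqualSNR schedule; `calibrate_equal_snr` is a hypothetical helper name.

```python
import numpy as np

def calibrate_equal_snr(alpha_bar_ddpm, C, c=1.0):
    """EqualSNR alpha_bar whose SNR matches DDPM's frequency-averaged SNR."""
    alpha_bar_ddpm = np.asarray(alpha_bar_ddpm, dtype=float)
    # DDPM average SNR over frequencies (arithmetic mean, an assumption).
    avg_snr = alpha_bar_ddpm / (1.0 - alpha_bar_ddpm) * np.mean(C)
    # Solve (1 / c) * a / (1 - a) = avg_snr for a.
    return c * avg_snr / (1.0 + c * avg_snr)

# Hypothetical spectrum and DDPM schedule values (assumed for illustration).
freqs = np.arange(1, 9)
C = 1.0 / freqs**2
alpha_bar_ddpm = np.array([0.9, 0.5, 0.1])

alpha_bar_equal = calibrate_equal_snr(alpha_bar_ddpm, C)
snr_equal = alpha_bar_equal / (1.0 - alpha_bar_equal)
snr_ddpm_avg = alpha_bar_ddpm / (1.0 - alpha_bar_ddpm) * C.mean()
```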

The paper experimentally evaluates EqualSNR against standard DDPM on imaging benchmarks (CIFAR10, CelebA, LSUN Church) and a synthetic high-frequency dataset (Dots).

The experiments show:

- The learned reverse process mirrors the forward process's SNR trajectory (\Cref{fig:snr_heatmaps} reverse, \Cref{fig:fwd-bwd-high-low}, \Cref{fig:freq_var_app}).
- Analyzing spectral magnitudes (\Cref{fig:magnitutes-real-vs-generated}) and training a simple classifier on high-frequency features (\Cref{tab:classifier-cifar10-high}) reveals that DDPM-generated samples have noticeable artifacts in high frequencies, making them distinguishable from real data. EqualSNR significantly improves high-frequency generation quality, making its samples much harder to classify as fake based on these features. This has direct practical implications for applications requiring high-fidelity details, such as medical imaging or realistic DeepFake generation.
- On the synthetic Dots dataset (\Cref{fig:dots-intensity}), where high-frequency details are dominant, EqualSNR clearly outperforms DDPM in capturing the spatial distribution and intensity of the sparse white pixels, demonstrating its advantage for high-frequency-centric tasks.
- Despite the improved high-frequency quality, EqualSNR performs on par with DDPM on standard FID benchmarks for imaging datasets (\Cref{tab:results_all-schedules}). The authors suggest FID might not fully capture the high-frequency inaccuracies observed with DDPM. EqualSNR also appears to saturate in performance with fewer sampling steps than DDPM.

Implementation considerations include empirically computing the frequency-wise variances $C_i$ from the training data. Using a standard pixel-space U-Net is a practical choice that leverages existing, highly optimized architectures. The forward and inverse Fourier transforms around the network add computational overhead, but for image sizes where the FFT is efficient this may be acceptable, especially given the potential improvements in generation quality for certain applications.
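Estimating $C_i$ from data can be sketched as below. The exact estimator is not given here, so this example assumes the per-frequency variance of the orthonormal 2D FFT coefficients across the dataset, with the mean subtracted; `estimate_frequency_variance` is a hypothetical helper name and the white-noise "dataset" is synthetic.

```python
import numpy as np

def estimate_frequency_variance(images):
    """Per-frequency variance C for real images with shape (N, H, W)."""
    Y = np.fft.fft2(images, axes=(-2, -1), norm="ortho")
    centered = Y - Y.mean(axis=0, keepdims=True)
    return np.mean(np.abs(centered) ** 2, axis=0)

# Synthetic check: white noise should have (approximately) flat variance.
rng = np.random.default_rng(0)
white = rng.standard_normal((4096, 8, 8))
C_white = estimate_frequency_variance(white)

# For real-valued data, C is symmetric under frequency negation: C[k] == C[-k].
C_negated = np.roll(C_white[::-1, ::-1], 1, axis=(0, 1))
```

On natural images the same estimator would be expected to show the decaying spectrum discussed earlier, while on white noise it returns values close to one at every frequency.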

The paper discusses related work on diffusion in spectral domains, alternate noise schedules focusing on frequency weighting, and the existing observations about the low-to-high frequency generation hierarchy in diffusion models. This work is distinguished by its detailed analysis of how faster noising affects the Gaussian assumption and its proposal and evaluation of a practical, hierarchy-free forward process.

In conclusion, the paper provides a practical Fourier-space framework for understanding and improving diffusion models. It demonstrates that explicitly accounting for data's spectral properties during the forward process, particularly by ensuring a uniform noising rate across frequencies (EqualSNR), can lead to substantially better high-frequency generation quality while maintaining performance on standard benchmarks. This has significant practical implications for domains where accurate high-frequency details are crucial, including potential safety concerns related to generating highly realistic synthetic data.
