Fourier Analysis of ReLU Networks
- The paper demonstrates how Fourier methods decompose ReLU activations into preserved baseband, DC components, and higher harmonic terms with diminishing amplitudes.
- It details the distributional Fourier transform of ReLU, revealing rapid high-frequency decay and the dominant influence of low-frequency, DC, and intermodulation effects.
- The study highlights spectral bias, showing that ReLU networks prioritize learning low-frequency components while complex high-frequency approximations require additional network capacity.
Rectified Linear Unit (ReLU) networks, which employ the nonlinearity $\mathrm{ReLU}(x) = \max(0, x)$ at each hidden unit, have become the dominant architecture for deep learning. The Fourier analysis of such networks seeks to rigorously characterize their action on signals and functions in the frequency domain, illuminating how spectral components are transformed, generated, or suppressed by both individual activations and deep network compositions. This perspective reveals distinct mechanisms: the propagation of both original and newly created frequency components, the induction of DC (zero-frequency) features, the pattern of spectral decay, and the complexity of approximating functions with given spectral properties.
1. Spectral Decomposition of the ReLU Nonlinearity
The ReLU activation is a piecewise-linear, non-invertible function that introduces significant nonlinearity in neural networks. Its effect on input signals can be analyzed by expressing the input as a sum of cosines:

$$x(t) = \sum_{i=1}^{N} a_i \cos(\omega_i t + \phi_i).$$

By rewriting ReLU using

$$\mathrm{ReLU}(x) = \frac{x + |x|}{2}, \qquad |x| = \sqrt{x^2},$$

and expanding the square root via a Taylor series, one obtains a power series whose expansion variable is the oscillatory part of $x^2(t)$, a sum of cosines at frequencies $2\omega_i$, $\omega_i + \omega_j$, and $\omega_i - \omega_j$. The zeroth-order term is a DC offset, the first-order term yields first harmonic and intermodulation terms, and higher orders generate higher-order frequency combinations, but with exponentially decaying coefficients. In practice, the output remains nearly band-limited, as very little energy appears at high frequencies (Kechris et al., 2024).
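The spectral structure described above can be checked numerically. The following is a minimal sketch, not the cited authors' code: the tone frequencies, amplitudes, and sampling setup are hypothetical choices, and `bin_amplitude` is an illustrative helper. It applies ReLU to a two-tone signal and reads off individual DFT bins.

```python
import math

FS = 2048                    # assumed sampling rate (Hz); N == FS gives 1 Hz bins
N = 2048
F1, F2 = 31, 47              # hypothetical input tones (Hz)
A1, A2 = 1.0, 0.5            # hypothetical amplitudes

x = [A1 * math.cos(2 * math.pi * F1 * i / FS)
     + A2 * math.cos(2 * math.pi * F2 * i / FS) for i in range(N)]
y = [max(v, 0.0) for v in x]  # ReLU applied sample by sample


def bin_amplitude(samples, k):
    """One-sided amplitude of DFT bin k; k = 0 returns the mean (DC)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    mag = math.hypot(re, im) / n
    return mag if k == 0 else 2 * mag


dc = bin_amplitude(y, 0)
baseband = (bin_amplitude(y, F1), bin_amplitude(y, F2))      # near A1/2 and A2/2
intermod = (bin_amplitude(y, F2 - F1), bin_amplitude(y, F1 + F2))
doubled = (bin_amplitude(y, 2 * F1), bin_amplitude(y, 2 * F2))

# Share of total power (via Parseval) that lies at DC and below 200 Hz.
total_power = sum(v * v for v in y) / N
low_power = dc ** 2 + sum(bin_amplitude(y, k) ** 2 / 2 for k in range(1, 200))
low_fraction = low_power / total_power

print(f"DC={dc:.3f} baseband={baseband}")
print(f"intermod={intermod} doubled={doubled}")
print(f"power fraction below 200 Hz: {low_fraction:.4f}")
```

Under these assumptions, the baseband lines sit close to half the input amplitudes, a positive DC term appears, and lines show up at the sum, difference, and doubled frequencies, with most of the power staying at low frequencies.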
2. Fourier Transform and Distributional Characterization
The classical Fourier transform of the ReLU function, viewed as $x\,H(x)$ (with $H$ the Heaviside function), can be formally derived in the distributional sense. With the convention $\hat f(\omega) = \int f(x)\,e^{-i\omega x}\,dx$,

$$\mathcal{F}[x\,H(x)](\omega) = i\pi\,\delta'(\omega) - \mathrm{p.v.}\,\frac{1}{\omega^2},$$

where $\mathrm{p.v.}$ denotes principal value, and $\delta'$ is the derivative of the Dirac delta. The $1/\omega^2$ roll-off expresses the rapid decay of high-frequency energy, while the singular DC derivative term corresponds to the dominant influence of low frequencies and the static offset produced by ReLU (Kechris et al., 2024).
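The distributional transform can be obtained in two steps from the transform of the Heaviside function. The derivation below is a sketch under an assumed convention, $\hat f(\omega) = \int f(x)\,e^{-i\omega x}\,dx$; other conventions change signs and factors of $2\pi$, and the singular $1/\omega^2$ term is to be read in the regularized (finite-part) sense.

```latex
\begin{align*}
\mathcal{F}[H](\omega) &= \pi\,\delta(\omega) + \mathrm{p.v.}\,\frac{1}{i\omega}, \\
\mathcal{F}[x\,f(x)](\omega) &= i\,\frac{d}{d\omega}\,\hat f(\omega), \\
\mathcal{F}[x\,H(x)](\omega)
  &= i\,\frac{d}{d\omega}\!\left(\pi\,\delta(\omega) + \frac{1}{i\omega}\right)
   = i\pi\,\delta'(\omega) - \frac{1}{\omega^{2}}.
\end{align*}
```

The second line is the usual multiplication-differentiation rule; applying it to the first line gives the $\delta'$ term at zero frequency and the quadratic roll-off away from it.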
3. Spectral Terms: DC and Oscillatory Output Structure
The spectrum of the ReLU output consists of three principal components:
- Preservation at Baseband: Each original frequency appears at half its input amplitude due to the linear term $x/2$ in $\mathrm{ReLU}(x) = (x + |x|)/2$.
- DC Term: Direct calculation yields a closed-form DC amplitude, which generalizes to convolutionally pre-filtered signals through the filter gains at the input frequencies.
- Higher Harmonics: First-order terms appear at twice each input frequency and at the pairwise sum and difference (intermodulation) frequencies, each with a closed-form amplitude. Higher-order harmonic amplitudes decrease rapidly, rendering the ReLU output spectrally concentrated at the original and first few harmonics. The DC and low-order frequencies dominate the energy (Kechris et al., 2024).
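For a single tone, the three components can be written out exactly. The expression below is the standard Fourier series of a half-wave rectified cosine, given as an illustrative special case; the symbols $A$ and $\omega_0$ are notation assumed for this sketch, and the multi-tone amplitudes of the cited work are not reproduced here.

```latex
\mathrm{ReLU}\bigl(A\cos\omega_0 t\bigr)
  = \underbrace{\frac{A}{\pi}}_{\text{DC}}
  + \underbrace{\frac{A}{2}\cos\omega_0 t}_{\text{baseband}}
  + \underbrace{\frac{2A}{\pi}\sum_{k=1}^{\infty}
      \frac{(-1)^{k+1}}{4k^{2}-1}\cos(2k\,\omega_0 t)}_{\text{even harmonics}},
  \qquad A > 0.
```

In this special case the harmonic amplitudes are $2A/(3\pi)$, $2A/(15\pi)$, $2A/(35\pi)$, and so on, so they fall off like $1/k^2$ and the DC and baseband terms carry most of the energy.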
4. Spectral Complexity, Approximation, and Function Spaces
The expressivity of ReLU networks with respect to function approximation is intimately tied to the spectral content of the target function. Classical results quantify how the so-called Fourier–Barron norm, a frequency-weighted integral of the magnitude of the Fourier transform $\hat f$, controls the sample complexity: functions whose high-frequency content decays rapidly are more efficiently approximated. The transition from infinite-width to sparse finite-width networks is precisely characterized by Radon-based norms, which provide sup-norm approximation bounds on compact domains (Domingo-Enrich et al., 2021).
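For orientation, the classical first-order form of such a norm and its associated rate can be stated as follows. This is Barron's original result for sigmoidal units with a squared $L^2(\mu)$ error, shown only as a reference point; the norm and the sup-norm bounds in the cited ReLU work may differ in the frequency weight and in the constants. The symbols $C_f$, $r$, $n$, $f_n$, and $\mu$ are notation assumed here.

```latex
C_f = \int_{\mathbb{R}^d} \|\omega\|\,\bigl|\hat f(\omega)\bigr|\,d\omega,
\qquad
\int_{B_r} \bigl(f(x) - f_n(x)\bigr)^{2}\,\mu(dx) \;\le\; \frac{(2\,r\,C_f)^{2}}{n}.
```

Here $f_n$ is a network with $n$ hidden units, $B_r$ is the ball of radius $r$, and $\mu$ is a probability measure on $B_r$. The rate in $n$ does not depend on the input dimension, while the constant depends on how quickly $|\hat f|$ decays.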
Recent constructive complexity bounds demonstrate that, for any function $f$ with absolutely integrable Fourier transform $\hat f$, there exists a ReLU network with explicitly bounded width and depth that achieves uniform error $\varepsilon$, with only weak logarithmic dependence on the high-frequency “tail” of $\hat f$ (Davis et al., 2024). This aligns the classical Fourier perspective with practical, finite-depth ReLU neural networks.
5. Spectral Bias and Frequency Learning Dynamics
A continuous piecewise linear (CPWL) ReLU network expresses functions as a sum over polytopic regions, each contributing a specific affine piece. The exact Fourier transform of such networks decays polynomially in the frequency magnitude in most directions, with slower decay along exceptional directions given by certain face normals. This results in a pronounced bias toward functions with energy at low frequencies, a phenomenon termed spectral bias. Empirical results confirm that, during gradient-based optimization, low-frequency components are learned first, and only as training progresses do higher-frequency components emerge in the network output, regardless of their relative amplitude in the target function (Rahaman et al., 2018).
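In one dimension the polynomial decay can be seen in closed form. The sketch below is an illustration only, not the general theorem of the cited work: it builds a triangular "hat" from three ReLU units (a hypothetical toy network), integrates it against a cosine numerically, and compares the result with the closed form $2(1 - \cos\omega)/\omega^2$, which is bounded by $4/\omega^2$.

```python
import math


def relu(v):
    return max(v, 0.0)


def hat(x):
    """Triangle supported on [-1, 1], written as a three-unit ReLU network."""
    return relu(x + 1.0) - 2.0 * relu(x) + relu(x - 1.0)


def fourier_numeric(omega, cells=20000):
    """Midpoint-rule estimate of the integral of hat(x) * cos(omega * x) on [-1, 1]."""
    h = 2.0 / cells
    total = 0.0
    for j in range(cells):
        x = -1.0 + (j + 0.5) * h
        total += hat(x) * math.cos(omega * x)
    return h * total


def fourier_exact(omega):
    """Closed form of the same integral: 2 * (1 - cos(omega)) / omega**2."""
    return 2.0 * (1.0 - math.cos(omega)) / omega ** 2


for omega in (5.0, 20.0, 50.0):
    print(omega, fourier_numeric(omega), fourier_exact(omega), 4.0 / omega ** 2)
```

Because the hat is even, its transform is real and equals this cosine integral. The quadratic envelope reflects the kinks of the piecewise linear function; in this toy case there is only one direction, so the anisotropy discussed above does not appear.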
Manifold complexity further modulates spectral bias: data concentrated on oscillatory or curved manifolds can enable faster acquisition of high frequencies in latent coordinates. However, in high ambient dimension, parameter perturbations robustly preserve low-frequency modes, while high frequencies—occupying a vanishing fraction of parameter space—are fragile to changes (Rahaman et al., 2018).
6. Empirical Validation and Role of DC Component in CNNs
Numerical experiments validate the frequency-domain expansion of ReLU outputs: the pointwise relative RMSE between the analytic expansion and the actual ReLU output is small with as few as five Taylor terms, and frequency spectra match theoretical predictions for both synthetic and learned filters. In convolutional neural networks (CNNs), the initial ReLU layer consistently injects a DC feature that is linearly separable by global pooling or fully connected mappings.
Comparative training experiments show that adding the DC feature to a linear network achieves the same rapid convergence and low training loss as networks incorporating ReLU. The Euclidean distance between initial and trained weights remains small in architectures where the DC is available, revealing that this feature allows models to solve frequency-discrimination tasks using weights close to their random initialization, without requiring substantial filter modification. A single random convolution followed by ReLU can achieve perfect class separation of sinusoids solely via the DC channel (Kechris et al., 2024).
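The role of the DC channel can be illustrated with a short sketch. It assumes unit-amplitude cosine inputs, circular convolution, and a hypothetical five-tap filter drawn with a fixed seed; none of these choices come from the cited experiments. For a pure cosine, the filter output is again a cosine scaled by the filter gain at that frequency, and the mean of a half-wave rectified cosine is its amplitude divided by $\pi$, so the DC value after ReLU should track the gain.

```python
import cmath
import math
import random

N = 2048                                          # samples per signal (assumed)
rng = random.Random(0)                            # fixed seed for repeatability
taps = [rng.gauss(0.0, 1.0) for _ in range(5)]    # hypothetical random filter


def dc_after_conv_relu(freq_bin):
    """Mean of ReLU(circular_conv(taps, x)) for a unit cosine x at bin freq_bin."""
    x = [math.cos(2 * math.pi * freq_bin * i / N) for i in range(N)]
    z = [sum(h * x[(i - m) % N] for m, h in enumerate(taps)) for i in range(N)]
    return sum(max(v, 0.0) for v in z) / N


def filter_gain(freq_bin):
    """Magnitude of the filter's frequency response at bin freq_bin."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq_bin * m / N)
                   for m, h in enumerate(taps)))


classes = (31, 301)                               # two hypothetical frequency classes
dc_values = [dc_after_conv_relu(f) for f in classes]
predicted = [filter_gain(f) / math.pi for f in classes]

for f, dc, pred in zip(classes, dc_values, predicted):
    print(f"bin {f}: DC after ReLU = {dc:.5f}, gain / pi = {pred:.5f}")
```

Whenever the filter gains at the two class frequencies differ, a threshold on this single DC value is enough to tell the classes apart, which is the mechanism the paragraph above describes.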
7. Impact on Network Design and Downstream Applications
The DC term dominates the spectral summary provided by ReLU transformations—acting effectively as an automatic “feature extractor” that facilitates global-average pooling and efficient downstream linear discrimination. This mechanism persists in real-world models and tasks, such as cardiac PPG signal frequency estimation. The higher-order harmonics created by ReLU are rapidly attenuated by subsequent low-pass filtering or pooling, ensuring representational stability. Because the DC component depends smoothly on the input spectrum, ReLU-equipped networks can capitalize on perturbative regimes around random initializations, explaining robust and efficient learning in practice (Kechris et al., 2024).
The global, low-frequency dominance encoded in the CPWL structure of ReLU networks also clarifies the observed spectral bias in practical learning scenarios: while universal in principle, ReLU networks require either large Lipschitz constants or many activation regions to model strong high-frequency features (Rahaman et al., 2018; Domingo-Enrich et al., 2021; Davis et al., 2024). As a result, spectral properties concretely limit the efficiency with which rapidly oscillating or discontinuous targets can be learned or approximated.
References:
- (Kechris et al., 2024) "DC is all you need: describing ReLU from a signal processing standpoint"
- (Rahaman et al., 2018) "On the Spectral Bias of Neural Networks"
- (Domingo-Enrich et al., 2021) "Tighter Sparse Approximation Bounds for ReLU Neural Networks"
- (Davis et al., 2024) "Approximation Error and Complexity Bounds for ReLU Networks on Low-Regular Function Spaces"