Spectral Bias in Neural Networks

Updated 15 March 2026

Spectral bias in neural networks is the phenomenon where low-frequency (smooth) components are learned significantly faster than high-frequency (detailed) aspects.
Theoretical analyses using NTK spectral decomposition and eigenvalue decay explain the delayed learning of high-frequency modes and guide network architecture choices.
Empirical methods such as discrete Fourier transforms and Walsh-Hadamard analysis validate this bias, shaping regularization techniques and optimization strategies.

Spectral bias in neural networks refers to the empirically and theoretically observed phenomenon that, under gradient-based optimization, neural networks fit low-frequency (smooth) components of a target function significantly faster than high-frequency (oscillatory or detailed) components. This ordering of function-space fitting, widely called the "frequency principle," has profound implications for generalization, robustness, and the design of neural architectures and training procedures. The manifestation, theoretical underpinnings, measurement, and mitigation of spectral bias constitute an active and multi-dimensional research field.

1. Formal Definition and Origin

Spectral bias is operationalized by decomposing a learned function $f(x; \theta)$ in a basis indexed by frequency (e.g., Fourier or spherical harmonics), so that

$f(x) = \int \hat{f}(\omega) e^{i\omega \cdot x} d\omega$

or, for images and other discrete data, by the corresponding discrete transform or by an expansion in eigenfunctions of a relevant kernel. During training, $|\hat f(\omega)|$ at small $|\omega|$ increases rapidly, while high-$|\omega|$ modes grow slowly or remain suppressed for long periods (Fridovich-Keil et al., 2021, Cao et al., 2019, Rahaman et al., 2018).

In overparameterized networks trained with gradient descent or its variants, this behavior is closely linked to the Neural Tangent Kernel (NTK) regime. The training dynamics linearized around initialization reduce to

$f_t(x) - f^*(x) = \exp(-\eta K t)[f_0(x) - f^*(x)]$

where $K$ is the NTK and $\eta$ is the learning rate. The spectrum of $K$ determines convergence rates: modes with larger NTK eigenvalues (usually corresponding to lower frequencies) decay much faster than those with small eigenvalues (higher frequencies), resulting in the characteristic spectral bias (Cao et al., 2019, Dandi et al., 2021, Fang et al., 2024).
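
To make the role of the kernel spectrum concrete, the following minimal sketch applies the linearized dynamics above to a small grid of training points. It does not compute a real NTK: the matrix `K` is a hypothetical stand-in built from a cosine basis, with eigenvalues assumed to decay as $(k+1)^{-2}$, and the helper names `cosine_basis` and `evolve_residual` are illustrative only.

```python
import numpy as np

def cosine_basis(n):
    """Orthonormal DCT-II basis on n grid points; column k has k half-periods."""
    j = np.arange(n)
    k = np.arange(n)
    B = np.cos(np.pi * np.outer(j + 0.5, k) / n)
    B[:, 0] *= np.sqrt(1.0 / n)
    B[:, 1:] *= np.sqrt(2.0 / n)
    return B

def evolve_residual(K, r0, lr, t):
    """Apply r_t = exp(-lr * K * t) r_0 via the eigendecomposition of symmetric K."""
    lam, V = np.linalg.eigh(K)
    return V @ (np.exp(-lr * lam * t) * (V.T @ r0))

n = 16
B = cosine_basis(n)
lam = (np.arange(n) + 1.0) ** -2      # assumed toy spectrum, not a real NTK
K = B @ np.diag(lam) @ B.T            # kernel whose eigenvectors are the cosine modes

r0 = B @ np.ones(n)                   # equal initial error in every mode
for t in (1.0, 10.0, 100.0):
    coeffs = B.T @ evolve_residual(K, r0, lr=1.0, t=t)
    print(t, np.round(coeffs[:4], 3))  # remaining error in the 4 lowest modes
```

In this toy setting the coefficient of mode $k$ at time $t$ is $\exp(-\eta \lambda_k t)$, so the smooth modes vanish first while the oscillatory ones retain most of their initial error.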

2. Theoretical Analysis

Spectral bias has been rigorously characterized in several frameworks:

NTK Spectral Decomposition: Gradient dynamics decompose along the NTK eigenfunctions. For inputs distributed uniformly on the unit sphere, the eigenfunctions are spherical harmonics of degree $\ell$, and the NTK eigenvalues decay polynomially with degree, $\mu_\ell \sim \ell^{-(d+1)}$ (Cao et al., 2019, Choraria et al., 2022). Because the error in each mode decays exponentially in time at a rate proportional to $\mu_\ell$, modes of low degree, which correspond to smooth, low-frequency variation, are fitted far sooner than modes of high degree.
Finite-Element and Activation-Dependent Perspective: For shallow ReLU networks, the spectral bias can be understood via the spectrum of the finite element mass matrix induced by the activation function. ReLU induces a strong low-frequency bias, assessed via eigenvalue ratios that scale as $(n/j)^4$ for modes of index $j$ in a network of width $n$, whereas the piecewise linear B-spline ("Hat") activation removes this separation, making convergence rates uniform across frequencies (Hong et al., 2022, Sahs et al., 13 Mar 2025).
Training Dynamics and Geometry: In coordinate-based MLPs, the geometry of activation regions mediates the rate at which gradient descent fits different frequencies. Low-dimensional, dense inputs (e.g., $[0,1]^d$ grids with $d\leq3$) lead to severe under-utilization of activation regions for high-frequency targets, almost completely prohibiting convergence for these components unless sinusoidal positional encodings or similar techniques are used to "densify" the region space (Lazzari et al., 2023).
Layer-wise and Depth Effects: Spectral bias also has a layer-wise character. Initial network layers contribute disproportionately to high-frequency component fitting (with layer-wise NTK eigenvalue ratios scaling as $\ell^2$ between layers for degree-$\ell$ harmonics), though deeper layers ultimately dominate smooth, low-frequency features (Dandi et al., 2021, Yang et al., 2019).
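
As a worked consequence of the NTK picture, the power-law eigenvalue decay can be turned into a rough per-mode fitting time. Assume, purely for illustration, that the decay holds with a degree-independent constant, $\mu_\ell = c\,\ell^{-(d+1)}$, and that the linearized dynamics of Section 1 apply. The residual in degree $\ell$ then shrinks as $\exp(-\eta \mu_\ell t)$, so reducing it to a fraction $\epsilon$ of its initial size takes

$t_\ell(\epsilon) = \frac{\ln(1/\epsilon)}{\eta \mu_\ell} = \frac{\ln(1/\epsilon)}{\eta c}\, \ell^{d+1}$

With $d = 2$, for example, a degree-10 component would need about $10^{3}$ times as long as a degree-1 component to reach the same relative accuracy. Under these assumptions the slowdown is polynomial in $\ell$, even though each individual mode converges exponentially in time.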

3. Empirical Manifestations and Measurement

Several techniques have been developed to measure spectral bias directly:

Label-smoothing Noise Fitting: For image classifiers, label-smoothing with spatial sinusoidal "noise" is used to probe the frequency-dependent learnability. The network’s ability to fit modified labels traces the capacity for different frequencies (Fridovich-Keil et al., 2021).
Pathwise DFT: The DFT of network outputs along linear paths in input space (either within-class or between-class) reveals high-frequency transitions are concentrated between classes, reinforcing the view of neural networks as learning smooth class-conditional functions sharply partitioned at boundaries (Fridovich-Keil et al., 2021).
Walsh-Hadamard and Boolean Fourier Decomposition: Discrete-input domains utilize the Walsh-Hadamard transform to reveal a low-degree bias: lower-degree (simpler) Fourier coefficients are acquired before higher-degree ones, limiting sensitivity to complex multi-bit interactions (Gorji et al., 2023).
Frequency-resolved Error Metrics: In physics-informed learning, error can be decomposed as $E(\omega) = \| P_{\omega}(u_{\theta} - u^*) \|_{L^2}$, demonstrating that low-frequency errors decay rapidly, with high-frequency modes poorly fit unless model and optimizer modifications are introduced (Khodakarami et al., 22 Feb 2026).
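
A frequency-resolved error of the kind described in the last item can be sketched with a discrete Fourier transform. The example below assumes a uniform periodic 1-D grid and takes the projection $P_\omega$ to be a band of DFT coefficients. The function name, the banding scheme, and the normalization are illustrative choices, not those of the cited work.

```python
import numpy as np

def frequency_resolved_error(u_pred, u_true, n_bands):
    """Band-wise error spectrum on a uniform periodic 1-D grid.

    Returns one value per band: the square root of the summed squared magnitudes
    of the one-sided DFT of the error within that band.
    """
    err_hat = np.fft.rfft(np.asarray(u_pred) - np.asarray(u_true)) / len(u_true)
    power = np.abs(err_hat) ** 2
    return np.sqrt([band.sum() for band in np.array_split(power, n_bands)])

n = 128
x = np.arange(n) / n
u_true = np.sin(2 * np.pi * x) + 0.5 * np.sin(2 * np.pi * 20 * x)
u_pred = np.sin(2 * np.pi * x)        # hypothetical model that fit only the smooth part

E = frequency_resolved_error(u_pred, u_true, n_bands=4)
print(np.round(E, 3))                 # error concentrates in the band containing frequency 20
```

Tracking such a vector over training steps gives a simple picture of which bands have converged and which are still underfit.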

4. Effects of Architecture, Training, and Interventions

a. Architecture and Activation Function

Activation Shape: ReLU and similar activations produce mass-matrix spectra favoring low frequencies; polynomials and B-splines can flatten or modify this spectrum (Hong et al., 2022, Sahs et al., 13 Mar 2025, Choraria et al., 2022).
Depth and Width: Shallow networks exhibit more severe low-frequency bias; depth can both alleviate and reverse this trend depending on frequency and target complexity (Yang et al., 2019).
Multiplicative Structures: Polynomial networks (e.g., $\Pi$-Nets) possess a slower NTK eigendecay ($k^{-d/2-2}$) relative to standard MLPs ($k^{-d-1}$), enabling faster fitting of high-frequency content (Choraria et al., 2022).

b. Training Protocols

SGD Variants: Standard SGD maintains a strong low-frequency bias, whereas momentum-based and adaptive optimizers (e.g., SGDM, Adam) and adaptive random Fourier features can substantially diminish it. Momentum recasts the training dynamics as damped oscillators, accelerating convergence of high-$k$ modes (Kammonen et al., 2024, Farhani et al., 2022).
Initialization: Layer-wise scaling of hidden activation breadth (e.g., via SWIM) can be used to "pre-load" different frequency bands into different layers, improving final multi-scale accuracy and accelerating training (Homma et al., 4 Nov 2025).
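
The damped-oscillator view of momentum can be illustrated on a single eigenmode. The sketch below assumes a quadratic loss $\tfrac{1}{2}\lambda r^2$ for the residual $r$ of one mode, with small $\lambda$ standing in for a high-frequency mode, and compares plain gradient descent with heavy-ball momentum. The learning rate, momentum coefficient, and eigenvalues are arbitrary illustrative values; this is a toy model of the mechanism, not a reproduction of the cited results.

```python
def gd_residual(lam, lr, steps):
    """Residual of one mode under plain gradient descent on 0.5 * lam * r**2."""
    r = 1.0
    for _ in range(steps):
        r -= lr * lam * r
    return r

def heavy_ball_residual(lam, lr, beta, steps):
    """Same mode with heavy-ball momentum: a discretized damped oscillator."""
    r, v = 1.0, 0.0
    for _ in range(steps):
        v = beta * v - lr * lam * r   # velocity accumulates past gradients
        r += v
    return r

for lam in (1.0, 0.1, 0.01):          # assumed eigenvalues; small = high frequency
    gd = abs(gd_residual(lam, lr=0.5, steps=100))
    hb = abs(heavy_ball_residual(lam, lr=0.5, beta=0.9, steps=100))
    print(f"lambda={lam}: GD residual {gd:.2e}, heavy-ball residual {hb:.2e}")
```

For the smallest eigenvalue, gradient descent contracts the residual by only $1 - \eta\lambda$ per step, whereas the heavy-ball iterates in this underdamped regime contract by roughly $\sqrt{\beta}$ per step regardless of $\lambda$, which is the sense in which momentum evens out per-mode rates.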

c. Losses and Regularization

Spectral-aware Losses: Explicit inclusion of frequency-domain losses or regularization, such as L1 penalties in Walsh-Hadamard space (Gorji et al., 2023), spectral loss for PDEs (Khodakarami et al., 22 Feb 2026), or the FreLE loss in time series (Sun et al., 29 Oct 2025), enables or enforces the learning of underrepresented frequencies.
Data Augmentation and Distillation: Mixup, AutoAugment, and self-distillation further amplify the within-class vs. between-class frequency separation—suppressing intra-class high frequencies and sharpening class boundaries (Fridovich-Keil et al., 2021).
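
For discrete inputs, the Walsh-Hadamard L1 penalty mentioned above can be evaluated directly when the function is tabulated over all $2^n$ binary inputs. The sketch below is a plain NumPy illustration under that assumption, feasible only for small $n$. The function names are hypothetical, and a training-time version would need a differentiable implementation and, for larger $n$, some approximation that is not shown here.

```python
import numpy as np

def walsh_hadamard(values):
    """Fast Walsh-Hadamard transform of a length-2^n vector, normalized by 2^n.

    values[x] is the function value at the input whose bits are the binary
    digits of x; output[s] is the coefficient of the parity over the bits of s.
    """
    a = np.asarray(values, dtype=float).copy()
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y
            a[i + h:i + 2 * h] = x - y
        h *= 2
    return a / n

def spectral_l1(values):
    """L1 norm of the Walsh-Hadamard coefficients (a sparsity-promoting penalty)."""
    return np.abs(walsh_hadamard(values)).sum()

# Toy target on 3 bits: the parity of bits 0 and 1, a single degree-2 term.
inputs = np.arange(8)
f = np.array([(-1.0) ** bin(x & 0b011).count("1") for x in inputs])
coeffs = walsh_hadamard(f)
degrees = np.array([bin(s).count("1") for s in inputs])  # degree of each coefficient
print(np.round(coeffs, 3), degrees, spectral_l1(f))
```

The degree of a coefficient is the number of set bits in its index, so grouping `coeffs` by `degrees` over training shows whether low-degree terms are acquired first.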

5. Implications for Generalization, Robustness, and Overfitting

Spectral bias acts as a form of implicit regularization, favoring solutions with predominantly low-frequency structure that tend to generalize well. Early-stage training is described by kernel regression under the NTK, focusing on smoother functions; memorization of high-frequency noise occurs much later, often associated with overfitting or the "double descent" phenomenon in test error curves (Zhang et al., 2020).

Late-stage training may even exhibit non-monotonic evolution in the spectrum: after fitting noise, the network's off-manifold high-frequency energy diminishes, yielding a smoother extrapolation away from the data manifold and a second descent in test error (Zhang et al., 2020).

Robustness to adversarial or stochastic noise can trade off with spectral bias: networks fit with adaptive Fourier features capture high-frequencies and resist sparse perturbations, but may overfit to global noise unless regularized or stopped early (Kammonen et al., 2024).

6. Methods to Mitigate or Control Spectral Bias

A variety of strategies have been developed to mitigate, exploit, or control spectral bias:

Input Encoding: Sinusoidal or Fourier encodings (as in NeRFs or Fourier Feature Networks) expand the network's capacity for high-frequency content by enriching the input basis (Lazzari et al., 2023, Yang et al., 2022).
Functional Regularization: L1 penalties in spectral space (Walsh-Hadamard, Fourier, etc.) encourage spectral sparsity and facilitate the fitting of complex targets (Gorji et al., 2023).
Loss Engineering: Spectral-aware losses, such as the binned spectral power (BSP) loss, directly penalize underfitting of high-frequency modes, improving the recovery of fine detail in operator learning, physics-informed modeling, and time-series prediction (Khodakarami et al., 22 Feb 2026, Sun et al., 29 Oct 2025).
Architectural Modifications: Polynomial, multi-grade, or composition-based architectures (e.g., composing several SNNs in MGDL) can be used to build high-frequency structure from combinations of low-frequency components (Fang et al., 2024, Choraria et al., 2022).
Second-order and Curvature-aware Optimization: Quasi-Newton and curvature-preconditioned optimizers flatten per-mode convergence rates, effectively nullifying the NTK-derived frequency hierarchy (Khodakarami et al., 22 Feb 2026, Farhani et al., 2022).
Activation Redesign: Employing activations with flat Fourier spectra (e.g., Hat functions) abolishes differential convergence rates and yields uniform frequency fitting (Hong et al., 2022).
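
Of the strategies above, input encoding is the simplest to show in code. The following is a minimal sketch of a sinusoidal encoding with octave-spaced frequencies, in the style of NeRF positional encodings. The function name, the choice of frequencies, and the feature layout are assumptions for illustration, and the encoded features would be fed to an ordinary MLP that is not shown.

```python
import numpy as np

def fourier_features(x, num_bands):
    """Map inputs in [0, 1]^d to sines and cosines at octave-spaced frequencies.

    x: array of shape (n, d). Returns an array of shape (n, 2 * d * num_bands).
    """
    x = np.asarray(x, dtype=float)
    freqs = np.pi * 2.0 ** np.arange(num_bands)       # pi, 2*pi, 4*pi, ...
    angles = x[:, :, None] * freqs                     # shape (n, d, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(x.shape[0], -1)

points = np.random.default_rng(0).random((5, 2))       # 5 points in [0, 1]^2
encoded = fourier_features(points, num_bands=4)
print(encoded.shape)
```

The number of bands sets the highest frequency made directly available to the first layer, so it acts as a knob on how much high-frequency content the network can fit early in training.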

7. Contexts, Limitations, and Open Problems

Spectral bias has been reported across standard MLPs, CNNs, transformers, and operator-learning architectures, but its severity varies with input manifold geometry, task complexity, and data distribution. Highly structured or curved data manifolds can amplify the high-frequency expressivity of networks, mitigating the bias naturally (Rahaman et al., 2018).

Despite advances in mitigation, balancing spectral bias remains a task- and data-dependent problem. Excessive high-frequency fitting can harm robustness (especially out-of-distribution or under sparse sampling), while excessive bias leads to underfitting of complex targets. Automated, frequency-adaptive regularization and architectural tuning—potentially guided by empirical spectrum monitoring—represent open and promising directions (Fridovich-Keil et al., 2021, Homma et al., 4 Nov 2025, Fang et al., 2024).

Key References:

(Fridovich-Keil et al., 2021) Fridovich-Keil et al., 2021
(Cao et al., 2019) Cao et al., 2019
(Rahaman et al., 2018) Rahaman et al., 2018
(Fang et al., 2024) Xu, 2024
(Khodakarami et al., 22 Feb 2026) Wan and Kovachki, 2026
(Lazzari et al., 2023) Antun et al., 2023
(Choraria et al., 2022) Lee et al., 2022
(Hong et al., 2022) Lou et al., 2022
(Gorji et al., 2023) Gorji et al., 2023
(Zhang et al., 2020) Zhang et al., 2020
(Sun et al., 29 Oct 2025) Xuan et al., 2025
(Homma et al., 4 Nov 2025) Watanabe & Honda, 2025
(Xie et al., 9 Sep 2025) Lee et al., 2025
(Sahs et al., 13 Mar 2025) Keller-Reshef et al., 2025
(Kammonen et al., 2024) Gardner & Schaeffer, 2024
(Dandi et al., 2021) Cornish et al., 2021
(Yang et al., 2019) Yang et al., 2019