Neural Spectral Bias in Networks

Updated 19 May 2026

Neural spectral bias is the tendency for networks to prioritize smooth, low-frequency components, which accelerates learning for simple functions.
The bias arises from rapid eigenvalue decay in the neural tangent kernel and is evident in areas from image processing to reinforcement learning.
Mitigation methods like Fourier feature encodings and alternative activations can help accelerate high-frequency convergence and improve overall performance.

Neural spectral bias denotes the empirical and theoretically grounded tendency of neural networks, especially those trained via gradient-based optimization, to fit low-frequency (smooth or low-complexity) components of a target function far more rapidly than high-frequency (oscillatory or complex) ones. The bias has been reported in supervised regression, reinforcement learning, physics-informed learning, operator learning, discrete-input modeling, and modern deep networks for image and time-series prediction. While neural architectures are formally universal function approximators, spectral bias induces a highly anisotropic effective prior during optimization: it privileges smoothness and makes high-frequency components much slower to fit unless explicit architectural or algorithmic interventions are introduced.

1. Theoretical Basis and Formulation of Spectral Bias

Under the neural tangent kernel (NTK) regime, fully connected multilayer perceptrons (MLPs) of infinite width can be linearized around initialization. Training by gradient descent on a sample set $\{x_i, y_i\}$ with squared loss yields an update rule in function space,

$f_t(\cdot) - f^*(\cdot) = (I - \eta K)^t [f_0(\cdot) - f^*(\cdot)],$

where $f^*$ is the target function, $\eta$ is the learning rate, and $K(x, x') = \mathbb{E}_\theta [\nabla_\theta f(x;\theta) \cdot \nabla_\theta f(x';\theta)]$ is the NTK evaluated at initialization. Diagonalizing $K$ with eigenvalues $\{\lambda_k\}$ and corresponding basis functions $\{\phi_k\}$, the error in the $k$-th mode decays as $|1 - \eta\lambda_k|^t$. For standard activations such as ReLU or Tanh on typical data domains, the NTK’s eigenvalues decay rapidly with the frequency index, often obeying $\lambda_k \lesssim k^{-(d+1)}$ in $d$-dimensional spaces (Yang et al., 2022, Cao et al., 2019). Consequently, low-frequency components are fit quickly, while the number of steps needed to fit mode $k$ grows roughly as $1/(\eta\lambda_k)$, so high-frequency components persist far longer in training.
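The per-mode decay factor $|1 - \eta\lambda_k|^t$ can be turned into a rough step count. The sketch below assumes an idealized spectrum $\lambda_k = k^{-(d+1)}$ (the bound taken as an equality, which is an assumption made only for illustration) and counts the gradient steps needed to shrink each mode's error below a tolerance. The function name `steps_to_tolerance` and the values of `eta`, `tol`, and `d` are hypothetical choices, not taken from the cited works.

```python
import math

def steps_to_tolerance(lam, eta=0.5, tol=1e-2):
    """Smallest integer t with |1 - eta*lam|**t <= tol."""
    rate = abs(1.0 - eta * lam)
    if rate >= 1.0:
        raise ValueError("eta * lam must lie in (0, 2) for the mode to contract")
    if rate == 0.0:
        return 1  # the mode is removed in a single step
    return math.ceil(math.log(tol) / math.log(rate))

d = 1                                            # assumed input dimension
modes = [1, 2, 4, 8, 16]                         # frequency indices k
eigenvalues = [k ** -(d + 1) for k in modes]     # assumed lambda_k = k^-(d+1)
steps = [steps_to_tolerance(lam) for lam in eigenvalues]

for k, lam, t in zip(modes, eigenvalues, steps):
    print(f"k={k:2d}  lambda_k={lam:.5f}  steps={t}")
```

For small $\eta\lambda_k$ the count is close to $\ln(1/\mathrm{tol}) / (\eta\lambda_k)$, so under this assumed spectrum doubling $k$ multiplies the required steps by about $2^{d+1}$.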

This phenomenon arises not only in continuous Fourier bases but in discrete domains via the Walsh–Hadamard expansion, where the NTK eigenvalues decay with the degree of Boolean Fourier modes, leading to prioritization of low-degree interactions (Gorji et al., 2023). Notably, the precise rate and shape of spectral bias depend on factors including activation function, data geometry, network width and depth, optimizer choice, and input preprocessing (Hong et al., 2022, Dandi et al., 2021, Lazzari et al., 2023).
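For the discrete case, the quantity of interest is how a function's energy is distributed over Walsh–Hadamard modes of each degree. The sketch below is a minimal illustration, not the procedure of Gorji et al.: it computes the normalized Walsh–Hadamard coefficients of a function on $n$-bit inputs and sums the squared coefficients by degree (the number of bits a mode depends on). The helper names `walsh_hadamard`, `energy_by_degree`, and `character`, and the toy function, are hypothetical.

```python
import numpy as np

def walsh_hadamard(values):
    """Normalized fast Walsh-Hadamard transform.

    values[x] is f(x) for the bitstring x read as an integer; the result at
    index s is the coefficient of the mode (-1)**popcount(s & x).
    """
    coeffs = np.asarray(values, dtype=float).copy()
    n = coeffs.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            a = coeffs[start:start + h].copy()
            b = coeffs[start + h:start + 2 * h].copy()
            coeffs[start:start + h] = a + b
            coeffs[start + h:start + 2 * h] = a - b
        h *= 2
    return coeffs / n

def energy_by_degree(coeffs):
    """Sum of squared coefficients grouped by the degree popcount(s)."""
    n_bits = coeffs.size.bit_length() - 1
    energy = np.zeros(n_bits + 1)
    for s, c in enumerate(coeffs):
        energy[bin(s).count("1")] += c ** 2
    return energy

def character(mask, n_bits):
    """Values of the Walsh mode (-1)**popcount(mask & x) for all x."""
    return np.array([(-1) ** bin(x & mask).count("1") for x in range(2 ** n_bits)])

# Toy function on 3 bits: a degree-1 mode with weight 2 plus a degree-2 mode.
f = 2.0 * character(0b100, 3) + 1.0 * character(0b011, 3)
coeffs = walsh_hadamard(f)
energy = energy_by_degree(coeffs)
print(energy)  # energy per degree 0..3
```

Tracking `energy_by_degree` of a model's predictions during training is one way to see whether low-degree modes are being fit before higher-order interactions.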

2. Mechanisms and Architectural Dependence

Spectral bias ultimately results from the combined structure of the network’s implicit kernel, the frequency localization properties of its parameterization, and the optimization trajectory under gradient-based learning.

Kernel Perspective: For input data on a domain such as $[0,1]^d$ or the unit sphere, the NTK or other relevant kernels are typically diagonalized by trigonometric, polynomial, or spherical harmonic bases, with eigenvalues that drop off steeply at high frequencies. In ReLU networks, this emerges directly from piecewise linearity: the network acts as a continuous piecewise linear (CPWL) function, whose Fourier coefficients decay as $\mathcal{O}(\|k\|^{-(d+1)})$ in a generic direction of the frequency vector $k$ (Rahaman et al., 2018).
Activation Function: The spectral response of the learned mapping is tightly linked to the activation function. ReLU and similar monotonic activations impose a strong low-pass bias due to their connection with spline smoothing (higher-order derivatives). For such activations, the network’s singular spectrum exhibits a steep initial drop, reflecting aggressive smoothing of higher derivatives (Lucey, 25 Apr 2025). Non-monotonic activations (e.g., sinc, Gaussian) generate near-orthogonal bases that enable targeted, iteration-efficient high-frequency fitting (Lucey, 25 Apr 2025, Hong et al., 2022).
Layerwise Analysis: Spectral power allocation is not uniform across network depth. Shallower (early) layers contribute proportionally more to high-frequency modes, while deeper layers amplify low frequencies (Dandi et al., 2021). In networks with skip-connections, shallow paths preferentially transmit high-frequency components (Dandi et al., 2021).
Manifold Geometry: For data lying on a nontrivial manifold, the local embedding can transfer high manifold-frequencies to low ambient frequencies, partially correcting the bias and making high-frequency labelings easier to learn when the data geometry is highly curved (Rahaman et al., 2018).
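The piecewise-linear argument in the kernel perspective can be checked numerically in one dimension, where a CPWL function is expected to have Fourier coefficients decaying roughly like $k^{-2}$. The sketch below samples a randomly initialized one-hidden-layer ReLU network on an interval and compares spectral magnitude in a low and a high frequency band. The network size, the sampling interval, the Hann window (used to suppress boundary leakage), and the band limits are all illustrative assumptions.

```python
import numpy as np

def relu_net_1d(x, w, b, a):
    """One-hidden-layer ReLU network sum_j a_j * relu(w_j * x + b_j)."""
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

def windowed_spectrum(values):
    """Magnitude of the DFT after a Hann window, normalized by length."""
    window = np.hanning(values.size)
    return np.abs(np.fft.rfft(values * window)) / values.size

rng = np.random.default_rng(0)
n_samples, n_hidden = 4096, 64
x = np.linspace(-1.0, 1.0, n_samples, endpoint=False)
w = rng.normal(size=n_hidden)
b = rng.normal(size=n_hidden)
a = rng.normal(size=n_hidden) / np.sqrt(n_hidden)

spectrum = windowed_spectrum(relu_net_1d(x, w, b, a))
low_band = spectrum[1:9].mean()       # frequency indices 1..8
high_band = spectrum[64:512].mean()   # frequency indices 64..511
print(f"low-band mean magnitude:  {low_band:.3e}")
print(f"high-band mean magnitude: {high_band:.3e}")
```

Plotting `spectrum` against the frequency index on log-log axes gives a direct view of the decay for a given activation.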

3. Empirical Manifestations and Measurement

Spectral bias has been validated broadly across settings:

Synthetic Regression: In fitting sums of sinusoids, neural nets first reduce the error in low-$k$ coefficients, with the high-$k$ error lagging far behind (Rahaman et al., 2018, Cao et al., 2019).
Image and Signal Domains: Modern CNNs and ViTs trained on image classification exhibit greater within-class low-frequency smoothness and higher between-class frequency content, especially with increased model size, data augmentation, and self-distillation (Fridovich-Keil et al., 2021).
Physics-Informed and PDE learning: Neural surrogates for PDE solutions, e.g., PINNs and neural operators, systematically underfit high-frequency components (quantified via frequency-resolved error, Barron norm, and higher moments), leading to over-smooth solutions except when second-order/quasi-Newton optimizers or spectral-aware losses are used (Khodakarami et al., 22 Feb 2026, Khodakarami et al., 17 Mar 2025).
Discrete Boolean data: Multi-layer MLPs over binary inputs concentrate energy on low-degree Walsh modes while failing on higher-order interactions unless regularized (Gorji et al., 2023).
Time series and sequential domains: Low-frequency prediction is accurate early in training, while high-frequency modes converge much more slowly, regardless of whether the underlying model is a DNN, Transformer, or frequency-domain model (Sun et al., 29 Oct 2025).

Measurement is performed by monitoring the time evolution of per-frequency error, tracking the decay of residuals projected onto the spectral basis (e.g., via DFT or kernel eigenbasis) (Cao et al., 2019, Kammonen et al., 2024).
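A minimal version of this measurement, assuming a one-dimensional signal sampled on a uniform periodic grid, projects the residual onto the DFT basis and reads off the amplitude at chosen frequencies. The function name `per_frequency_error` and the synthetic "prediction" below are hypothetical; in practice the function would be called on model outputs at successive training checkpoints.

```python
import numpy as np

def per_frequency_error(prediction, target, frequencies):
    """Amplitude of the residual at the given integer frequencies.

    Assumes samples on a uniform periodic grid; a residual A*sin(2*pi*k*x)
    with 0 < k < n/2 yields the value A at frequency k.
    """
    residual = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    spectrum = np.fft.rfft(residual)
    return 2.0 * np.abs(spectrum[np.asarray(frequencies)]) / residual.size

n = 256
x = np.arange(n) / n
target = np.sin(2 * np.pi * 1 * x) + np.sin(2 * np.pi * 10 * x)

# Stand-in for a partially trained model: the k=1 component is fit,
# the k=10 component is only 20% recovered.
prediction = np.sin(2 * np.pi * 1 * x) + 0.2 * np.sin(2 * np.pi * 10 * x)

errors = per_frequency_error(prediction, target, [1, 10])
print(errors)  # residual amplitude at k=1 and k=10
```

Recording `errors` at each checkpoint produces the per-frequency training curves described above.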

4. Approaches to Mitigate or Exploit Spectral Bias

Several architectural and algorithmic strategies have been identified:

Fourier and Positional Feature Encodings: Mapping input coordinates through Fourier features (i.e., $\gamma(x) = [\cos(Bx), \sin(Bx)]$ for a frequency matrix $B$ that is random or learned) converts the NTK into a composite kernel with markedly slower eigenvalue decay, enabling orders-of-magnitude faster fitting of high-frequency modes (Yang et al., 2022, Xie et al., 9 Sep 2025). Adaptive random Fourier features and learned frequency mappings further flatten the spectral response (Kammonen et al., 2024, Xie et al., 9 Sep 2025).
Spectral-Aware Regularization: Penalties in the Fourier or Walsh–Hadamard domain (e.g., enforcing sparsity or boosting high-degree coefficients) can force the network to dedicate capacity to high-complexity modes, overcoming the canonical low-frequency preference (Gorji et al., 2023).
Activation Function Engineering: Employing Hat activations or higher-order B-splines, as opposed to ReLU, removes separation between frequency response eigenvalues and eliminates spectral bias, equalizing the convergence rates for all modes (Hong et al., 2022).
Optimizer Choice: First-order optimizers (plain GD and, in some settings, Adam) accentuate spectral bias because the NTK is ill-conditioned. Momentum (SGDM), the variance adaptation in Adam, or second-order (quasi-Newton) methods can narrow the spread of convergence rates, accelerate high-frequency learning, and, in the idealized limit, erase spectral bias altogether (Khodakarami et al., 22 Feb 2026, Farhani et al., 2022).
Loss Modifications: Frequency-domain or binned-spectral-power losses compensate for the MSE’s domination by low modes, directly addressing bias in neural operators and sequence models (Khodakarami et al., 22 Feb 2026, Khodakarami et al., 17 Mar 2025, Sun et al., 29 Oct 2025).
Initialization Schemes: Layer-aware initialization strategies (e.g., ordered SWIM) that align weight scale with expected frequency response preload early layers to encode low-frequency content and later layers to focus on high-frequency structure, matching the expected gradient-dynamics-induced path of learning (Homma et al., 4 Nov 2025).
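Of the strategies above, the Fourier feature encoding is the simplest to sketch. The example below maps inputs through sinusoids of random frequencies before they would be passed to an MLP. It is an illustration under stated assumptions, not the implementation of any cited work: the Gaussian sampling of the frequency matrix `B`, the bandwidth `sigma`, the $2\pi$ scaling (a common convention that can be absorbed into `B`), and the feature count are all hypothetical choices.

```python
import numpy as np

def fourier_features(x, B):
    """Map inputs x of shape (n, d) to [cos(2*pi*x@B), sin(2*pi*x@B)].

    B has shape (d, m); the output has shape (n, 2*m).
    """
    projection = 2.0 * np.pi * np.asarray(x) @ np.asarray(B)
    return np.concatenate([np.cos(projection), np.sin(projection)], axis=-1)

rng = np.random.default_rng(0)
sigma = 10.0                               # assumed bandwidth of the encoding
B = rng.normal(scale=sigma, size=(2, 64))  # random frequency matrix, d=2, m=64
x = rng.uniform(size=(5, 2))               # five points in the unit square

features = fourier_features(x, B)
print(features.shape)
```

A larger `sigma` exposes higher input frequencies to the network, while a `sigma` that is too large for the target can encourage fitting noise, so the bandwidth is usually treated as a tuning parameter.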

5. Implications in Practice and Scientific Learning

Spectral bias has substantial implications for:

Generalization: The initial preference for smooth (low-frequency) fits aligns with canonical connections between function complexity and generalization ability. Spectral bias enhances robustness to small perturbations and prevents overfitting high-frequency noise, but can limit resolution of fine-scale details (Fridovich-Keil et al., 2021, Gorji et al., 2023).
Sample Efficiency and Stability: In reinforcement learning, value approximators with spectral bias converge slowly on high-frequency components essential for tasks with long horizons or complex dynamics, impacting sample efficiency and algorithmic stability. Fourier-feature augmentations or other mitigation methods can yield substantial acceleration and stability gains (Yang et al., 2022).
Representational Efficiency and Compression: In weight-generating implicit networks, breaking spectral bias (via spectral bandlimiting or adaptive encoding) enables high-parameter compression without loss of high-frequency content (Xie et al., 9 Sep 2025).
Interpretability and Data Geometry: Spectral analysis uncovers relationships between learned function complexity, data manifold structure, and inductive bias, informing architecture design in domains from vision to PDE solvers (Rahaman et al., 2018, Dandi et al., 2021).
Optimization and Training Curves: The non-monotonic evolution of spectral bias under deep double descent exposes the link between on-manifold memorization, off-manifold flattening, and generalization, which offers a practical signal for early stopping and validation-free monitoring (Zhang et al., 2020).

6. Limitations, Variants, and Open Directions

Spectral bias is robust across architectures and domains but not immutable:

Monotonicity can fail: At late stages of overparametrized network training, the high-frequency content of the learned function may decrease again, driven by implicit off-manifold regularization (Zhang et al., 2020).
Data manifold complexity, depth, and architecture modulate the effective bias (Rahaman et al., 2018, Dandi et al., 2021).
There exist activation/loss/optimization regimes (e.g., Hat activation, second-order optimization) under which spectral bias is erased (Hong et al., 2022, Khodakarami et al., 22 Feb 2026).
In high-dimensional tasks requiring fine-grained discrimination (e.g., scientific operator learning, robust classification against adversaries), careful engineering of encoding, loss, architecture, and optimization is required to systematically mitigate or exploit spectral bias for task-specific objectives (Khodakarami et al., 22 Feb 2026, Kammonen et al., 2024, Gorji et al., 2023).

Ongoing research aims to characterize theoretically the convergence of regularized or adaptive networks under the NTK, to extend the analysis to structured data and graph domains, and to develop principled spectral-norm-aware architectural design.

7. Summary Table: Key Mechanisms and Mitigation Strategies

Mechanism/Component	Spectral Bias Effect	Mitigation/Exploitation Approach
ReLU/monotonic activations	Strong low-frequency preference (algebraic decay)	Hat/spline activations for uniform convergence
Gradient-based (GD, Adam)	Much slower learning for high-freq modes	Momentum, quasi-Newton optimizers flatten decay
Standard kernel/L2 loss	MSE dominated by low modes	Frequency-aware or BSP loss
Input as raw coordinates	Exponential suppression in low-dim MLPs	Fourier features, positional encoding
Shallow (early) network layers	Amplify high-frequency fitting	Selective fine-tuning for high-freq tasks
Walsh-Hadamard spectrum	Low-degree bias (discrete input spaces)	Spectral L1/Hadamard regularization
Data manifold complexity	High curvature transmutes ambient low-k to manifold high-k	Data augmentation/embedding