Learnable Fourier Features Overview
- Learnable Fourier features are adaptive basis functions that extend classical random Fourier features by optimizing frequency parameters via gradient descent.
- They enable scalable kernel approximations and efficient high-dimensional feature construction, reducing computational complexity using tensorized and Bayesian methods.
- Applications span scientific machine learning, vision transformers, reinforcement learning, and robust continual learning, yielding strong empirical performance and noise robustness.
Learnable Fourier features are parameterized families of basis functions inspired by the Fourier domain, with frequencies and/or composition parameters that are adaptively optimized alongside (or within) models. This approach generalizes classical random Fourier features, which select fixed random frequencies, by making frequency components, weightings, or the entire Fourier basis learnable through gradient-based or closed-form optimization. Learnable Fourier features are central to high-dimensional kernel approximation, scalable Gaussian process surrogates, scientific machine learning, deep reinforcement learning, spatial inductive biases in vision transformers, robust continual learning, and robustification against noise or spectral bias. Key algorithms address the curse of dimensionality, adapt to data-dependent spectra, and provide strong empirical and theoretical performance gains over fixed-basis methods.
1. Mathematical Foundations and Spectrum Parameterization
Learnable Fourier features are motivated by Bochner's theorem, which guarantees that any continuous shift-invariant positive-semidefinite kernel can be expressed as an expectation over a Fourier basis:
$$k(x - x') = \mathbb{E}_{\omega \sim p(\omega)}\big[\cos\big(\omega^\top (x - x')\big)\big].$$
Traditional random Fourier features (RFF) (Băzăvan et al., 2012, Letarte et al., 2018) use $\phi_\omega(x) = \sqrt{2}\,\cos(\omega^\top x + b)$ with frequencies $\omega$ drawn from $p(\omega)$ and phases $b$ drawn uniformly over $[0, 2\pi]$.
Learnable Fourier feature frameworks make aspects of this basis adaptive. Example parameterizations include:
- Learnable frequency matrices: introducing a trainable frequency matrix $W_r \in \mathbb{R}^{m \times d}$, so that $\phi(x) = [\cos(W_r x)\,;\,\sin(W_r x)]$, with $W_r$ initialized randomly and updated by backpropagation (Li et al., 2021, Li et al., 2021, Lewandowski et al., 27 Oct 2024); see the sketch following this list.
- Sinusoidal layers with learned amplitudes and phases: for instance, in Fourier Learning Machines (FLMs), sub-units of the form $a_k \cos(\omega_k^\top x + \varphi_k)$ with amplitudes $a_k$, frequencies $\omega_k$, and phases $\varphi_k$ learned by gradient descent (Rubel et al., 10 Sep 2025).
- PAC-Bayesian posterior weighting: Frequencies are assigned posterior weights based on empirical alignment with data, leading to a data-dependent spectrum (Letarte et al., 2018).
- Low-rank tensorized multi-dimensional expansions: For high-dimensional problems, deterministic quadrature grids are used per axis, and the resulting weight tensor is stored in a rank-$R$ CP decomposition to control parameter growth (Wesel et al., 2021).
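A minimal sketch of the first parameterization above (a trainable frequency matrix), assuming a PyTorch implementation; the dimensions, initialization scale, and downstream head are illustrative rather than taken from any specific paper:

```python
import torch
import torch.nn as nn

class LearnableFourierFeatures(nn.Module):
    """Fourier feature map phi(x) = [cos(xW^T); sin(xW^T)] with a trainable W."""
    def __init__(self, in_dim: int, n_freqs: int, init_scale: float = 1.0):
        super().__init__()
        # W is initialized like classical RFF (Gaussian frequencies) and then
        # updated by backpropagation together with the downstream model.
        self.W = nn.Parameter(init_scale * torch.randn(n_freqs, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = x @ self.W.t()                                   # (batch, n_freqs)
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

# Usage: the feature map feeds a linear head; gradients flow into W as well.
phi = LearnableFourierFeatures(in_dim=3, n_freqs=64)
head = nn.Linear(128, 1)
y = head(phi(torch.randn(8, 3)))
```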
2. Feature Construction and Optimization
Learnable Fourier features can be constructed and trained via several algorithmic workflows:
2.1 Direct Feature Learning by Gradient Descent
Most deep learning instantiations place a trainable Fourier feature or sinusoidal layer at the network input or at every hidden layer, e.g., deep Fourier features (Lewandowski et al., 27 Oct 2024). These layers perform
$$\phi(x) = [\sin(Wx)\,;\,\cos(Wx)],$$
with the frequency matrix $W$ optimized end-to-end. In scientific ML, FLMs learn frequencies, amplitudes, and phases across a bank of cosine-activated sub-units (Rubel et al., 10 Sep 2025).
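A minimal sketch of such a per-layer construction, assuming each hidden layer replaces its pointwise activation with a sin/cos concatenation of the preactivations (so the output width is twice the preactivation width); layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class DeepFourierBlock(nn.Module):
    """Linear layer followed by a sin/cos concatenation instead of a ReLU."""
    def __init__(self, in_dim: int, width: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)
        # Concatenating sin and cos doubles the width: one branch is always
        # close to linear locally, which helps preserve gradient flow.
        return torch.cat([torch.sin(z), torch.cos(z)], dim=-1)

net = nn.Sequential(
    DeepFourierBlock(in_dim=32, width=64),   # output width 128
    DeepFourierBlock(in_dim=128, width=64),  # output width 128
    nn.Linear(128, 10),
)
logits = net(torch.randn(4, 32))
```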
2.2 Low-Rank Tensorized Feature Learning
In kernel settings, deterministic quadrature (e.g., eigenfunction expansions) gives $M$ features per input axis, so the tensor-product feature map lives in $\mathbb{R}^{M^D}$; the model weights are parameterized as a rank-$R$ CP tensor, vastly reducing storage and computation from $O(M^D)$ to $O(DMR)$. Fitting is performed by block-coordinate descent (ALS), cycling through the factor matrices, each solved as a regularized least-squares problem (Wesel et al., 2021).
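To make the structure concrete, the sketch below evaluates a model whose weight tensor over a tensor-product basis is stored in CP form; the per-axis cosine basis, shapes, and helper names are illustrative stand-ins, and the ALS fitting loop described in (Wesel et al., 2021) is omitted:

```python
import numpy as np

def per_axis_features(x_d: np.ndarray, M: int) -> np.ndarray:
    """Illustrative deterministic basis: M cosine features for one input axis."""
    freqs = np.arange(1, M + 1)                          # (M,)
    return np.cos(np.outer(x_d, freqs))                  # (N, M)

def cp_model_predict(X: np.ndarray, factors: list) -> np.ndarray:
    """f(x) = <W, phi_1(x_1) x ... x phi_D(x_D)> with W stored as a rank-R CP tensor.

    factors[d] has shape (M, R); the full weight tensor (M**D entries) is never
    formed, so the cost is O(N * D * M * R) instead of O(N * M**D).
    """
    N, D = X.shape
    R = factors[0].shape[1]
    out = np.ones((N, R))
    for d in range(D):
        Phi_d = per_axis_features(X[:, d], factors[d].shape[0])  # (N, M)
        out *= Phi_d @ factors[d]                                # (N, R)
    return out.sum(axis=1)

# Example: D=6 inputs, M=20 basis functions per axis, CP rank R=5.
D, M, R = 6, 20, 5
factors = [np.random.randn(M, R) for _ in range(D)]
y = cp_model_predict(np.random.rand(100, D), factors)    # (100,)
```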
2.3 Bayesian/PAC-Bayesian Frequency Weighting
By interpreting the Fourier spectrum $p(\omega)$ as a prior, a (pseudo-)posterior $Q(\omega)$ over frequencies is optimized under a PAC-Bayesian generalization bound; features are then sampled or re-weighted according to $Q$, yielding kernels that are maximally aligned with the data (Letarte et al., 2018).
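The sketch below illustrates the general idea of reweighting a fixed pool of random frequencies with a Gibbs-style pseudo-posterior based on an empirical alignment score; the specific alignment measure and temperature `beta` are illustrative stand-ins, not the exact quantities analyzed in (Letarte et al., 2018):

```python
import numpy as np

def frequency_pseudo_posterior(X, y, omegas, beta=10.0):
    """Reweight frequencies by how well each cosine feature aligns with labels.

    X: (N, d) inputs, y: (N,) labels in {-1, +1}, omegas: (K, d) sampled frequencies.
    Returns Q: (K,) pseudo-posterior weights summing to 1.
    """
    diffs = X[:, None, :] - X[None, :, :]               # (N, N, d) pairwise differences
    yy = np.outer(y, y)                                  # (N, N) label agreement
    scores = np.empty(len(omegas))
    for k, w in enumerate(omegas):
        # Alignment of the rank-one kernel cos(w.(x - x')) with y y^T.
        scores[k] = np.mean(np.cos(diffs @ w) * yy)
    # Gibbs pseudo-posterior over the frequency pool (uniform prior).
    logits = beta * scores
    Q = np.exp(logits - logits.max())
    return Q / Q.sum()

# Usage: sample or re-weight Fourier features according to Q.
X, y = np.random.randn(50, 2), np.sign(np.random.randn(50))
omegas = np.random.randn(200, 2)
Q = frequency_pseudo_posterior(X, y, omegas)
```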
2.4 Multimodal or High-Dimensional Adaptations
For spatially structured or multi-modal data, learnable Fourier features are combined with MLPs (to generalize nonlinearly) (Li et al., 2021), or are fused with spatial prompts and frequency-domain prompts through FFT for robust multi-modal tracking (Yang et al., 24 Sep 2025).
3. Theoretical Properties, Regularization, and Robustness
3.1 Approximation Error Rates and Curse of Dimensionality
Random Fourier features achieve Monte Carlo convergence at rate $O(1/\sqrt{M})$ in the number of features $M$, whereas deterministic (quadrature/eigenfunction) constructions can achieve exponentially decaying error in the number of per-axis basis functions (for smooth kernels or functions). However, taking a tensor product over $D$ axes makes naive parameterizations intractable ($O(M^D)$ coefficients). The CP representation resolves this, preserving accuracy with $O(DMR)$ parameters and complexity linear in the dimension (Wesel et al., 2021).
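As a concrete illustration of the scaling, with hypothetical values of $M$, $D$, and $R$:
$$M = 20,\; D = 10,\; R = 10:\qquad M^D = 20^{10} \approx 1.0 \times 10^{13}\ \text{coefficients, versus}\ D\,M\,R = 2000\ \text{CP parameters.}$$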
3.2 Spectrum Control and Functional Regularization
Learnable Fourier layers enable explicit frequency selectivity: the initialization variance $\sigma^2$ of the entries of the frequency matrix $W$ controls the induced kernel smoothness; smaller $\sigma$ emphasizes low frequencies (regularization against overfitting), while larger $\sigma$ enables fitting higher frequencies (with potential for overfitting noisy components). The induced NTK can be computed exactly in simple settings, and tuning $\sigma$ provides direct spectral regularization (Li et al., 2021).
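A small numerical illustration of this bandwidth effect: with Gaussian frequencies of standard deviation $\sigma$, the induced (expected) kernel is a Gaussian RBF whose lengthscale shrinks as $\sigma$ grows, so smaller $\sigma$ yields a smoother prior. The Monte Carlo check below is a generic sketch, not tied to any specific paper's code:

```python
import numpy as np

def induced_kernel(delta, sigma, n_freqs=20000, d=1, seed=0):
    """Monte Carlo estimate of E_w[cos(w . delta)] with w ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    W = sigma * rng.standard_normal((n_freqs, d))
    return np.cos(W @ delta).mean()

delta = np.array([1.0])
for sigma in (0.5, 1.0, 4.0):
    mc = induced_kernel(delta, sigma)
    exact = np.exp(-0.5 * (sigma * np.linalg.norm(delta)) ** 2)  # Gaussian kernel
    print(f"sigma={sigma}: MC={mc:.3f}, exact={exact:.3f}")
# Smaller sigma -> kernel decays slowly (smooth functions favored); larger sigma ->
# fast decay (higher-frequency content can be fit, with risk of fitting noise).
```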
3.3 Sparsity and Robustness via Diagonal Linear Layers
Adding a trainable diagonal scaling after the Fourier embedding yields networks that learn sparse and robust representations of the relevant (nonlinear) Fourier modes, even in the presence of noise (Jeong et al., 3 Sep 2024). The feature-selection behavior arises from the gradient-flow dynamics under $\ell_2$ regularization, which effectively impose $\ell_1$-like implicit regularization on the Fourier coefficients.
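A minimal sketch of this construction: a Fourier embedding (fixed here for simplicity) followed by a trainable diagonal layer whose entries act as per-frequency gates; training with weight decay then tends to drive the gates of uninformative frequencies toward zero. Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class DiagonalFourierNet(nn.Module):
    """Fourier embedding -> trainable diagonal scaling -> linear readout."""
    def __init__(self, in_dim: int, n_freqs: int):
        super().__init__()
        self.register_buffer("W", torch.randn(n_freqs, in_dim))   # fixed frequencies
        self.scale = nn.Parameter(torch.ones(2 * n_freqs))        # per-feature gates
        self.readout = nn.Linear(2 * n_freqs, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = x @ self.W.t()
        feats = torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        return self.readout(self.scale * feats)

model = DiagonalFourierNet(in_dim=1, n_freqs=256)
# Weight decay on `scale` and `readout` encourages sparsity over Fourier modes.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```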
3.4 Plasticity and Trainability
Deep Fourier features (sin/cos concatenation in each layer) ensure that at every layer at least one branch operates in an almost linear regime locally, preserving gradient flow across deep compositions and providing robust plasticity in continual learning scenarios. Theoretical results show such architectures preserve trainability and avoid catastrophic forgetting across tasks (Lewandowski et al., 27 Oct 2024).
4. Applications Across Learning Paradigms
4.1 Large-Scale Kernel Learning
Tensorized deterministic Fourier features with CP-decomposed weights permit large-scale kernel ridge regression or GP approximation at linear complexity in sample size and dimension (Wesel et al., 2021).
4.2 Scientific Machine Learning and Signal Representation
FLMs construct fully learnable multidimensional Fourier series and outperform RFF and deep MLPs for scientific computation (PDEs, optimal control problems), achieving 1–2 orders of magnitude lower error in regression and control tasks (Rubel et al., 10 Sep 2025).
4.3 Reinforcement Learning and Value Approximation
Learned Fourier features as input layers in deep RL (e.g., SAC) yield controlled spectral bias and improved sample efficiency, faster convergence, and greater stability under noisy bootstrapping and nonstationarity (Li et al., 2021).
4.4 Positional Encoding and Vision Transformers
In spatial transformers, learnable Fourier features serve as positional encodings, endowing models with parameter-efficient, shift-invariant, and continuous representations that generalize beyond fixed-size tables. Learnable Fourier+MLP encodings outperform fixed sinusoids and lookup tables on image generation, object detection, and classification (Li et al., 2021). Visual Fourier prompts, computed via FFT of learnable tokens, inject frequency and spatial priors into frozen ViT backbones for robust RGB-T tracking (Yang et al., 24 Sep 2025).
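The sketch below follows the general recipe of a learnable Fourier feature positional encoding for multi-dimensional positions (a trainable linear projection, sin/cos, then a small MLP); the layer sizes, normalization, and grid construction are illustrative rather than the exact configuration of (Li et al., 2021):

```python
import torch
import torch.nn as nn

class FourierPositionalEncoding(nn.Module):
    """Continuous positional encoding: positions -> learnable Fourier features -> MLP."""
    def __init__(self, pos_dim: int = 2, n_freqs: int = 32, d_model: int = 256):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_freqs, pos_dim))  # learnable frequencies
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, pos: torch.Tensor) -> torch.Tensor:
        # pos: (..., pos_dim) continuous coordinates, e.g. normalized (row, col).
        proj = pos @ self.W.t()
        feats = torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        return self.mlp(feats / (2 * self.W.shape[0]) ** 0.5)

# Usage: add to the patch embeddings of a ViT for a 14x14 grid of patches.
pe = FourierPositionalEncoding()
rows, cols = torch.meshgrid(torch.linspace(0, 1, 14), torch.linspace(0, 1, 14), indexing="ij")
grid = torch.stack([rows, cols], dim=-1).reshape(-1, 2)   # (196, 2)
pos_emb = pe(grid)                                         # (196, 256)
```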
4.5 Continual and Nonstationary Learning
Replacing standard activations with deep Fourier features in all layers ensures high plasticity and adaptability, preventing trainability loss in nonstationary continual learning on standard benchmarks (e.g., CIFAR-100, Tiny-ImageNet) (Lewandowski et al., 27 Oct 2024).
4.6 Bayesian and Kernel Alignment Learning
PAC-Bayesian learning of RFF frequencies supports kernel selection and approximation with data-adaptive spectrum, providing generalization guarantees and compact feature sets (Letarte et al., 2018).
5. Empirical Performance and Benchmarking
Learnable Fourier feature models display strong empirical advantages:
- Faster convergence and improved accuracy: In Reformer and ViT, learnable Fourier+MLP encodings yield better bits-per-dimension and top-1 accuracy than all fixed or table-based baselines, with statistical significance across ablations (Li et al., 2021).
- Scalability and efficiency: Tensorized feature methods perform large-scale KRR with millions of samples and high input dimension, outperforming both RFF and variational GP surrogates in regression and classification (Wesel et al., 2021).
- Superior noise-robustness: Diagonal Fourier-feature networks automatically discover and suppress spurious non-informative frequencies in noisy regression, reducing error relative to fixed RFF by a factor of 2–5 (Jeong et al., 3 Sep 2024).
- Continual learning generalization: Deep Fourier architectures drastically improve final task accuracy under class-incremental and label-noise protocols relative to ReLU nets (Lewandowski et al., 27 Oct 2024).
- Improved tracking robustness: Visual Fourier prompt fusion in VFPTrack improves success rates under occlusion and low-resolution conditions on multiple RGB-T tracking benchmarks, with strongest gains in cross-modal fusion settings (Yang et al., 24 Sep 2025).
6. Limitations, Open Directions, and Comparative Analysis
Limitations include:
- Dimensionality and Memory Tradeoffs: While low-rank tensorization enables scalable deterministic Fourier feature models, selection of CP rank is critical and may indirectly impact expressivity (Wesel et al., 2021).
- Spectral Overfitting: Large bandwidth or unconstrained frequency adaptation may permit overfitting to high-frequency noise or spurious signals, highlighting the need for spectral or norm regularization (Li et al., 2021, Jeong et al., 3 Sep 2024).
- Parameter Doubling in Deep Fourier Layers: Concatenation of sin/cos features doubles the width per layer, so model width must be halved to keep parameter counts comparable to baselines (Lewandowski et al., 27 Oct 2024).
- Theory-Practice Gaps: Theoretical analyses often focus on simplified or linearized models; the precise interplay of optimization and Fourier feature selection in deep nonlinear architectures remains partly unresolved.
In summary, learnable Fourier features constitute a unifying framework underpinning scalable, adaptive spectral representations in both kernel and neural architectures, with broad applicability across contemporary ML tasks and evidence-supported advantages over fixed-basis and random-feature approaches (Wesel et al., 2021, Lewandowski et al., 27 Oct 2024, Rubel et al., 10 Sep 2025, Li et al., 2021, Li et al., 2021, Jeong et al., 3 Sep 2024, Letarte et al., 2018, Yang et al., 24 Sep 2025).