
Random Fourier Features: Scalable Kernel Methods

Updated 24 December 2025
  • Random Fourier Features are a technique that approximates shift-invariant kernels using random finite-dimensional embeddings derived from Bochner’s theorem.
  • They achieve a convergence rate of O(m^{-1/2}) with proven nonasymptotic error bounds, ensuring efficient and reliable kernel approximations.
  • Extensions such as adaptive sampling, operator-valued mappings, and deep architectures enhance their applicability in scalable learning tasks like SVMs and Gaussian processes.

Random Fourier Features (RFF) provide a framework for scalable, explicit feature map construction that enables approximate evaluation of kernel methods for large-scale statistical learning and signal processing. By leveraging Bochner's theorem, RFF enable the approximation of positive-definite, shift-invariant kernels using randomized finite-dimensional embeddings, replacing expensive Gram matrix computations with efficient inner products in Euclidean space. The technique generalizes to a wide range of kernels, operator-valued settings, sequence kernels, and even to adaptive and data-dependent variants.

1. Mathematical Foundation and Construction

RFF are underpinned by Bochner's theorem, which states that any continuous, shift-invariant, positive-definite kernel $k(x, y) = k(x - y)$ on $\mathbb{R}^d$ admits the spectral representation

$$k(x, y) = \int_{\mathbb{R}^d} e^{i \omega^\top (x - y)} \, p(\omega) \, d\omega,$$

where $p(\omega)$ is a nonnegative spectral density. Sampling $\{\omega_j\}_{j=1}^m$ i.i.d. from $p(\omega)$ and $b_j \sim \mathrm{Uniform}[0, 2\pi]$, one constructs the random feature map

$$\phi(x) = \sqrt{\tfrac{2}{m}} \left[ \cos(\omega_1^\top x + b_1), \dots, \cos(\omega_m^\top x + b_m) \right]^\top,$$

yielding the unbiased kernel estimator $k(x, y) \approx \phi(x)^\top \phi(y)$. The convergence rate is $O(m^{-1/2})$ in the number of features, with formal guarantees in both uniform and $L^r$ norms, and these rates are tight in general (Sriperumbudur et al., 2015).
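As an illustration, the following minimal NumPy sketch (assuming a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, whose spectral density is Gaussian with covariance $\sigma^{-2} I$; the helper name and toy data are illustrative) constructs the feature map and compares the RFF estimate against the exact kernel.

```python
import numpy as np

def rff_features(X, m, sigma=1.0, rng=None):
    """Random Fourier features for the Gaussian (RBF) kernel.

    Frequencies are drawn from the kernel's spectral density N(0, sigma^{-2} I),
    phases uniformly from [0, 2*pi].
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, m))   # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)            # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ omega + b)

# Compare the RFF estimate with the exact Gaussian kernel on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sigma = 1.5
Z = rff_features(X, m=2000, sigma=sigma, rng=1)
K_rff = Z @ Z.T

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))

print("max abs error:", np.abs(K_rff - K_exact).max())
```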

For isotropic kernels $k(x, y) = k(\|x - y\|)$, RFF generalizes through spectral mixture representations: the spectral measure of the kernel can often be expressed as a scale mixture of stable laws, enabling efficient sampling for a broad class of kernels beyond the Gaussian, such as generalized Cauchy, Matérn, Beta, Kummer, and Tricomi kernels (Langrené et al., 5 Nov 2024).
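As a concrete instance of the scale-mixture viewpoint, the sketch below is a hedged illustration assuming the common parameterization in which the Matérn-$\nu$ kernel's spectral density is a multivariate Student-$t$ with $2\nu$ degrees of freedom; the helper name and parameter choices are not from the cited work. It draws frequencies as Gaussians rescaled by a chi-square mixing variable.

```python
import numpy as np

def matern_rff_frequencies(d, m, nu=1.5, lengthscale=1.0, rng=None):
    """Sample RFF frequencies for a Matern-nu kernel.

    Assumes the spectral density is a multivariate Student-t with 2*nu degrees
    of freedom, i.e. a Gaussian scale mixture:
        omega = z * sqrt(2*nu / u) / lengthscale,
    with z ~ N(0, I_d) and u ~ chi^2_{2*nu}.
    """
    rng = np.random.default_rng(rng)
    z = rng.normal(size=(m, d))
    u = rng.chisquare(df=2.0 * nu, size=(m, 1))
    return z * np.sqrt(2.0 * nu / u) / lengthscale

# These frequencies plug directly into the cosine feature map from Section 1.
omega = matern_rff_frequencies(d=3, m=500, nu=2.5, lengthscale=0.8, rng=0)
```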

2. Approximation Properties, Rates, and Error Estimation

Several works have established fine-grained, nonasymptotic generalization error bounds for RFF-based learning. The error in approximating the kernel over a data domain $S$ satisfies

$$\| \hat{k} - k \|_{S \times S} \leq C \, \frac{\sqrt{\ln |S|}}{\sqrt{m}},$$

where $m$ is the feature count (Sriperumbudur et al., 2015). Precise rates for function and derivative approximations, including higher-order derivatives of shift-invariant kernels, match this $O(m^{-1/2})$ scaling (Szabo et al., 2018). Error propagation into downstream kernel methods, such as kernel ridge regression and SVMs, can be numerically estimated via a bootstrap approach that avoids overly conservative worst-case bounds (Yao et al., 2023).
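For intuition, the small numerical check below (a sketch with arbitrary toy data and a Gaussian kernel of bandwidth $\sigma = 1$) tracks the maximum kernel approximation error as $m$ grows; the product of the error and $\sqrt{m}$ stays roughly constant, consistent with the $O(m^{-1/2})$ rate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))

for m in [100, 400, 1600, 6400]:
    # Draw frequencies from the Gaussian spectral density and build features.
    omega = rng.normal(scale=1.0 / sigma, size=(4, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Z = np.sqrt(2.0 / m) * np.cos(X @ omega + b)
    err = np.abs(Z @ Z.T - K_exact).max()
    print(f"m={m:5d}  max error={err:.4f}  err*sqrt(m)={err * np.sqrt(m):.3f}")
```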

Empirical risk minimization in the RFF-embedded space enjoys risk convergence at the minimax $O(n^{-1/2})$ rate, with the minimax-optimal number of features scaling as $m = O(\sqrt{n} \log n)$ for plain RFF (Li, 2021, Li et al., 2018). Leverage- or task-adaptive sampling schemes can further reduce the required $m$, with problem-dependent minimums as low as $O(1)$ in favorable finite-rank cases (Li, 2021, Li et al., 2018).

3. Adaptive and Data-dependent Frequency Sampling

Conventional RFF draws frequencies from the global spectral density, but optimal rates are achieved when the sampling density reflects the target function's spectral content. Adaptive schemes learn or resample frequency distributions during training, focusing computational resources on "important" frequency bands. This is formalized variationally: the optimal density is proportional to the modulus of the target's Fourier coefficients, $p_*(\omega) \propto |\hat{f}(\omega)|$ (Kammonen et al., 2020, Huang et al., 3 Sep 2025).

Algorithms employing Metropolis-Hastings sampling update frequency supports dynamically, accepting new proposals in proportion to the relative amplitudes $|\hat{\beta}'_k|^\gamma / |\hat{\beta}_k|^\gamma$ (Kammonen et al., 2020). Asymptotically, the empirical frequency histogram converges to the ideal density $p_*$. Convergence guarantees and computational reductions are rigorously established for both regression and classification, with empirical results on standard tasks (e.g., MNIST) confirming faster convergence and improved accuracy relative to classical RFF (Kiessling et al., 2021, Huang et al., 3 Sep 2025, Kammonen et al., 2020).
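The following is a minimal sketch of the idea rather than the exact algorithm of the cited works: amplitudes are refit by ridge-stabilized least squares, frequencies are perturbed, and each proposal is accepted with probability min(1, |β'|^γ / |β|^γ). The proposal width, exponent, regularizer, and iteration count are illustrative choices.

```python
import numpy as np

def fit_amplitudes(X, y, omega, b, lam=1e-6):
    """Least-squares amplitudes for the current cosine features (ridge-stabilized)."""
    Z = np.cos(X @ omega.T + b)                       # (n, m) features
    A = Z.T @ Z + lam * np.eye(len(b))
    return np.linalg.solve(A, Z.T @ y)

def adaptive_rff(X, y, m=64, iters=200, delta=0.2, gamma=3.0, rng=None):
    """Adaptive RFF: Metropolis-style resampling of frequencies guided by amplitudes."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    omega = rng.normal(size=(m, d))
    b = rng.uniform(0, 2 * np.pi, size=m)
    beta = fit_amplitudes(X, y, omega, b)
    for _ in range(iters):
        omega_prop = omega + delta * rng.normal(size=(m, d))   # perturb frequencies
        beta_prop = fit_amplitudes(X, y, omega_prop, b)
        ratio = (np.abs(beta_prop) / (np.abs(beta) + 1e-12)) ** gamma
        accept = rng.uniform(size=m) < np.minimum(1.0, ratio)  # per-frequency accept
        omega[accept] = omega_prop[accept]
        beta = fit_amplitudes(X, y, omega, b)
    return omega, b, beta

# Toy 1-D regression target with a dominant frequency band.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(4.0 * X[:, 0]) + 0.1 * rng.normal(size=500)
omega, b, beta = adaptive_rff(X, y)
```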

4. Extensions: Operator-valued, Sequence, and Deep Kernel Representations

RFF generalizes beyond scalar kernels:

  • Operator-valued kernels: The RFF construction extends via a matrix-valued Bochner representation, enabling randomized feature maps for vector-valued RKHS and multitask problems. Uniform approximation and concentration results are established using Bernstein-type inequalities in the matrix-valued case (Brault et al., 2016).
  • Signature kernels for sequences: Tensorized RFF yield unbiased, linear-in-length approximations to (truncated) signature kernels, reducing computational complexity from quadratic to linear in sequence length, while retaining uniform approximation guarantees (Toth et al., 2023).
  • Deep architectures: Layered (deep) RFF allow for learnable, highly expressive (deep) kernels through end-to-end training, with empirical evidence for improved flexibility and performance over shallow kernel methods, especially on small or structured datasets (Xie et al., 2019, Fang et al., 2020).
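Expanding on the last item, the following minimal PyTorch sketch is an illustrative construction, not the specific architecture of the cited works: the RFF map becomes a trainable layer whose frequencies and phases are optimized end to end together with a linear readout.

```python
import math
import torch
import torch.nn as nn

class LearnableRFF(nn.Module):
    """A random-feature layer whose frequencies and phases are trained end to end.

    A hedged sketch of the deep/learnable RFF idea, not a specific published
    architecture: initialized like ordinary RFF for a Gaussian kernel, then
    updated by gradient descent.
    """
    def __init__(self, d_in, m, sigma=1.0):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(d_in, m) / sigma)
        self.bias = nn.Parameter(2 * math.pi * torch.rand(m))
        self.scale = (2.0 / m) ** 0.5

    def forward(self, x):
        return self.scale * torch.cos(x @ self.omega + self.bias)

# Stacking two such layers with a linear readout gives a simple "deep kernel" model.
model = nn.Sequential(LearnableRFF(10, 256), LearnableRFF(256, 256), nn.Linear(256, 1))
```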

5. Implementation and Optimization: Quantization, Low-rank Factorization, and Scalability

Several technical augmentations make RFF even more practical:

  • Quantization: Lloyd–Max (LM-RFF and LM²-RFF) quantization schemes exploit the universal marginal distribution of RFF, achieving nearly full-precision performance with 2- or 3-bit representations, yielding 10–20× reductions in storage/computation (Li et al., 2021).
  • Tensor decompositions: For deterministic (quadrature-based) feature maps and multi-dimensional inputs, low-rank tensor decompositions (e.g., CPD) enable exponential approximation rates while alleviating the curse of dimensionality from $O(\hat{M}^D)$ to $O(D \hat{M}^2 R^2)$ (Wesel et al., 2021).
  • ANOVA/Boosting: Decomposing functions via ANOVA structure allows boosting and interpretability, targeting the RFF model to low-order interactions and improving performance on sparse/high-dimensional functions (Potts et al., 3 Apr 2024).
  • Surrogate leverage sampling: Fast, computationally efficient surrogates for data-dependent leverage-score–weighted feature sampling provide further improvements in accuracy with reduced complexity (Liu et al., 2019).
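As a rough illustration of the quantization item above, the sketch below computes Lloyd-Max levels by plain Lloyd iterations (equivalent to 1-D k-means) on a sample of feature values and encodes RFF entries with 3 bits; the bit width, sample size, and helper names are arbitrary illustrative choices, not the scheme of the cited work.

```python
import numpy as np

def lloyd_max_levels(samples, bits=3, iters=50):
    """1-D Lloyd-Max quantizer levels via plain Lloyd iterations (k-means in 1-D)."""
    k = 2 ** bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)   # quantile initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - levels), axis=1)
        for j in range(k):
            if np.any(idx == j):
                levels[j] = samples[idx == j].mean()
    return np.sort(levels)

# Quantize Gaussian-kernel RFF entries to 3 bits and check the kernel error.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
omega = rng.normal(size=(4, 1000))
b = rng.uniform(0, 2 * np.pi, size=1000)
Z = np.sqrt(2.0 / 1000) * np.cos(X @ omega + b)

levels = lloyd_max_levels(Z.ravel()[:20000], bits=3)
codes = np.argmin(np.abs(Z[..., None] - levels), axis=-1).astype(np.uint8)
Z_q = levels[codes]
print("max kernel deviation after 3-bit quantization:",
      np.abs(Z_q @ Z_q.T - Z @ Z.T).max())
```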

6. Generalizations and Theoretical Extensions

RFF are generalized to:

  • Asymmetric kernels: By leveraging complex measures and extended Bochner representations, RFF constructions can approximate asymmetric and signed kernels beyond the positive-definite (PD) case (He et al., 2022).
  • General isotropic kernels: Any isotropic positive-definite kernel whose profile admits a representation as a scale mixture of $\alpha$-stable laws allows an RFF construction by sampling from the corresponding stable mixture, covering generalized Matérn, Cauchy, Beta, Kummer, and Tricomi kernels (Langrené et al., 5 Nov 2024).
  • Phase transitions and double descent: In the high-dimensional, proportional limit ($n, p, m$ large and comparable), RFF Gram matrices undergo sharp phase transitions in learning error, with double descent behavior as the number of features passes through critical thresholds (Liao et al., 2020).
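To illustrate the last point, the following toy sketch (assumptions: a synthetic regression task, Gaussian-kernel frequencies, and ridgeless least squares as the downstream learner; all sizes are arbitrary) traces test error as the feature count $m$ crosses the interpolation threshold $m \approx n$, where a double-descent peak typically appears.

```python
import numpy as np

# Toy double-descent curve: ridgeless regression on RFF features of varying width m.
rng = np.random.default_rng(0)
n, d = 200, 10
X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(1000, d))
w = rng.normal(size=d)
y_tr = np.tanh(X_tr @ w) + 0.1 * rng.normal(size=n)
y_te = np.tanh(X_te @ w)

def features(X, omega, b):
    return np.sqrt(2.0 / omega.shape[1]) * np.cos(X @ omega + b)

for m in [20, 50, 100, 180, 200, 220, 400, 1000, 4000]:
    omega = rng.normal(size=(d, m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    Z_tr, Z_te = features(X_tr, omega, b), features(X_te, omega, b)
    theta, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)   # min-norm least squares
    err = np.mean((Z_te @ theta - y_te) ** 2)
    print(f"m={m:5d}  test MSE={err:.3f}")   # error typically peaks near m ~ n
```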

7. Applications, Empirical Performance, and Practical Considerations

RFF-based models are deployed in support vector machines, kernel ridge regression, Gaussian processes, wind field interpolation, and high-dimensional sensitivity analysis, among others (Kiessling et al., 2021, Xie et al., 2019, Potts et al., 3 Apr 2024). Empirical studies consistently show RFF and its advanced variants outperforming classical kernel methods and basic interpolators on large, sparse, or structured datasets.

Feature quantization and adaptive sampling offer substantial resource savings with negligible loss in, and sometimes gains in, predictive accuracy. Bootstrap-based error estimation supports real-time adaptive computation, enabling users to balance computational cost and accuracy during model selection or deployment (Yao et al., 2023).
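A minimal sketch of the bootstrap idea follows; it illustrates the general principle rather than the exact procedure of the cited work: resample the $m$ random features with replacement, recompute the kernel estimate for each replicate, and use a quantile of the replicate deviations as a data-driven error proxy.

```python
import numpy as np

def bootstrap_kernel_error(Z, n_boot=200, q=0.95, rng=0):
    """Bootstrap over random features: quantile of the deviation of resampled
    kernel estimates from the full-feature estimate, used as an error proxy."""
    rng = np.random.default_rng(rng)
    m = Z.shape[1]
    K_hat = Z @ Z.T
    deviations = []
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)          # resample features with replacement
        Zb = Z[:, idx]
        deviations.append(np.abs(Zb @ Zb.T - K_hat).max())
    return np.quantile(deviations, q)

# Example: estimated error band for a Gaussian-kernel RFF approximation.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
omega = rng.normal(size=(5, 500))
b = rng.uniform(0, 2 * np.pi, size=500)
Z = np.sqrt(2.0 / 500) * np.cos(X @ omega + b)
print("bootstrap 95% error estimate:", bootstrap_kernel_error(Z))
```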

A common thread across the literature is the duality between approximation (sampling) optimality and computational tractability, achieved by aligning the frequency sampling to the spectral structure of the target and regularization task (Li, 2021, Li et al., 2018, Liu et al., 2019, Huang et al., 3 Sep 2025). These developments anchor RFF as a foundational tool for scalable, versatile, and theoretically robust kernel learning.
