Random Fourier Features: Scalable Kernel Methods
- Random Fourier Features (RFF) are a randomized approximation method that uses Bochner’s theorem to convert shift-invariant kernels into finite-dimensional feature maps.
- They enable the use of linear algorithms for tasks like ridge regression and SVMs by reducing computational costs and memory requirements in high-dimensional or large-scale settings.
- Recent extensions include spectral mixtures, quantization, variance reduction, and deep layer integration, widening their application in scalable and robust kernel learning.
Random Fourier Features (RFF) are a randomized functional approximation technique for shift-invariant kernels, providing scalable and theoretically justified dimensionality reduction for kernel methods. RFFs allow explicit embedding of data into finite-dimensional feature spaces such that inner products approximate kernel evaluations, facilitating the use of linear algorithms for kernel-based learning in high-dimensional or large-scale settings. The foundational idea is based on Bochner’s theorem, which characterizes the Fourier transform relationship between positive-definite, shift-invariant kernels and probability measures. Recent advances have expanded classical RFFs substantially: they now generalize to broad kernel families using scale-mixture spectral representations, enable specialized quantization and compression schemes, address asymmetric and indefinite kernels, and are amenable to deep and end-to-end architectures.
1. Foundations and Classical Construction
Let $k(x,y)=\kappa(x-y)$ be a shift-invariant, positive-definite kernel on $\mathbb{R}^d$. Bochner's theorem guarantees
$$\kappa(x-y)=\int_{\mathbb{R}^d} e^{\,i\,\omega^\top(x-y)}\,p(\omega)\,d\omega,$$
where $p$ is a nonnegative, normalized spectral density. The canonical RFF algorithm samples frequencies $\omega_1,\dots,\omega_s\sim p$ (and optional phases $b_j\sim\mathrm{Unif}[0,2\pi)$) and constructs a feature map
$$z(x)=\sqrt{\tfrac{2}{s}}\,\bigl(\cos(\omega_1^\top x+b_1),\,\dots,\,\cos(\omega_s^\top x+b_s)\bigr)^\top,$$
yielding the unbiased approximation $\mathbb{E}\bigl[z(x)^\top z(y)\bigr]=k(x,y)$ (Sriperumbudur et al., 2015).
This machinery drastically reduces the computational cost of kernel methods, removing the need to explicitly compute or store large Gram matrices: learning algorithms like ridge regression, SVMs, and clustering transition to linear complexity in both $n$ (sample size) and $s$ (feature dimension).
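As a concrete illustration, the following minimal sketch (assuming a Gaussian kernel with bandwidth `sigma`; variable names are illustrative and not taken from the cited works) builds the random feature map and checks that inner products of features approximate exact kernel values:

```python
import numpy as np

def gaussian_rff(X, s, sigma=1.0, rng=None):
    """Map X (n, d) to random Fourier features approximating
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Spectral density of the Gaussian kernel is N(0, I / sigma^2).
    W = rng.standard_normal((s, d)) / sigma          # frequencies omega_j
    b = rng.uniform(0.0, 2 * np.pi, size=s)          # random phases
    return np.sqrt(2.0 / s) * np.cos(X @ W.T + b)    # (n, s) feature matrix

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
Z = gaussian_rff(X, s=2000, sigma=1.5, rng=1)

# Exact Gram matrix vs. its RFF approximation Z Z^T.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * 1.5 ** 2))
print(np.abs(Z @ Z.T - K_exact).max())   # small, and decreasing as s grows
```

Ridge regression or a linear SVM trained on `Z` then costs time linear in $n$ and $s$, rather than quadratic (or worse) in $n$ as with the exact Gram matrix.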
2. Spectral Mixture Generalizations of RFF
The classical RFF algorithm is primarily used with Gaussian kernels due to sampling simplicity. Recent work introduces a unified scheme for isotropic positive-definite kernels on $\mathbb{R}^d$ by expressing spectral densities as scale mixtures of symmetric $\alpha$-stable distributions (Langrené et al., 2024). For any kernel whose radial profile admits a Laplace–Stieltjes representation $\kappa(r)=\int_0^\infty e^{-t r^{\alpha}}\,\mu(dt)$ with $\alpha\in(0,2]$, the spectral density admits the mixture representation
$$p(\omega)\;=\;\int_0^\infty t^{-d/\alpha}\,p_\alpha\!\bigl(t^{-1/\alpha}\omega\bigr)\,\mu(dt), \qquad\text{equivalently}\quad \omega \,\stackrel{d}{=}\, t^{1/\alpha} S,\ \ t\sim\mu,$$
where $S$ is a symmetric $\alpha$-stable random vector with density $p_\alpha$ and $\mu$ is the mixing density recovered by inverse Laplace transform of the radial profile.
This covers classical kernels (Gaussian, Matérn, generalized Cauchy) and newly introduced ones (Beta, Kummer, Tricomi) by specifying the mixing law $\mu$. The mixture construction yields simple, “plug-and-play” frequency sampling algorithms for any such kernel, extending RFF applicability far beyond the Gaussian case.
| Kernel Type | Radial Profile $\kappa(r)$ | Mixing Law $\mu$ |
|---|---|---|
| Gaussian (exponential-power) | $\exp(-r^2/2)$ | Dirac point mass |
| Matérn-$\nu$ | $\frac{2^{1-\nu}}{\Gamma(\nu)}\bigl(\sqrt{2\nu}\,r\bigr)^{\nu}K_\nu\bigl(\sqrt{2\nu}\,r\bigr)$ | Inverse-Gamma |
| Generalized Cauchy | $\bigl(1+r^{\alpha}\bigr)^{-\beta/\alpha}$ | Gamma |
| Kummer / Beta / Tricomi | See (Langrené et al., 2024) for functional forms | Beta distributions, generalized |
Sampling frequencies for RFFs is then achieved via: 1) draw $t \sim \mu$, 2) simulate a symmetric $\alpha$-stable vector $S$ via a scale mixture of Gaussians, and 3) set $\omega = t^{1/\alpha} S$.
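For $\alpha=2$ the stable vector is simply Gaussian and the recipe reduces to a Gaussian scale mixture. The sketch below is a simplified illustration of that special case, not code from (Langrené et al., 2024); it assumes the standard Matérn parameterization with the $\sqrt{2\nu}$ scaling and samples Matérn-$\nu$ frequencies through their inverse-Gamma mixing law.

```python
import numpy as np

def matern_rff_frequencies(s, d, nu=1.5, lengthscale=1.0, rng=None):
    """Sample s frequency vectors for the Matern-nu kernel.

    The Matern spectral density is a multivariate Student-t with 2*nu
    degrees of freedom, i.e. a Gaussian scale mixture whose variance is
    inverse-Gamma distributed (the alpha = 2 case of the recipe above)."""
    rng = np.random.default_rng(rng)
    # Step 1: draw the mixing variable; chi^2_{2nu} / (2nu) is Gamma, and
    # its reciprocal gives the inverse-Gamma variance scale.
    w = rng.chisquare(2 * nu, size=s) / (2 * nu)
    # Step 2: simulate the (here Gaussian) stable vector.
    g = rng.standard_normal((s, d))
    # Step 3: rescale -- omega = t^{1/alpha} * S with alpha = 2.
    return g / (lengthscale * np.sqrt(w)[:, None])

W = matern_rff_frequencies(s=4096, d=3, nu=2.5, rng=0)
```

Plugging `W` (plus random phases) into the cosine feature map of Section 1 gives Matérn RFFs without ever writing down the Bessel-function kernel.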
3. Approximation Rates, Error Bounds, and Derivative Estimates
RFFs uniformly approximate $k$ over compact sets with optimal rates. For $s$ features and a compact domain $\mathcal{X}\subset\mathbb{R}^d$ of diameter $\ell$,
$$\sup_{x,y\in\mathcal{X}}\bigl|z(x)^\top z(y)-k(x,y)\bigr|\;=\;O_p\!\left(\sqrt{\frac{d\log \ell}{s}}\right),$$
with the logarithmic factor determined by the covering entropy of the domain (Sriperumbudur et al., 2015).
This scaling is minimax optimal in empirical process theory. $L^r$ norms converge at the same rate, with constants depending on the domain and the kernel's spectral moments. Derivatives of the kernel can be approximated by augmenting the feature map with appropriately weighted frequency powers and phase shifts, yielding analogous rates under mild moment or bounded-support conditions on the spectral measure.
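To make the derivative remark concrete, here is a minimal numpy check assuming a Gaussian kernel with unit bandwidth; the sine-augmented, frequency-weighted map follows directly from differentiating the cosine features and is meant only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma = 5, 50_000, 1.0
W = rng.standard_normal((s, d)) / sigma        # omega_j ~ N(0, I / sigma^2)
b = rng.uniform(0, 2 * np.pi, size=s)

x, y = rng.standard_normal(d), rng.standard_normal(d)
cos_y = np.cos(W @ y + b)

# d/dx_j of cos(w^T x + b) is -w_j sin(w^T x + b): a phase-shifted,
# frequency-weighted feature.  Pair it with the plain cosine features of y.
grad_hat = (2.0 / s) * (-W * np.sin(W @ x + b)[:, None]).T @ cos_y

# Analytic gradient of the Gaussian kernel for comparison.
k_xy = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
grad_exact = -(x - y) / sigma ** 2 * k_xy
print(np.max(np.abs(grad_hat - grad_exact)))   # small for large s
```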
4. Extensions: Asymmetric, Indefinite, and Data-Adaptive Kernels
Bochner’s theorem restricts classical RFFs to symmetric, positive-definite kernels. Asymmetric and indefinite kernels require a generalization: the complex Fourier measure is decomposed into signed or complex parts, and its Jordan subcomponents are sampled systematically (He et al., 2022, Luo et al., 2021). The AsK-RFFs framework specifies four finite positive measures whose combinations (and associated feature maps) allow unbiased, uniformly convergent approximation of nonstandard kernels (e.g., directed-graph kernels, conditional probabilities, non-PD similarities).
Variance reduction schemes, such as Generalized Orthogonal Random Features (GORF), further decrease approximation error by sampling orthogonal directions in the frequency domain, proven to deliver strictly lower variance than i.i.d. variants for indefinite kernels (Luo et al., 2021).
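As an illustration of the orthogonality idea, the following sketch implements standard orthogonal random features for the Gaussian kernel: frequencies are orthogonal within each block of $d$ directions, with norms redrawn from a chi distribution so the marginal law is unchanged. This is a simplified construction; the GORF scheme of (Luo et al., 2021) extends the idea to indefinite kernels and is not reproduced here.

```python
import numpy as np

def orthogonal_rff_frequencies(s, d, sigma=1.0, rng=None):
    """Frequencies whose directions are orthogonal within each d-sized block,
    with row norms redrawn from the chi distribution so that each row still
    marginally follows N(0, I / sigma^2)."""
    rng = np.random.default_rng(rng)
    blocks, remaining = [], s
    while remaining > 0:
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                     # orthonormal directions
        norms = np.sqrt(rng.chisquare(d, size=d))  # ||N(0, I_d)|| is chi_d
        block = (norms[:, None] * Q) / sigma
        blocks.append(block[:min(remaining, d)])
        remaining -= d
    return np.vstack(blocks)

W = orthogonal_rff_frequencies(s=1024, d=16, sigma=2.0, rng=0)
```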
Data-dependent RFF selection via leverage-score weighting or teacher–learner hybrid optimization can reduce the required feature dimension for optimal risk by a factor depending on the kernel’s effective degrees of freedom, enabling minimax rates with substantially fewer features (Li et al., 2018, Wangni et al., 2017).
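A minimal empirical version of leverage-score reweighting (an illustrative sketch only; the estimators in (Li et al., 2018) are defined at the level of the spectral measure rather than a finite feature pool) draws a large pool of features, scores each column by its ridge leverage, and resamples a small importance-weighted subset.

```python
import numpy as np

def leverage_resampled_rff(Z_pool, s_small, lam=1e-2, rng=None):
    """Z_pool: (n, S) oversampled RFF matrix.  Returns an (n, s_small)
    reweighted feature matrix whose Gram matrix tracks Z_pool @ Z_pool.T."""
    rng = np.random.default_rng(rng)
    n, S = Z_pool.shape
    # Ridge leverage score of each feature column.
    A = Z_pool @ Z_pool.T + n * lam * np.eye(n)
    scores = np.einsum('ij,ij->j', Z_pool, np.linalg.solve(A, Z_pool))
    probs = scores / scores.sum()
    idx = rng.choice(S, size=s_small, replace=True, p=probs)
    # Importance-weight the sampled columns to keep the estimate unbiased.
    return Z_pool[:, idx] / np.sqrt(s_small * probs[idx])
```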
5. Quantization, Compression, and Numerical Estimation
Efficient deployment of RFFs at scale demands quantization and compression. Classical Lloyd-Max quantization can exploit the marginal distribution of a single RFF coordinate (shown to be independent of the kernel parameter for Gaussian kernels) to compactly encode features with as few as 1–2 bits each; the LM-RFF schemes of (Li et al., 2021), including a variance-minimizing variant for the high-similarity regime, are optimal under their respective criteria. Noise-shaping protocols (Sigma-Delta and distributed $\beta$-quantization) yield quantized embeddings whose kernel approximation error can decay exponentially in the bit rate and polynomially in the feature dimension (Zhang et al., 2021).
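The sketch below fits a Lloyd-Max-style codebook by one-dimensional k-means on samples of the RFF marginal, applied to the cosine values before the $1/\sqrt{s}$ normalization. It is a simplified stand-in for the LM-RFF construction; the bit allocation and codebook initialization are illustrative.

```python
import numpy as np

def lloyd_max_codebook(samples, bits=2, iters=100):
    """1-D Lloyd iterations: alternate nearest-codeword assignment and
    centroid updates (exactly k-means in one dimension)."""
    codebook = np.quantile(samples, np.linspace(0.05, 0.95, 2 ** bits))
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
        for j in range(codebook.size):
            if np.any(idx == j):
                codebook[j] = samples[idx == j].mean()
    return codebook

def quantize(Z, codebook):
    idx = np.argmin(np.abs(Z[..., None] - codebook), axis=-1)
    return codebook[idx]

# The marginal of sqrt(2) cos(w^T x + b) with b ~ Unif[0, 2pi) is the
# (scaled) arcsine law, independent of the kernel bandwidth, so the
# codebook can be fit once and reused.
rng = np.random.default_rng(0)
marginal = np.sqrt(2) * np.cos(rng.uniform(0, 2 * np.pi, size=100_000))
cb = lloyd_max_codebook(marginal, bits=2)
```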
Error quantification for RFF kernel approximations, previously limited to conservative theoretical bounds, can now employ bootstrap-based empirical estimation to compute data-adaptive confidence intervals for kernel errors and downstream prediction metrics (e.g., test MSE, SVM risk). These numerical techniques enable adaptive selection of feature dimension without knowledge of hidden kernel parameters or reliance on worst-case guarantees (Yao et al., 2023).
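A bare-bones version of the bootstrap idea (illustrative only; (Yao et al., 2023) additionally handles downstream prediction metrics and automated feature-dimension selection) resamples the $s$ features with replacement and uses the spread of the resampled estimates around the original estimate as a proxy for the unknown approximation error.

```python
import numpy as np

def bootstrap_kernel_error_bound(Z_x, Z_y, B=500, alpha=0.1, rng=None):
    """Z_x, Z_y: (m, s) RFF matrices for m evaluation pairs.  Uses the
    bootstrap proxy |k*_hat - k_hat| for the unknown error |k_hat - k|
    and returns a (1 - alpha) upper confidence bound on the max error."""
    rng = np.random.default_rng(rng)
    m, s = Z_x.shape
    k_hat = (Z_x * Z_y).sum(1)                    # original estimate
    sup_errs = []
    for _ in range(B):
        idx = rng.integers(0, s, size=s)          # resample the features
        k_star = (Z_x[:, idx] * Z_y[:, idx]).sum(1)
        sup_errs.append(np.max(np.abs(k_star - k_hat)))
    return np.quantile(sup_errs, 1 - alpha)
```

If the bound is too large, one simply increases $s$ and repeats, which is the adaptive-selection loop described above.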
6. Deep, End-to-End, and Multilayer RFF Architectures
RFFs are no longer limited to shallow kernel approximators. Deep architectures compose multiple layers of RFF modules: each layer may instantiate trainable kernel parameters (“deep kernel learning”), spectral distributions, or even parametric generator networks whose weights are learned end-to-end jointly with downstream learners (Xie et al., 2019, Fang et al., 2020). Stacked RFF layers induce composite kernels with substantially greater expressive capacity, enabling state-of-the-art generalization on small datasets while remaining competitive at large scale.
End-to-end generative RFF frameworks obviate separate feature selection and linearization stages, directly minimizing empirical risk with respect to both the kernel parameters and the linear classifier. Experimental evaluations have shown improved generalization and adversarial robustness over classical methods, especially with multi-layer generator designs and randomized resampling of kernel weights (Fang et al., 2020).
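A minimal trainable RFF layer in PyTorch (a generic sketch of “learnable frequencies”, not the specific architectures of (Xie et al., 2019) or (Fang et al., 2020)) looks as follows; stacking several such layers and training them jointly with a linear head gives the end-to-end setting described above.

```python
import math
import torch
from torch import nn

class TrainableRFF(nn.Module):
    """Random Fourier feature layer with learnable frequencies and phases."""
    def __init__(self, in_dim, num_features, bandwidth=1.0):
        super().__init__()
        # Initialize at the Gaussian-kernel spectral distribution, then let
        # gradient descent adapt the frequencies end-to-end.
        self.W = nn.Parameter(torch.randn(num_features, in_dim) / bandwidth)
        self.b = nn.Parameter(2 * math.pi * torch.rand(num_features))
        self.scale = math.sqrt(2.0 / num_features)

    def forward(self, x):
        return self.scale * torch.cos(x @ self.W.t() + self.b)

# Two stacked RFF layers followed by a linear classifier head.
model = nn.Sequential(TrainableRFF(20, 512), TrainableRFF(512, 512), nn.Linear(512, 10))
```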
7. Statistical Learning Theory and Large-Scale Regimes
Statistical guarantees for RFFs are firmly established for a broad range of losses (squared, Lipschitz, classification), with error rates governed by the regularization, the kernel's spectral decay, and the feature dimension. For kernel $k$-means and clustering, recent theory shows that the excess risk matches the full-kernel optimal rate, and $(1+\varepsilon)$ relative error bounds can be achieved with a feature dimension polylogarithmic in the sample size $n$ (Chen et al., 13 Nov 2025, Cheng et al., 2022). When the input dimension, sample size, and RFF feature dimension all scale comparably large ($d, n, s \to \infty$ at fixed ratios), random matrix theory precisely describes the spectral behavior of the RFF Gram matrix and reveals phase transitions and the double descent phenomenon in kernel regression and classification (Liao et al., 2020).
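In practice, approximate kernel $k$-means with RFFs amounts to mapping the data and running ordinary Lloyd's algorithm on the features. The minimal sketch below assumes a Gaussian kernel and synthetic one-dimensional-cluster data; `sklearn` is used only for the linear $k$-means step.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (-2.0, 0.0, 2.0)])

# Explicit Gaussian-kernel feature map (same construction as in Section 1).
s, sigma = 1024, 0.5
W = rng.standard_normal((s, X.shape[1])) / sigma
b = rng.uniform(0, 2 * np.pi, size=s)
Z = np.sqrt(2.0 / s) * np.cos(X @ W.T + b)

# Linear k-means on Z approximates Gaussian-kernel k-means at
# O(n * s) cost per Lloyd iteration instead of O(n^2).
labels = KMeans(n_clusters=3, n_init=10).fit_predict(Z)
```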
Isotropic scale-mixture representations enable concrete error analyses for all covered kernel families, with mean-square deviation and uniform bounds matching the classical $O(s^{-1/2})$ scaling (Langrené et al., 2024).
8. Practical Implications and Applications
The spectral mixture representation and associated sampling algorithms enable RFFs for nearly any isotropic shift-invariant kernel; classical, generalized, and newly constructed kernel families are all covered. Sampling complexity is $O(d)$ per feature, and accuracy is governed purely by the feature dimension. Variance reduction, quantization, and compression protocols (Lloyd-Max, Sigma-Delta, distributed quantization) make RFFs exceptionally efficient in memory and computational load at scale. Error estimation and adaptive selection (bootstrap confidence intervals, data-dependent sampling) further close the gap between theoretical guarantees and practical usage.
RFFs are employed in kernel ridge regression, SVMs, power $k$-means, Gaussian processes, neural operator learning, and recently in quantum ML “dequantization”—bridging the risk gap between quantum kernel/ridge/SVM models and classical RFF approximations for regression and classification (Sahebi et al., 21 May 2025). Operator learning in infinite-dimensional function spaces leverages RRFF (regularized RFF) and FEM-coupled reconstruction for robust PDE solution operators, with explicit spectral and sampling guarantees even under noise (Yu et al., 19 Dec 2025).
The RFF machinery—now generalized and extended—provides a universal toolkit for scalable, expressive, and statistically rigorous kernel learning across a range of theoretical and applied machine learning problems.