Random Fourier Features (RFFs) Essentials

Updated 24 June 2026

Random Fourier Features (RFFs) are randomized mappings based on Bochner's theorem that approximate shift-invariant kernels using finite-dimensional cosine and sine transforms.
They reduce computational complexity by replacing large kernel matrices with lower-dimensional feature matrices while offering uniform error bounds for function approximation and derivative estimation.
Recent innovations include advanced data-dependent sampling, normalized and orthogonal variants, deep architectures, and quantized approaches that enhance efficiency and extend RFF applicability.

Random Fourier Features (RFFs) provide a randomized, explicit feature mapping for shift-invariant kernels, enabling large-scale kernel machine learning by approximating the kernel with inner products in a finite-dimensional Euclidean space. The RFF methodology, originally proposed by Rahimi and Recht in 2007, has evolved into a sophisticated framework with deep theoretical guarantees, advanced sampling strategies, quantization methods, deep architectures, error estimation techniques, and a broad extension to new classes of kernels and tasks.

1. Mathematical Foundations and Construction of RFFs

The core principle underlying RFFs is Bochner's theorem: any continuous, shift-invariant, positive-definite kernel $k(x, y) = k(x-y)$ on $\mathbb{R}^d$ is the Fourier transform of a finite nonnegative measure. For normalized kernels, with $k(0) = 1$, that measure is a probability distribution $p(w)$:

$k(x, y) = \int_{\mathbb{R}^d} e^{i w^\top (x - y)} p(w) dw = \mathbb{E}_{w \sim p} [\cos(w^\top x) \cos(w^\top y) + \sin(w^\top x) \sin(w^\top y)]$

To approximate this expectation, one samples $D$ i.i.d. vectors $w_i \sim p(w)$ and, for the common cosine-bias formulation, $b_i \sim \text{Uniform}[0, 2\pi]$, and defines the explicit feature map:

$\phi(x) = \sqrt{\frac{2}{D}}\left[\cos(w_1^\top x + b_1),\dots,\cos(w_D^\top x + b_D)\right]^\top$

so that $\mathbb{E}[\phi(x)^\top \phi(y)] = k(x, y)$, and $\phi(x)^\top \phi(y) \approx k(x, y)$ for sufficiently large $D$.
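The construction above can be sketched in a few lines of NumPy. This is a minimal illustration for the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, whose spectral distribution is $N(0, \sigma^{-2} I)$; the helper name `rff_map` and all numeric values (`d`, `D`, `sigma`, the test points) are assumptions made for the example.

```python
import numpy as np

def rff_map(X, W, b):
    """Cosine-bias RFF map: each row x of X becomes sqrt(2/D) * cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, sigma = 5, 5000, 1.5            # illustrative values (assumptions)
X = rng.normal(size=(20, d))

# Gaussian kernel: frequencies are drawn from N(0, I / sigma^2).
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Phi = rff_map(X, W, b)                # 20 x D feature matrix
K_approx = Phi @ Phi.T                # inner products approximate the kernel

# Exact Gaussian kernel matrix for comparison.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))

max_abs_err = np.max(np.abs(K_approx - K_exact))
print(f"max |K_approx - K_exact| = {max_abs_err:.3f}")
```

Other shift-invariant kernels only change the line that draws `W`; the map itself stays the same.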

This construction enables classical kernel methods (SVM, KRR) to be recast as linear methods acting on $\phi(x)$, with the principal advantage that the $n \times n$ kernel matrix $K$ is replaced by an $n \times D$ feature matrix $\Phi$, dramatically reducing both time and memory complexity for large $n$ (Sriperumbudur et al., 2015).
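As a rough sketch of what "linear methods acting on $\phi(x)$" means in practice, kernel ridge regression can be replaced by ordinary ridge regression on the feature matrix, so the linear system solved is $D \times D$ rather than $n \times n$. The synthetic data, the regularization scaling `n * lam`, and all sizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D, sigma, lam = 500, 3, 300, 1.0, 1e-3   # illustrative values (assumptions)

# Synthetic regression problem (hypothetical data).
X = rng.uniform(-2.0, 2.0, size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Gaussian-kernel RFF feature matrix, n x D.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Primal ridge regression: a D x D solve instead of an n x n kernel solve.
A = Phi.T @ Phi + n * lam * np.eye(D)
theta = np.linalg.solve(A, Phi.T @ y)

y_hat = Phi @ theta
train_mse = np.mean((y_hat - y) ** 2)
print(f"training MSE = {train_mse:.4f}")
```

Forming `Phi.T @ Phi` costs $O(nD^2)$ time and $O(nD)$ memory, compared with $O(n^3)$ and $O(n^2)$ for a direct solve with the full kernel matrix.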

2. Theoretical Guarantees: Approximation Rates and Derivatives

Finite-sample uniform-approximation rates for RFFs have been established. For a compact domain $S \subset \mathbb{R}^d$ of diameter $|S|$ and $D$ features, with high probability,

$\sup_{x, y \in S} |\phi(x)^\top \phi(y) - k(x, y)| = O\left(\sqrt{\frac{\log |S|}{D}}\right)$

Thus, $D = O(\epsilon^{-2} \log |S|)$ features suffice for uniform error $\epsilon$ (Sriperumbudur et al., 2015).
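One informal way to see the dependence on the number of features is to track the largest kernel error over a fixed set of points as $D$ grows; as a Monte Carlo average, the RFF estimate is expected to improve roughly like $1/\sqrt{D}$. The sketch below assumes a Gaussian kernel and uses a maximum over a finite sample of points as a stand-in for the supremum over the whole domain.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 4, 1.0                       # illustrative values (assumptions)
X = rng.uniform(-1.0, 1.0, size=(50, d))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))

def max_error(D):
    """Largest absolute kernel error over all pairs of sample points."""
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
    return np.max(np.abs(Phi @ Phi.T - K_exact))

errors = {D: max_error(D) for D in (100, 400, 1600, 6400)}
for D, err in errors.items():
    # If the error scales like 1/sqrt(D), err * sqrt(D) stays roughly flat.
    print(f"D={D:5d}  max error={err:.4f}  err*sqrt(D)={err * np.sqrt(D):.2f}")
```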

These rates extend to $L^r$ norms for $1 \le r < \infty$ and, importantly, to all mixed derivatives of shift-invariant kernels. For multi-indices $p, q$ (with regular spectral decay), RFFs can achieve the same domain-dependent rates for $\partial^{p,q} k$ as for function values, supporting their use in tasks involving kernel derivatives (e.g., physics-informed learning, gradient-enhanced regression) (Szabo et al., 2018).
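To make the derivative claim concrete, the sketch below approximates the gradient $\nabla_x k(x, y)$ of a Gaussian kernel by differentiating the feature map, using $\nabla_x [\phi(x)^\top \phi(y)] = J_\phi(x)^\top \phi(y)$. The function names `phi` and `jac_phi` and the test points are assumptions for the example; only first-order derivatives are shown.

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, sigma = 3, 20000, 1.0             # illustrative values (assumptions)
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(x):
    """RFF feature vector of a single point x, shape (D,)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

def jac_phi(x):
    """Jacobian of phi at x, shape (D, d): row i is -sqrt(2/D) sin(w_i.x + b_i) w_i."""
    return -np.sqrt(2.0 / D) * np.sin(W @ x + b)[:, None] * W

x = np.array([0.3, -0.2, 0.5])
y = np.array([-0.1, 0.4, 0.0])

# RFF estimate of the gradient of k(x, y) with respect to x.
grad_approx = jac_phi(x).T @ phi(y)

# Closed form for the Gaussian kernel: -(x - y) / sigma^2 * k(x, y).
k_xy = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
grad_exact = -(x - y) / sigma**2 * k_xy

grad_err = np.max(np.abs(grad_approx - grad_exact))
print("approx:", grad_approx, " exact:", grad_exact)
```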

Empirical process arguments further refine these uniform bounds, giving almost sure control even as the input domain grows with the number of random features, provided the logarithm of its diameter grows slowly enough relative to $D$ (Sriperumbudur et al., 2015).

3. Sampling Schemes and Variance Reduction

Standard vs. Data-dependent Sampling

The vanilla RFF scheme samples $w_i$ i.i.d. from $p(w)$, but this ignores the data geometry and can require a large $D$ for high accuracy in ill-conditioned or low-noise regimes.

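Advanced, data-dependent sampling, most notably ridge leverage score (RLS) sampling, draws $w$ with probability proportional to

$\ell_\lambda(w) = p(w)\, z_w(X)^\top (K + n\lambda I)^{-1} z_w(X)$

where $z_w(X) \in \mathbb{R}^n$ collects the feature for frequency $w$ evaluated at the $n$ training points, $K$ is the kernel matrix, and $\lambda$ is the ridge parameter.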

This concentrates features on frequencies that contribute most to the kernel-regularized solution, reducing the required number of features $D$ from what vanilla RFF needs to a count governed by the effective degrees of freedom of the regularized problem (Li et al., 2018). Approximate and surrogate leverage sampling methods avoid the expensive matrix inversion that exact leverage scores require (Liu et al., 2019).
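A pool-based sketch of leverage-weighted sampling is given below. It is an illustration under stated assumptions, not the procedure of the cited papers: a pool of `M` candidate frequencies is drawn from $p(w)$, each candidate column $z$ is scored by $z^\top (K + n\lambda I)^{-1} z$ using the exact kernel matrix (feasible only because `n` is small here), and `D` candidates are resampled with importance weights so that the resulting features still estimate the pool's kernel approximation without bias.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma, lam = 100, 3, 1.0, 1e-2    # illustrative values (assumptions)
M, D = 2000, 1000                        # candidate pool size, final feature count
X = rng.normal(size=(n, d))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma**2))

# Candidate pool drawn from the Gaussian kernel's spectral distribution.
W_pool = rng.normal(scale=1.0 / sigma, size=(M, d))
b_pool = rng.uniform(0.0, 2.0 * np.pi, size=M)
Z = np.sqrt(2.0) * np.cos(X @ W_pool.T + b_pool)    # column j is z_j, shape n x M

# Ridge leverage score of each candidate: z_j^T (K + n*lam*I)^{-1} z_j.
solved = np.linalg.solve(K + n * lam * np.eye(n), Z)
scores = np.sum(Z * solved, axis=0)
probs = scores / scores.sum()

# Resample D candidates and reweight so that E[Phi Phi^T] = Z Z^T / M.
idx = rng.choice(M, size=D, replace=True, p=probs)
weights = 1.0 / np.sqrt(D * M * probs[idx])
Phi = Z[:, idx] * weights                            # n x D weighted features

rel_err = np.linalg.norm(Phi @ Phi.T - K) / np.linalg.norm(K)
print(f"relative Frobenius error = {rel_err:.3f}")
```

The exact solve with `K + n * lam * I` is the costly step that approximate and surrogate leverage schemes are designed to sidestep.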

Orthogonal and Distribution-dependent RFFs

Orthogonal random features (ORF) and Generalized Orthogonal Random Features (GORF) further reduce the estimator variance by imposing pairwise orthogonality on the frequency vectors. This achieves provably lower variance than independent RFFs for both positive-definite and indefinite stationary kernels, and leads to improved SVM/SVR accuracy and lower kernel matrix approximation errors (Luo et al., 2021).
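The orthogonality idea can be sketched for the Gaussian kernel with a commonly used block construction: take a uniformly distributed orthogonal matrix from a QR decomposition, then rescale its rows with chi-distributed norms so each row is still marginally Gaussian. The function name `orthogonal_frequencies` and the parameter values are assumptions, and this is the plain orthogonal variant rather than the generalized scheme for indefinite kernels.

```python
import numpy as np

def orthogonal_frequencies(D, d, sigma, rng):
    """Stack d x d blocks with mutually orthogonal rows and chi(d)-distributed row norms."""
    blocks = []
    for _ in range(-(-D // d)):                      # ceil(D / d) blocks
        Q, R = np.linalg.qr(rng.normal(size=(d, d)))
        Q = Q * np.sign(np.diag(R))                  # sign fix for a uniform orthogonal matrix
        norms = np.sqrt(rng.chisquare(df=d, size=d))
        blocks.append(norms[:, None] * Q)            # rows keep Gaussian-like lengths
    return np.vstack(blocks)[:D] / sigma

rng = np.random.default_rng(5)
d, D, sigma = 4, 4000, 1.0                           # illustrative values (assumptions)
W = orthogonal_frequencies(D, d, sigma, rng)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

X = rng.normal(size=(10, d))
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))
max_abs_err = np.max(np.abs(Phi @ Phi.T - K_exact))

# Within a block, distinct frequency vectors are orthogonal.
gram = W[:d] @ W[:d].T
off_diag = gram - np.diag(np.diag(gram))
print(f"max off-diagonal = {np.max(np.abs(off_diag)):.2e}, kernel error = {max_abs_err:.3f}")
```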

Distribution-dependent RFFs introduce an adaptive low-pass filter into the spectral density, reducing the number of features needed for a desired approximation error; the achievable reduction in feature dimension depends on the filter scale and is attainable in low-sample regimes (Mitra et al., 2021).

Normalization

Normalizing feature vectors (NRFF) by dividing out their norms further halves the variance of the RFF kernel estimator in moderate similarity regimes, yielding lower sample complexity at negligible computational cost (Li, 2016).
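The normalization step itself is a one-liner: each feature vector is divided by its Euclidean norm before inner products are taken, so the estimate becomes a cosine similarity between feature vectors. The sketch below assumes Gaussian-kernel features and illustrative sizes; it shows the mechanics only and does not attempt to verify the variance claim.

```python
import numpy as np

rng = np.random.default_rng(6)
d, D, sigma = 3, 2000, 1.0               # illustrative values (assumptions)
X = rng.normal(size=(8, d))

W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Normalized RFF: divide every feature vector by its own norm.
Phi_n = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)

K_rff = Phi @ Phi.T          # plain estimate, diagonal only approximately 1
K_nrff = Phi_n @ Phi_n.T     # normalized estimate, diagonal exactly 1

print("plain diagonal:     ", np.round(np.diag(K_rff), 3))
print("normalized diagonal:", np.round(np.diag(K_nrff), 3))
```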

4. Deep Architectures and End-to-End Kernel Learning

Random Fourier Features can be composed in deep (multi-layer) architectures, leading to "deep kernel machines," where each layer applies an RFF map, possibly with learned frequency parameters, and non-linearities via cos/sin blocks. End-to-end training over all RFF module parameters is achieved by full backpropagation through the sampling and feature construction steps (Xie et al., 2019).
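The layer composition can be sketched as a forward pass in which the output of one RFF map is the input of the next. The layer widths, the initialization, and the helper name `rff_layer` are assumptions; training is not shown, and in practice the frequency matrices and phases would be registered as trainable parameters in an automatic-differentiation framework.

```python
import numpy as np

def rff_layer(H, W, b):
    """One RFF block: sqrt(2/D_out) * cos(H W^T + b) applied row-wise."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(H @ W.T + b)

rng = np.random.default_rng(7)
layer_sizes = [8, 64, 32]                 # input dim, then feature widths (assumptions)

# Frequencies and phases per layer; these would be the learnable parameters.
params = [
    (rng.normal(size=(d_out, d_in)), rng.uniform(0.0, 2.0 * np.pi, size=d_out))
    for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

X = rng.normal(size=(16, layer_sizes[0]))
H = X
for W, b in params:
    H = rff_layer(H, W, b)                # output of one layer feeds the next

# A linear readout on the final features gives the model output.
readout = rng.normal(size=layer_sizes[-1])
outputs = H @ readout
print(H.shape, outputs.shape)
```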

Such deep RFF models, when coupled with label-alignment or generative modules that output spectral samples, combine kernel generalization (robustness on small datasets) with the expressivity of deep neural nets. Progressive (layer-wise) unfreezing schedules stabilize deeper training (Fang et al., 2020, Xie et al., 2019). Empirical evidence indicates strong performance both in small-data and large-scale image classification, matching or exceeding classical kernel and deep MLP baselines (Xie et al., 2019).

5. Error Estimation, Quantization, and Memory Efficiency

Explicit Error Estimation

Conventional bounds on RFF approximation error suffer from pessimism and unknown constants. Bootstrap-based, data-driven estimators resample RFF columns to empirically estimate quantiles of kernel error or downstream learning metrics, providing accurate, problem-specific confidence intervals. These methods adapt feature dimension $D$ to a target error at moderate computational overhead (Yao et al., 2023).
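The resampling idea can be sketched as a generic bootstrap over feature columns: the spread of the bootstrap kernel estimates around the full-sample estimate serves as a proxy for the unknown approximation error, without ever forming the exact kernel. This is a simplified illustration in the spirit of the approach, not the cited estimator; the function names, the max-entry error metric, and the 0.9 quantile are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, sigma = 40, 3, 1.0                   # illustrative values (assumptions)
X = rng.normal(size=(n, d))

def raw_features(D):
    """Unscaled feature columns sqrt(2) cos(X w + b), one column per frequency."""
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0) * np.cos(X @ W.T + b)

def bootstrap_error_quantile(Z, n_boot=200, q=0.9):
    """Bootstrap estimate of the q-quantile of the max-entry kernel error."""
    D = Z.shape[1]
    K_hat = Z @ Z.T / D
    errs = np.empty(n_boot)
    for t in range(n_boot):
        idx = rng.integers(0, D, size=D)           # resample columns with replacement
        K_star = Z[:, idx] @ Z[:, idx].T / D
        errs[t] = np.max(np.abs(K_star - K_hat))   # deviation from the full estimate
    return np.quantile(errs, q)

q_small = bootstrap_error_quantile(raw_features(50))
q_large = bootstrap_error_quantile(raw_features(2000))
print(f"estimated 90% error quantile: D=50 -> {q_small:.3f}, D=2000 -> {q_large:.3f}")
```

Increasing `D` until the estimated quantile falls below a target gives a simple rule for choosing the feature dimension.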

Quantization and Memory Constrained RFFs

Memory-efficient deployment is facilitated by quantized RFFs using low-bit-depth schemes: Lloyd–Max-type quantizers, as well as sophisticated noise-shaping (e.g., $\Sigma\Delta$) methods. Notably, the marginal distribution of a single RFF coordinate is independent of the kernel bandwidth (for Gaussian RFFs), so universal quantizers are optimal for all settings (Li et al., 2021). First- and second-order noise-shaping quantizers, even at 1–2 bits/coordinate, achieve fast decay of kernel approximation error with bits used, with controlled bias and variance for SVM, KRR, and kernel two-sample testing (Zhang et al., 2021).
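As a much simpler stand-in for the quantizers named above, the sketch below uses 1-bit stochastic rounding: a cosine value $c \in [-1, 1]$ is stored as $+1$ with probability $(1 + c)/2$ and as $-1$ otherwise, which keeps the kernel estimate unbiased when the two points are rounded independently. This is neither a Lloyd–Max nor a noise-shaping quantizer; the function name `stochastic_sign` and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
d, D, sigma = 3, 20000, 1.0               # illustrative values (assumptions)
X = rng.normal(size=(5, d))
Y = rng.normal(size=(5, d))

W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
CX = np.cos(X @ W.T + b)                  # unquantized coordinates in [-1, 1]
CY = np.cos(Y @ W.T + b)

def stochastic_sign(C, rng):
    """1-bit stochastic rounding: +1 with probability (1 + c) / 2, else -1."""
    return np.where(rng.uniform(size=C.shape) < (1.0 + C) / 2.0, 1, -1).astype(np.int8)

QX = stochastic_sign(CX, rng)             # one bit of information per coordinate
QY = stochastic_sign(CY, rng)

# Same estimator as full-precision RFF, applied to the quantized coordinates.
K_quant = (2.0 / D) * (QX.astype(np.float64) @ QY.astype(np.float64).T)

sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))
max_abs_err = np.max(np.abs(K_quant - K_exact))
print(f"max error with 1-bit features = {max_abs_err:.3f}")
```

The price of one bit per coordinate is higher variance per feature, which is why such schemes are typically paired with a larger `D`.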

Low-precision RFFs (LP-RFFs) enable high-rank approximations under strict memory budgets, providing superior held-out performance compared to low-rank (Nyström) schemes. The key determinant is the $(\Delta_1, \Delta_2)$-spectral approximation of the kernel, which is robust to quantization noise as long as the feature count $D$ is increased proportionally (Zhang et al., 2018).

6. Extensions: Beyond Classical Cases

Isotropic Kernels and Spectral Mixtures

For positive-definite isotropic kernels, which depend on $x$ and $y$ only through $\|x - y\|$, the spectral distribution can be decomposed as a scale mixture of symmetric $\alpha$-stable distributions. This yields a unified, ready-to-use RFF sampling blueprint for exponential power, Matérn, generalized Cauchy, Beta, Kummer, and Tricomi kernels. Sampling proceeds by drawing a "radius" $R$ from the kernel-specific mixing law, then using it to scale a symmetric $\alpha$-stable random vector (Langrené et al., 2024). This approach generalizes standard RFFs for the RBF (Gaussian) kernel, whose spectral distribution is itself Gaussian. Computationally, sampling for isotropic kernels is more efficient than for tensor-product kernels, since it requires fewer scalar draws per frequency vector, and all classical RFF error bounds remain valid.
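A familiar instance of the radius-times-stable-vector recipe is the Matérn kernel in the Gaussian ($\alpha = 2$) case. Its spectral distribution is a multivariate Student's $t$ with $2\nu$ degrees of freedom, so a frequency can be drawn as a Gaussian vector multiplied by a random radius $\sqrt{2\nu / u}$ with $u \sim \chi^2_{2\nu}$. The sketch below relies on that standard fact rather than on the cited paper's general formulas; the function name and parameter values are assumptions. With $\nu = 1/2$ the kernel is $\exp(-\|x - y\| / \ell)$, which gives an easy check.

```python
import numpy as np

def matern_frequencies(D, d, nu, length_scale, rng):
    """Frequencies for a Matern kernel: Gaussian vector times a random radius."""
    g = rng.normal(size=(D, d))                    # Gaussian (alpha = 2 stable) vector
    u = rng.chisquare(df=2.0 * nu, size=D)
    radius = np.sqrt(2.0 * nu / u)                 # kernel-specific mixing variable
    return radius[:, None] * g / length_scale

rng = np.random.default_rng(10)
d, D, nu, length_scale = 3, 20000, 0.5, 1.5        # illustrative values (assumptions)
X = rng.normal(size=(6, d))

W = matern_frequencies(D, d, nu, length_scale, rng)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Matern with nu = 1/2 is the exponential kernel exp(-||x - y|| / length_scale).
dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
K_exact = np.exp(-dists / length_scale)

max_abs_err = np.max(np.abs(Phi @ Phi.T - K_exact))
print(f"max error for Matern-1/2 = {max_abs_err:.3f}")
```

Each frequency vector here needs `d` Gaussian draws plus one scalar draw for the radius.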

Asymmetric and Indefinite Kernels

The extension of RFFs to asymmetric and non–PD shift-invariant kernels is achieved by representing the Fourier spectrum as a complex or signed measure, decomposed into four non-negative measures (Bochner, Jordan). The AsK-RFFs framework provides unbiased RFF approximations for kernels beyond the PD class, with theoretical convergence guarantees, and efficient subset-based estimation for the masses (integrals of spectral components) (He et al., 2022).
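The signed-measure idea can be illustrated on a symmetric but indefinite toy kernel: a difference of two Gaussians, $k(x, y) = e^{-\|x-y\|^2/(2\sigma_+^2)} - m\, e^{-\|x-y\|^2/(2\sigma_-^2)}$. Its spectrum is a difference of two nonnegative measures with masses $1$ and $m$ (not necessarily the minimal Jordan decomposition), so each part gets its own RFF map and the two estimates are subtracted. This covers only the real, symmetric case and is not the AsK-RFFs algorithm; the kernel, the known masses, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
d, D = 3, 10000                            # features per component (assumption)
sigma_pos, sigma_neg, mass_neg = 1.0, 2.0, 0.5

def gaussian_rff(X, sigma, D, rng):
    """Cosine-bias RFF features for a Gaussian kernel with bandwidth sigma."""
    W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(6, d))

# One feature map per nonnegative component of the signed spectral measure.
Phi_pos = gaussian_rff(X, sigma_pos, D, rng)
Phi_neg = gaussian_rff(X, sigma_neg, D, rng)

# Combine the two estimates using the component masses (1 and mass_neg).
K_approx = Phi_pos @ Phi_pos.T - mass_neg * (Phi_neg @ Phi_neg.T)

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = (np.exp(-sq_dists / (2.0 * sigma_pos**2))
           - mass_neg * np.exp(-sq_dists / (2.0 * sigma_neg**2)))

max_abs_err = np.max(np.abs(K_approx - K_exact))
print(f"max error for the indefinite kernel = {max_abs_err:.3f}")
```

In this toy case the masses are known in closed form; when they are not, they have to be estimated, which is the role of the subset-based estimation mentioned above.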

7. Applications and Empirical Performance

Random Fourier Features have been widely adopted in SVMs, kernel ridge regression, Gaussian processes, deep kernel architectures, operator learning for PDEs, and machine learning for wireless communications. In operator learning, RRFFs (with frequency-weighted Tikhonov regularization and Student's $t$ features) achieve robust, noise-tolerant learning of PDE solution operators, with consistent improvements in generalization and training time over unregularized RFFs, kernel, and neural operator methods (Yu et al., 19 Dec 2025).

Empirical evaluations consistently confirm the theoretical predictions: higher-rank RFF approximations (especially under quantization), deep RFF machines, GORF/NRFF schemes, and data-dependent sampling realize effective memory- and time-efficient large-scale learning, while controlling approximation error and preserving statistical efficiency.

Summary Table: Core RFF Classes and Variants

RFF Variant	Sampling Distribution	Main Use/Advantage
Classical RFF	$w_i \sim p(w)$ (kernel spectrum)	Shift-invariant, PD kernels
Leverage-RFF	Ridge leverage score density	Minimax statistical efficiency
NRFF (Normalized)	Any RFF + vector norm normalization	Reduced estimator variance
GORF/Orthogonal	$w_i$ with orthogonal directions	Variance minimization, indefinite kernels
Distribution-dep.	Data-adaptive $p(w)$, low-pass filter	Fewer features, low-sample regime
Isotropic RFF	Mixture from scale law (e.g., $\alpha$-stable)	General isotropic kernels
AsK-RFF	Complex measure (via Jordan/Bochner)	Asymmetric, non–PD kernels
Quantized/LP-RFF	Any of the above, low bit depth	Memory-efficient, bandwidth-limited
Deep RFF	Layered, frequencies learned	Expressivity, end-to-end training

References

(Sriperumbudur et al., 2015; Szabo et al., 2018) for optimal uniform/$L^r$ error rates and kernel derivative approximation
(Li, 2016) for normalized random Fourier features (NRFF)
(Li et al., 2018, Liu et al., 2019) for unified analyses and surrogate/data-dependent sampling
(Luo et al., 2021) for unbiased, orthogonal random features for indefinite/stationary kernels
(Xie et al., 2019, Fang et al., 2020) for deep architectures and end-to-end kernel learning with RFFs
(Yao et al., 2023) for bootstrap-based error estimation
(Li et al., 2021, Zhang et al., 2018, Zhang et al., 2021) for quantization and low-precision implementations
(Langrené et al., 2024) for spectral-mixture RFFs for isotropic kernels
(He et al., 2022) for AsK-RFFs, extending RFFs to asymmetric kernels
(Yu et al., 19 Dec 2025) for operator learning with regularized RFFs (RRFF-FEM)

This comprehensive landscape reflects the maturity, versatility, and ongoing innovation in the theory and application of Random Fourier Features in modern machine learning.