Random Features for Scalable Kernels

Updated 9 May 2026

Random features are a technique for approximating nonlinear kernel functions with randomized, finite-dimensional feature maps.
Optimized designs such as orthogonal and simplex random features reduce estimator variance, improving accuracy at a fixed feature budget.
RF methods extend to deep learning, time series, and control systems, and come with provable statistical guarantees.

Random features (RF) are a foundational paradigm for scalable approximation of kernel methods and nonparametric modeling in modern machine learning. RFs stochastically embed nonlinear kernel evaluations into finite-dimensional inner-product computations via randomly sampled feature maps, enabling linear models to achieve nonlinear expressive power with greatly improved computational efficiency. RF techniques are central in kernel regression/classification, scalable Gaussian processes, deep and hierarchical kernel learning, and efficient self-attention in transformers. Over the past decade, methodologies for constructing, analyzing, and optimizing RF schemes have seen rapid evolution—encompassing new variance-reduced couplings, universal and non-universal regimes, explicit regularization effects, positive and non-trigonometric mechanisms, and extensions beyond classical scalar-valued targets.

1. Foundations of Random Features: Construction and Mapping

RFs transform a positive-definite kernel $k:\mathcal X \times \mathcal X \to \mathbb R$ into an inner product by sampling random variables $\omega$ from an appropriate distribution and defining a feature map $\varphi(\omega;x)$. Classical random Fourier features (RFFs) approximate shift-invariant kernels $K(x,y) = K(x-y)$ via Bochner’s theorem, which gives

$K(x-y) = \int_{\mathbb R^d} p(w) e^{iw^\top (x-y)} dw$

with $w_1, \dots, w_m \sim p(w)$ drawn i.i.d. (the kernel is normalized so that $K(0) = 1$, making $p$ a probability density) and $\varphi(x) = \sqrt{\frac{1}{m}} [ \cos(w_i^\top x), \sin(w_i^\top x) ]_{i=1}^m$. The Monte Carlo sum $\hat K(x, y) = \varphi(x)^\top \varphi(y)$ is unbiased and converges in $L^2$ as $m \to \infty$ (Reid et al., 2023). For kernels lacking shift invariance, alternative spectral decompositions or numerical approximations are available, with compositional random features yielding efficient, sparse representations for structured kernels such as those associated with deep or convolutional neural networks (Daniely et al., 2017).
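As a concrete illustration of the construction above, the following minimal sketch builds random Fourier features for the Gaussian kernel and compares the Monte Carlo estimate with the exact Gram matrix. The kernel choice, the bandwidth `sigma`, the sizes, and the helper name `rff_features` are assumptions made for this example rather than details from the cited works.

```python
import numpy as np

def rff_features(X, W):
    """Map each row x of X to [cos(w_i^T x), sin(w_i^T x)]_{i=1..m} / sqrt(m)."""
    proj = X @ W.T                       # (n, m) projections w_i^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, m, sigma = 5, 4000, 1.0               # illustrative sizes (assumed)

# The spectral density of exp(-||x - y||^2 / (2 sigma^2)) is N(0, I / sigma^2).
W = rng.normal(scale=1.0 / sigma, size=(m, d))

X = rng.normal(size=(20, d))
Phi = rff_features(X, W)
K_hat = Phi @ Phi.T                      # Monte Carlo kernel estimate

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))
max_abs_err = float(np.max(np.abs(K_hat - K_exact)))
print(f"max |K_hat - K_exact| = {max_abs_err:.4f}")
```

The diagonal of `K_hat` is exact because $\cos^2 + \sin^2 = 1$; off-diagonal errors shrink at the usual Monte Carlo rate of roughly $1/\sqrt{m}$.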

RFs have also been generalized beyond scalar outputs to vector-valued outputs, time series with alignment-aware random warping features (Wu et al., 2018), and control-affine architectures (Kazemian et al., 2024), with corresponding construction of higher-dimensional feature maps.

2. Optimizing RF Schemes: Variance Reduction and Geometric Couplings

The efficiency of RF-based kernel approximation is governed by estimator variance. Significant progress has been made in variance reduction through both geometric design and optimal transport couplings. Orthogonal random features (ORFs) exploit the rotational invariance of the Gaussian distribution by employing Haar-random orthogonal matrices to ensure exact mutual orthogonality of the projection directions, reducing variance over i.i.d. construction (Reid et al., 2023).
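The variance reduction from orthogonality can be checked empirically. The sketch below is one common way to build a single orthogonal block (QR of a Gaussian matrix with a sign correction, rows rescaled by $\chi_d$-distributed norms) and compares its empirical variance with i.i.d. Gaussian frequencies for the Gaussian kernel at one displacement $z = x - y$. The sizes, the choice $\|z\| = 1$, and all function names are assumptions for illustration.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Sample a Haar-random orthogonal matrix via sign-corrected QR."""
    G = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))       # fix column signs so Q is Haar-distributed

def orf_block(d, rng):
    """d exactly orthogonal frequencies, each marginally N(0, I_d)."""
    Q = haar_orthogonal(d, rng)
    norms = np.sqrt(rng.chisquare(df=d, size=d))   # chi_d-distributed row norms
    return norms[:, None] * Q

def kernel_estimate(W, z):
    """phi(x)^T phi(y) for cos/sin features reduces to mean_i cos(w_i^T (x - y))."""
    return np.mean(np.cos(W @ z))

rng = np.random.default_rng(0)
d, trials = 8, 2000
z = np.zeros(d)
z[0] = 1.0                               # displacement x - y with unit norm (assumed)

iid = [kernel_estimate(rng.normal(size=(d, d)), z) for _ in range(trials)]
orf = [kernel_estimate(orf_block(d, rng), z) for _ in range(trials)]

var_iid, var_orf = float(np.var(iid)), float(np.var(orf))
exact = float(np.exp(-0.5 * z @ z))      # Gaussian kernel value at z
print(f"exact={exact:.4f}  mean_orf={np.mean(orf):.4f}")
print(f"var_iid={var_iid:.5f}  var_orf={var_orf:.5f}")
```

For more than $d$ features, independent orthogonal blocks are typically stacked, which is the block-orthogonal baseline referred to later in this section.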

Beyond orthogonality, simplex random features (SimRFs) arrange directions as the vertices of a regular simplex in $\mathbb{R}^d$ (i.e., the unit directions satisfy $\hat w_i^\top \hat w_j = -\frac{1}{d-1}$ for $i \neq j$), maximizing obtuse angular separation and providing the minimal mean square error (MSE) among all weight-independent geometrically coupled positive RF mechanisms (Reid et al., 2023). Further extensions, such as SimRFs+, introduce weight-dependent direction-norm correlations to achieve additional asymptotic variance reduction in specialized regimes, though at higher pre-processing cost.
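To make the simplex geometry concrete, the sketch below constructs $d$ unit vectors with pairwise inner product $-1/(d-1)$, applies a random rotation, and rescales by $\chi_d$ norms so that each frequency is still marginally Gaussian. This particular construction (centering the standard basis) is an assumption chosen for brevity; it matches the simplex geometry up to rotation but is not claimed to be the exact parameterization used by Reid et al.

```python
import numpy as np

def simplex_directions(d):
    """d unit vectors in R^d forming a regular simplex (they span a (d-1)-dim subspace)."""
    S = np.eye(d) - np.full((d, d), 1.0 / d)     # center the standard basis vectors
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def simplex_block(d, rng):
    """Randomly rotated simplex directions with chi_d-distributed norms."""
    G = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))                  # Haar-random rotation
    norms = np.sqrt(rng.chisquare(df=d, size=d))
    return norms[:, None] * (simplex_directions(d) @ Q)

d = 6
rng = np.random.default_rng(0)
W = simplex_block(d, rng)
unit = W / np.linalg.norm(W, axis=1, keepdims=True)
gram = unit @ unit.T                             # pairwise cosines of the frequencies
off_diag = gram[~np.eye(d, dtype=bool)]
print(off_diag.min(), off_diag.max(), -1.0 / (d - 1))
```

The rotation preserves all pairwise angles, so every pair of directions stays at the same obtuse angle while each individual direction is uniformly distributed on the sphere.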

Optimal transport-driven couplings, including orthogonalization and pairwise norm-coupling (PNC), minimize kernel estimator variance by leveraging the negative-monotone constraint for paired $\chi_d$-distributed norm variables, yielding further improvement over block-orthogonal baselines (Reid et al., 2024).

Non-trigonometric positive exponential RF families (e.g., OPRF, SDERF) are explicitly constructed to optimize parameterizations for minimal variance, including closed-form solutions for data-dependent log-variance objectives (Likhosherstov et al., 2023, Likhosherstov et al., 2022).
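For orientation, the sketch below implements the basic positive exponential feature map that these optimized families build on, using the identity $\exp(x^\top y) = \mathbb{E}_{w \sim \mathcal N(0, I)}[\exp(w^\top x - \|x\|^2/2)\exp(w^\top y - \|y\|^2/2)]$. It is the plain, unoptimized baseline and not OPRF or SDERF themselves; the sizes, the input normalization, and the function name are assumptions for this example.

```python
import numpy as np

def positive_features(X, W):
    """phi(x)_i = exp(w_i^T x - ||x||^2 / 2) / sqrt(m); every entry is strictly positive."""
    sq_norms = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - sq_norms) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
n, d, m = 10, 4, 20000                       # illustrative sizes (assumed)

X = rng.normal(size=(n, d))
X = 0.5 * X / np.linalg.norm(X, axis=1, keepdims=True)   # keep norms small to limit variance

W = rng.normal(size=(m, d))                  # i.i.d. Gaussian frequencies
Phi = positive_features(X, W)
K_hat = Phi @ Phi.T                          # estimate of exp(x^T y)
K_exact = np.exp(X @ X.T)

max_rel_err = float(np.max(np.abs(K_hat - K_exact) / K_exact))
print(f"max relative error = {max_rel_err:.4f}")
```

Because every feature is positive, the kernel estimate can never be negative, which is the property that makes such maps attractive for approximating softmax attention; the variance grows quickly with $\|x + y\|$, which is what the optimized parameterizations aim to control.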

3. Universality, Asymptotic Risk, and Breakdown of Gaussian Equivalence

Universality principles establish when non-Gaussian RF embeddings behave asymptotically like linear Gaussian surrogates, allowing precise nonasymptotic or scalarized characterizations of training and generalization risks. In proportional growth regimes (sample size $n$, input dimension $d$, and number of features $m$ tending to infinity at fixed ratios), universality holds for a broad class of activation functions and convex regularizers. Here, random feature regression with general convex penalties, including $\ell_1$ and elastic net, admits computation of exact learning curves and phase transitions (e.g., double descent) via a four-dimensional scalar minimax problem, exploiting hierarchical applications of the convex Gaussian min-max theorem (CGMT) (Bosch et al., 2022).

However, Gaussian equivalence can break down in more exotic scaling regimes, particularly the quadratic scaling (sample size and feature count of order $d^2$ in the input dimension $d$) or when target functions depend on low-dimensional projections. In these cases, surviving low-dimensional Hermite (Wiener chaos) components render the distribution of RF-induced functions non-Gaussian, and benchmark phenomena such as test error curves, boundary shapes, and double descent peaks are predicted inaccurately by naive surrogate models. Conditional Gaussian equivalence (CGE) surrogates, which condition on the relevant low-dimensional chaos components, remain valid and yield correct limiting risk predictions (Wen et al., 3 Dec 2025).

Deep random feature models, when analyzed via CGMT and universality, exhibit depth-dependent kernel spectrum shrinkage which directly impacts generalization, and only the first two moments at each layer are asymptotically relevant (Bosch et al., 2023).

4. Implicit Regularization and Bayesian Interpretation

Finite random features induce an implicit regularization effect in kernel ridge regression: the averaged RF predictor is closely matched by a kernel-ridge solution with an enlarged effective ridge parameter $\tilde\lambda > \lambda$, which decreases monotonically to $\lambda$ as the number of features grows (Jacot et al., 2020). Precise random-matrix-theoretic analysis quantifies this relationship via a fixed-point equation,

$\tilde\lambda = \lambda + \frac{\tilde\lambda}{\gamma} \frac{1}{n} \sum_{i=1}^{n} \frac{d_i}{\tilde\lambda + d_i}$

for $\gamma = m/n$ the ratio of features to samples and $d_i$ the eigenvalues of the normalized kernel Gram matrix $\frac{1}{n} K(X, X)$.
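The effective ridge can be computed numerically. The sketch below assumes the fixed-point relation takes the form $\tilde\lambda = \lambda + \frac{\tilde\lambda}{\gamma}\frac{1}{n}\sum_i \frac{d_i}{\tilde\lambda + d_i}$, with $\gamma$ the features-to-samples ratio and $d_i$ the eigenvalues of the Gram matrix divided by $n$, and solves it by bisection for $\lambda > 0$. The Gaussian kernel, the data, and the function names are assumptions made for illustration.

```python
import numpy as np

def fixed_point_residual(t, eigvals, lam, gamma):
    """Residual t - lam - (t / gamma) * mean(d_i / (t + d_i)); zero at the effective ridge."""
    return t - lam - (t / gamma) * np.mean(eigvals / (t + eigvals))

def effective_ridge(eigvals, lam, gamma, tol=1e-13, max_iter=200):
    """Bisection on [lam, lam + mean(d)/gamma]; the residual is <= 0 at lam and >= 0 at the top."""
    lo, hi = lam, lam + np.mean(eigvals) / gamma
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if fixed_point_residual(mid, eigvals, lam, gamma) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n, dim, lam = 50, 3, 0.01                      # illustrative values (assumed)
X = rng.normal(size=(n, dim))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                          # Gaussian kernel Gram matrix
eigvals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)

ridges = {g: effective_ridge(eigvals, lam, g) for g in (0.5, 2.0, 10.0)}
for g, r in ridges.items():
    print(f"gamma={g:5.1f}  effective ridge={r:.5f}")
```

Under this form of the equation, the computed value always exceeds $\lambda$ and shrinks toward it as $\gamma$ grows, in line with the monotone behavior described above.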

The Bayesian viewpoint aligns ridge-regularized RF regression with a Gaussian prior and likelihood, supporting tractable posterior predictive formulas and uncertainty estimates. Robust Bayesian RF regression under contaminated priors and contaminated likelihoods admits efficient upper and lower bounds on predictive intervals and variances, with the classical double-descent phase transition structure preserved under moderate contamination (Caprio et al., 22 Feb 2026).
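The uncontaminated Gaussian case has a closed form, sketched below as standard Bayesian linear regression on random Fourier features. The prior precision `alpha`, the noise variance, the toy data, and the function names are assumptions for illustration; the contaminated setting of Caprio et al. is not implemented here.

```python
import numpy as np

def rff_features(X, W):
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

def posterior(Phi, y, alpha, noise_var):
    """Posterior N(mean, cov) over weights for prior N(0, I/alpha) and Gaussian noise."""
    precision = alpha * np.eye(Phi.shape[1]) + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def predictive(Phi_new, mean, cov, noise_var):
    """Posterior predictive mean and variance at new feature rows."""
    mu = Phi_new @ mean
    var = noise_var + np.einsum("ij,jk,ik->i", Phi_new, cov, Phi_new)
    return mu, var

rng = np.random.default_rng(0)
n, m, alpha, noise_var = 200, 50, 1.0, 0.01      # illustrative values (assumed)
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + np.sqrt(noise_var) * rng.normal(size=n)

W = rng.normal(size=(m, 1))                      # Gaussian-kernel frequencies
Phi = rff_features(X, W)
mean, cov = posterior(Phi, y, alpha, noise_var)

X_new = np.linspace(-2.5, 2.5, 50)[:, None]
mu, var = predictive(rff_features(X_new, W), mean, cov, noise_var)

# The posterior mean coincides with ridge regression at lambda = alpha * noise_var.
ridge = np.linalg.solve(Phi.T @ Phi + alpha * noise_var * np.eye(2 * m), Phi.T @ y)
rmse = float(np.sqrt(np.mean((mu - np.sin(X_new[:, 0])) ** 2)))
print(f"rmse={rmse:.4f}  max predictive std={np.sqrt(var.max()):.4f}")
```

The final check makes the correspondence explicit: the posterior mean is the ridge estimator, and the predictive variance adds a data-dependent term on top of the noise floor.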

5. Extensions to Non-Standard and Structured Domains

Random features have been extended to non-Euclidean settings, sequence data, and control systems:

Time series: Random Warping Series (RWS) approximates alignment-aware kernels for dynamic time warping, reducing complexity from quadratic to linear in both the number and the length of the time series, and providing uniform convergence with optimal sample complexity (Wu et al., 2018).
Compositional kernels: RFs structured via computation skeletons and power series activations yield efficiently computable and memory-compact embeddings for deep or convolutional architectures (Daniely et al., 2017).
Control-affine models: RF-based embeddings for control-affine systems preserve convexity necessary for downstream optimization-based controller synthesis, with architecture-specific random feature maps for ADP and AD kernels (Kazemian et al., 2024).
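The time-series case in the list above can be sketched as follows: each series is embedded by its dynamic time warping (DTW) alignment cost against a small set of short random series. This is a simplified illustration under stated assumptions, namely Gaussian random series with uniformly random lengths and the raw DTW cost used directly as the feature value; the exact feature definition and sampling scheme of Wu et al. may differ, and all names here are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW with squared pointwise cost."""
    la, lb = len(a), len(b)
    D = np.full((la + 1, lb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[la, lb]))

def random_warping_features(series_list, R, max_len, sigma, rng):
    """Embed each series as its DTW cost to R short random series, scaled by 1/sqrt(R)."""
    random_series = [
        rng.normal(scale=sigma, size=int(rng.integers(1, max_len + 1)))
        for _ in range(R)
    ]
    Z = np.array([[dtw_distance(x, w) for w in random_series] for x in series_list])
    return Z / np.sqrt(R)

rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, 6.0, 40))
series = [base, base.copy(), np.cos(np.linspace(0.0, 6.0, 55)), rng.normal(size=30)]

Z = random_warping_features(series, R=16, max_len=5, sigma=1.0, rng=rng)
print(Z.shape)
```

Each series of length $L$ is compared with $R$ random series of bounded length, so the embedding cost grows linearly in both the number of series and $L$, whereas a full pairwise DTW Gram matrix grows quadratically in both. Series of different lengths map to vectors of the same dimension $R$.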

6. Statistical Learning Guarantees, Consistency, and Sample Complexity

For both scalar and vector-valued problems, RF ridge regression achieves strong consistency and minimax-optimal excess risk rates of order $n^{-1/2}$ in the sample size $n$, with a sharp feature-to-sample complexity tradeoff: $m \asymp \sqrt{n}$ random features suffice (without extra log-factors) for optimal generalization error, so that $m \gtrsim \varepsilon^{-1}$ features and $n \gtrsim \varepsilon^{-2}$ samples are needed for a target MSE $\varepsilon$ (Lanthaler et al., 2023). Refined analyses for non-smooth losses (quantile regression) and agnostic settings show that minimax rates are achievable via data-dependent ("leverage-score") sampling, which reduces the required number of features $m$ to an order sublinear in $n$ (Wang et al., 2024).

Double descent phenomena, originally observed for kernel methods and neural networks, are rigorously shown to persist in both full and contaminated Bayesian RF regimes, as well as for RF regression trained by SGD under various step-size schedules. The dependence of the excess risk on the over-parameterization ratio $m/n$ (features to samples) is explicitly characterized, with monotonic bias decay and a variance curve that peaks at the interpolation threshold, substantiating the practical use of SGD over exact solvers (Liu et al., 2021, Caprio et al., 22 Feb 2026).

7. Practical Algorithms, Implementation, and Applications

RF implementation in practice is dominated by transform evaluations, vector-matrix multiplication, and sparse hashing for compositional/de-duplicated schemes. Modern positive, bounded, and non-trigonometric RF architectures (e.g., FAVOR#, OPRF, GERF, CRT) are designed for efficient and robust self-attention in transformers, often leveraging block-orthogonalization and variance optimization (Likhosherstov et al., 2023, Likhosherstov et al., 2022). Algorithms with variance-reducing couplings exploit orthogonalization, pairwise/antithetic norm matching, and optimal transport-based matchings to maximize approximation quality per unit compute (Reid et al., 2024). Domain-specific RFs (time-series, graph, control) require specialized data-dependent or structure-preserving construction of the random feature maps.

Empirical evaluations across UCI datasets, large-scale vision and NLP transformers, control benchmarks, and graph inference tasks consistently confirm theoretical improvements in variance, accuracy, and computational efficiency delivered by optimized RF schemes.

References:

"Simplex Random Features" (Reid et al., 2023)
"Variance-Reducing Couplings for Random Features" (Reid et al., 2024)
"Chefs' Random Tables: Non-Trigonometric Random Features" (Likhosherstov et al., 2022)
"FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features" (Likhosherstov et al., 2023)
"Random Features Model with General Convex Regularization: A Fine Grained Analysis with Precise Asymptotic Learning Curves" (Bosch et al., 2022)
"When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling" (Wen et al., 3 Dec 2025)
"Precise Asymptotic Analysis of Deep Random Feature Models" (Bosch et al., 2023)
"Random Warping Series: A Random Features Method for Time-Series Embedding" (Wu et al., 2018)
"Random Features for Compositional Kernels" (Daniely et al., 2017)
"Random Features Approximation for Control-Affine Systems" (Kazemian et al., 2024)
"Implicit Regularization of Random Feature Models" (Jacot et al., 2020)
"Error Bounds for Learning with Vector-Valued Random Features" (Lanthaler et al., 2023)
"Optimal Kernel Quantile Learning with Random Features" (Wang et al., 2024)
"Robust Predictive Uncertainty and Double Descent in Contaminated Bayesian Random Features" (Caprio et al., 22 Feb 2026)
"On the Double Descent of Random Features Models Trained with SGD" (Liu et al., 2021)