Positive Orthogonal Random Features

Updated 2 May 2026
  • Positive orthogonal random features are nonnegative feature maps that unbiasedly approximate kernels with reduced variance by employing geometric coupling of projection directions.
  • They use simplex-based constructions to uniformly spread feature directions, significantly lowering variance and improving kernel matrix approximations.
  • Empirical studies show up to 90% lower mean squared error and enhanced classification accuracy compared to conventional random feature methods.

Positive orthogonal random features (PRFs with orthogonality constraints) constitute a class of random feature methods designed for unbiased, nonnegative kernel approximation with reduced variance, particularly for exponential-dot-product kernels such as the softmax and Gaussian kernels. The central idea is to use feature maps whose coordinates are strictly positive and whose underlying projection directions are not merely independent, but are arranged according to specific orthogonality-like or geometric couplings, thereby minimizing estimator variance. These constructions establish foundational algorithms for scalable kernel machines and linear-attention architectures.

1. Motivation and Classical Positive Random Features

Classical random feature (RF) approximations of kernels, inspired by Bochner’s theorem, project inputs onto random independent directions, typically sampled i.i.d. from an isotropic Gaussian. For the softmax kernel $k(x, y) = \exp(x^\top y)$ or Gaussian kernel $k(x, y) = \exp(-\tfrac{1}{2}\|x-y\|^2)$, Performer-style positive random features (PRFs) replace trigonometric/complex RFFs with

$$\phi_i(x) = \frac{1}{\sqrt{m}} \exp\left(w_i^\top x - \tfrac{1}{2}\|x\|^2\right), \quad w_i \sim \mathcal{N}(0, I), \quad i = 1, \dots, m.$$

As written, this map targets the softmax kernel; the Gaussian kernel is obtained by multiplying each feature by a further factor $\exp(-\tfrac{1}{2}\|x\|^2)$. The mapping ensures non-negativity ($\phi_i(x) \ge 0$ for all $x$), unbiasedness ($\mathbb{E}[\phi(x)^\top \phi(y)] = k(x, y)$), and stability in linear-attention models, because the attention normalizers cannot become negative or vanish. However, i.i.d. direction sampling results in high estimator variance, limiting accuracy and efficiency in practical approximations (Likhosherstov et al., 2023, Likhosherstov et al., 2022).
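The i.i.d. baseline described above can be sketched in a few lines of NumPy. This is a minimal illustration for the softmax kernel, not code from the cited papers; the function name `iid_prf` and the test vectors are assumptions chosen for the example.

```python
import numpy as np

def iid_prf(X, m, rng):
    """Positive random features for exp(x^T y) with i.i.d. Gaussian directions (illustrative)."""
    d = X.shape[1]
    W = rng.standard_normal((m, d))                      # rows w_i ~ N(0, I_d)
    half_sq_norm = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(m)   # every entry is > 0

rng = np.random.default_rng(0)
x = np.array([0.3, -0.2, 0.1])
y = np.array([0.1, 0.4, -0.3])

# x and y must share the same random directions, so featurize them together.
Phi = iid_prf(np.stack([x, y]), m=50_000, rng=rng)
estimate = float(Phi[0] @ Phi[1])   # Monte Carlo estimate of the kernel
exact = float(np.exp(x @ y))        # softmax kernel value
```

With many features, `estimate` should lie close to `exact`, while for small `m` the spread across random draws is what the coupled constructions below aim to reduce.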

2. Orthogonality and Geometric Coupling in Random Features

Variance reduction in RF kernel approximation exploits couplings among projection directions. Orthogonal Random Features (ORFs) enforce exact orthogonality among the normalized directions $\{w_i/\|w_i\|\}$ within each block of features, “spreading out” the samples uniformly on the sphere and reducing estimator variance compared to i.i.d. sampling. The orthogonalization does not affect unbiasedness because it leaves the marginal distribution of each $w_i$ unchanged. However, for the strictly positive RFs demanded by linear-transformer applications, classical ORFs are not guaranteed to attain the minimum variance among all possible geometrical couplings (Reid et al., 2023).
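One common way to realize an orthogonal block is to orthogonalize a Gaussian matrix with a QR decomposition and then redraw the row norms from a chi distribution, so that each row is again marginally Gaussian. The sketch below follows that recipe; the helper name `orthogonal_block` is an assumption, and the sign correction is the usual step for making the QR factor Haar-distributed.

```python
import numpy as np

def orthogonal_block(d, rng):
    """One d x d block of ORF weights: orthogonal directions, chi_d-distributed norms (illustrative)."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))                     # sign fix so Q follows the Haar measure
    norms = np.sqrt(rng.chisquare(df=d, size=d))    # norm of a d-dimensional standard Gaussian
    return norms[:, None] * Q                       # row i is norms[i] times a unit vector

rng = np.random.default_rng(0)
W = orthogonal_block(5, rng)
directions = W / np.linalg.norm(W, axis=1, keepdims=True)
gram = directions @ directions.T                    # identity if the rows are orthonormal
```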

3. Construction of Positive Orthogonal Random Features

3.1 SimRFs: Simplex Random Features

Simplex Random Features (SimRFs) represent an optimal construction in the class of weight-independent, geometrically-coupled positive RFs. The algorithm organizes $m$ feature directions into $B = \lceil m/d \rceil$ independent blocks of size $d$ (the ambient dimension). For each block:

  • Sample $d$ independent norms $\|w_1\|, \dots, \|w_d\| \sim \chi_d$.
  • Form the simplex-projection matrix $S \in \mathbb{R}^{d \times d}$, whose rows $s_i$ are unit vectors pointing to the vertices of a regular $(d-1)$-simplex ($s_i^\top s_j = -\tfrac{1}{d-1}$ for $i \neq j$).
  • Draw a random rotation $R \sim \mathrm{Haar}(O(d))$.
  • Set $W = D S R$ with $D = \mathrm{diag}(\|w_1\|, \dots, \|w_d\|)$, so that row $i$ of $W$ is $w_i^\top = \|w_i\| \, s_i^\top R$.

The feature map is given by

$$\phi(x) = \frac{1}{\sqrt{m}} \exp\left(W x - \tfrac{1}{2}\|x\|^2\right),$$

where the exponential acts elementwise and the block matrices are stacked row-wise.

This coupling fixes all pairwise angles among the direction vectors in a block to the common obtuse simplex angle, $\cos\theta = -\tfrac{1}{d-1}$, spreading the block as evenly as possible and minimizing a key variance proxy (the RF-conformity analyzed in Section 4) (Reid et al., 2023).
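The block construction above can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the simplex is parametrized as the normalized, mean-centered standard basis (any other regular simplex of unit vectors differs only by a rotation, which the Haar rotation absorbs), norms are drawn from a chi distribution with $d$ degrees of freedom so that each row stays marginally Gaussian, and all function names are hypothetical.

```python
import numpy as np

def haar_rotation(d, rng):
    """Random orthogonal matrix via QR with the standard sign correction."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

def simplex_matrix(d):
    """Rows are unit vectors with pairwise inner product -1/(d-1)."""
    return np.sqrt(d / (d - 1)) * (np.eye(d) - np.full((d, d), 1.0 / d))

def simrf_weights(d, m, rng):
    """Stack ceil(m/d) independent simplex blocks and keep the first m rows."""
    blocks = []
    for _ in range(-(-m // d)):                           # ceil(m / d)
        norms = np.sqrt(rng.chisquare(df=d, size=d))      # chi_d norms
        blocks.append(norms[:, None] * (simplex_matrix(d) @ haar_rotation(d, rng)))
    return np.vstack(blocks)[:m]

def simrf_features(X, W):
    """Positive features for the softmax kernel exp(x^T y)."""
    half_sq_norm = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, m = 4, 10
W = simrf_weights(d, m, rng)
X = 0.3 * rng.standard_normal((5, d))
Phi = simrf_features(X, W)        # shape (5, 10), strictly positive
K_hat = Phi @ Phi.T               # approximates exp(X @ X.T)
```

Materializing `simplex_matrix(d)` costs $O(d^2)$ memory and is done here only for readability.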

3.2 SimRFs+: Weight-Dependent Geometries

SimRFs+ extends SimRFs by allowing weight-dependent couplings: directions may depend on their drawn norms. In the small-$v$ regime ($v = \|x + y\| \ll 1$), minimizing the sum $\sum_{i \neq j} \|w_i + w_j\|^2$ yields an iterative re-alignment in which each vector is oriented opposite to the sum of the others. SimRFs+ achieves the asymptotic minimum MSE in the full weight-dependent class but incurs higher computational cost (an additional optimization per block). Empirically, the gains of SimRFs+ over SimRFs are marginal for most practical cases (Reid et al., 2023).
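The re-alignment step can be illustrated with a simple coordinate-wise loop: holding the norms fixed, each vector is pointed opposite to the sum of the others, which cannot increase $\sum_{i \neq j} \|w_i + w_j\|^2$. The sketch below is an assumption-laden illustration (function names, sweep count, and the simplex starting point are choices made here), not the authors' procedure; a final shared random rotation is omitted because the objective is rotation-invariant.

```python
import numpy as np

def pair_sum_objective(W):
    """Sum over i != j of ||w_i + w_j||^2."""
    sums = W[:, None, :] + W[None, :, :]
    sq = np.sum(sums**2, axis=-1)
    return float(sq.sum() - np.trace(sq))

def realign(W, sweeps=10, eps=1e-12):
    """Point each w_i opposite the sum of the other rows while keeping its norm."""
    W = W.copy()
    norms = np.linalg.norm(W, axis=1)
    for _ in range(sweeps):
        for i in range(W.shape[0]):
            rest = W.sum(axis=0) - W[i]
            rest_norm = np.linalg.norm(rest)
            if rest_norm > eps:                 # leave w_i untouched if the others cancel
                W[i] = -norms[i] * rest / rest_norm
    return W

rng = np.random.default_rng(0)
d = 4
norms = np.sqrt(rng.chisquare(df=d, size=d))
simplex = np.sqrt(d / (d - 1)) * (np.eye(d) - 1.0 / d)   # unit rows, pairwise dot -1/(d-1)
W0 = norms[:, None] * simplex                             # SimRF-style starting block
W1 = realign(W0)
before, after = pair_sum_objective(W0), pair_sum_objective(W1)
```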

4. Theoretical Analysis: MSE Optimality and RF-Conformity

The estimator's mean squared error (MSE) for unbiased PRF mechanisms with unit-Gaussian marginals is

$$\mathrm{MSE}\big(\widehat{k}(x, y)\big) = \frac{e^{-\|x\|^2 - \|y\|^2}}{m} \left[ \left(e^{2v^2} - e^{v^2}\right) + (m - 1)\left(\rho(x, y) - e^{v^2}\right) \right],$$

with $\widehat{k}(x, y) = \phi(x)^\top \phi(y)$ and $v = \|x + y\|$, and the “RF-conformity” term $\rho(x, y) = \frac{1}{m(m-1)} \sum_{i \neq j} \mathbb{E}\left[\exp\left((w_i + w_j)^\top (x + y)\right)\right]$ is minimized when the direction vectors form a simplex. The simplex arrangement achieves the unique minimum of $\rho(x, y)$ in the class of weight-independent schemes, establishing SimRFs as MSE-optimal among all such positive orthogonal random feature mappings (Theorem 3.3 of Reid et al., 2023). For arbitrary weight-dependent couplings, SimRFs+ approaches the theoretical minimum in the small-$v$ regime.
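As a sanity check on the i.i.d. baseline, the cross terms between independent directions vanish, because $\mathbb{E}[\exp((w_i + w_j)^\top (x + y))] = e^{\|x+y\|^2}$ for independent Gaussians. For the softmax-kernel features of Section 1 this leaves the closed form $e^{-\|x\|^2 - \|y\|^2}\,(e^{2\|x+y\|^2} - e^{\|x+y\|^2})/m$. The sketch below compares that expression with a Monte Carlo estimate; the function name and the test vectors are assumptions made for the example.

```python
import numpy as np

def iid_prf_mse(x, y, m):
    """Closed-form MSE of the i.i.d. PRF estimator of exp(x^T y) with m features."""
    v2 = np.sum((x + y) ** 2)
    return np.exp(-(x @ x) - (y @ y)) * (np.exp(2 * v2) - np.exp(v2)) / m

rng = np.random.default_rng(0)
x = np.array([0.3, -0.2, 0.1])
y = np.array([0.1, 0.4, -0.3])
m, trials = 16, 20_000

# Each trial draws m fresh directions and forms phi(x)^T phi(y).
W = rng.standard_normal((trials, m, x.size))
estimates = np.exp(W @ (x + y) - 0.5 * (x @ x + y @ y)).mean(axis=1)

empirical_mse = float(np.mean((estimates - np.exp(x @ y)) ** 2))
closed_form_mse = float(iid_prf_mse(x, y, m))
```

Coupled schemes such as ORFs or SimRFs change only the cross-term contribution, so this value is the reference against which their reductions are measured.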

5. Algorithmic Implementation and Computational Complexity

Algorithmic details for a block of $d$ features in SimRFs:

  • Draw norms, assemble the simplex-projection matrix, sample a random rotation.
  • Map inputs with sequential steps: random rotation ($O(d^3)$ generation, $O(d^2)$ application; optionally replaced by Hadamard transforms for $O(d \log d)$), simplex projection ($O(d)$), diagonal scaling.
  • Output: strictly positive feature map $\phi(x)$.

The computational cost and memory footprint of SimRFs match those of ORFs asymptotically. Using fast structured transforms (e.g., randomized HD-products) reduces per-block processing time and memory (Reid et al., 2023).
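The simplex projection itself need not be stored as a dense matrix. Assuming the simplex is parametrized as the normalized, mean-centered standard basis (an illustrative choice, equivalent to any other regular simplex up to a rotation), applying it to a rotated input reduces to subtracting the mean and rescaling, which takes linear time per vector. The function name below is hypothetical.

```python
import numpy as np

def simplex_project(z):
    """Apply S = sqrt(d/(d-1)) * (I - 11^T/d) to the last axis of z in O(d) per vector."""
    d = z.shape[-1]
    return np.sqrt(d / (d - 1)) * (z - z.mean(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
d = 6
z = rng.standard_normal((3, d))                                    # three already-rotated inputs
S_dense = np.sqrt(d / (d - 1)) * (np.eye(d) - np.full((d, d), 1.0 / d))
fast = simplex_project(z)
dense = z @ S_dense.T                                              # O(d^2) reference computation
```

The remaining per-block cost is then dominated by the rotation, which is where structured transforms help.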

6. Empirical Comparisons and Practical Significance

Empirical studies consistently demonstrate the superiority of SimRFs over both i.i.d. PRFs and classical ORFs in kernel approximation and learning applications.

  • Pointwise MSE: SimRFs exhibit up to 90% lower MSE than i.i.d. PRFs and 20–60% lower than ORFs in the reported settings.
  • Gram-matrix Approximation: Frobenius-norm errors in approximating the kernel matrix are uniformly lowest for SimRFs.
  • Nonparametric Classification: On eight UCI tasks, SimRF-based kernel regression yields 1–5% higher test accuracy relative to ORFs and 5–20% higher than i.i.d. PRFs.
  • Linear Transformers (Performers): Replacing ORFs with SimRFs in ViT architectures on datasets such as ImageNet-1K and Fashion-MNIST yields top-1 accuracy gains of up to +0.5% without increasing runtime (Reid et al., 2023).
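The Gram-matrix comparison in the list above relies on a Frobenius-norm error between the exact and approximated kernel matrices. A minimal sketch of such a metric is shown below for the softmax kernel with i.i.d. directions; the synthetic data, feature counts, and function names are assumptions for illustration and do not reproduce the reported experiments. Any other weight matrix (for example an orthogonal or simplex-coupled one) can be passed as `W` for comparison.

```python
import numpy as np

def prf(X, W):
    """Positive random features for exp(x^T y) given a weight matrix W of shape (m, d)."""
    half_sq_norm = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(W.shape[0])

def relative_gram_error(X, W):
    """Relative Frobenius error between the exact and approximated kernel matrices."""
    K = np.exp(X @ X.T)
    Phi = prf(X, W)
    return float(np.linalg.norm(Phi @ Phi.T - K, "fro") / np.linalg.norm(K, "fro"))

rng = np.random.default_rng(0)
X = 0.2 * rng.standard_normal((20, 4))                           # small synthetic dataset
err_small = relative_gram_error(X, rng.standard_normal((16, 4)))     # few features
err_large = relative_gram_error(X, rng.standard_normal((4096, 4)))   # many features
```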

A plausible implication is that further reductions in variance—while always beneficial—have task-dependent effects, with some datasets exhibiting larger accuracy gains from SimRFs.

SimRFs, and more generally positive orthogonal random features, refute the presumption that strict orthogonality in projection directions is optimal for MSE reduction among positive RFs. By leveraging simplex geometry, SimRFs minimize variance in an unbiased kernel approximation while maintaining strict nonnegativity, making them particularly well-suited for low-rank approximations in kernel machines and deep linear-attention models. SimRFs+ demonstrates the possibility of marginally further gains by optimizing over weight-dependent geometries at increased computational expense.

Open questions include:

  • Characterizing the precise conditions under which further variance reduction (as in SimRFs+) yields substantial task-dependent improvements.
  • Determining the full non-asymptotic optimum among weight-dependent couplings beyond series expansions in small $v = \|x + y\|$.
  • Investigating the application of quasi-Monte-Carlo or low-discrepancy sampling methods to positive random feature constructions (Reid et al., 2023).

These developments establish the foundation for future advances in positive, structure-coupled random feature maps for scalable kernel and attention-based learning architectures.
