
FAVOR+: Positive Random Fourier Features

Updated 2 May 2026
  • FAVOR+ is a positive random feature map that exponentiates Gaussian projections to give an unbiased approximation of the softmax kernel with nonnegative features.
  • It provides an unbiased Monte Carlo estimator for exponential kernels, making it valuable for scalable self-attention in Transformer models.
  • Despite its computational efficiency, FAVOR+ exhibits high variance compared to advanced methods, motivating tunable alternatives like GERFs and DERFs.

Positive Random Fourier Features (FAVOR+), introduced in the Performer framework and analyzed in FAVOR# (Likhosherstov et al., 2023), are a class of positive, non-trigonometric random feature maps designed for efficient and unbiased approximation of exponential kernels, notably the softmax kernel $K(x, y) = \exp(x^\top y)$. Unlike classical Random Fourier Features (RFFs), which use sine and cosine transforms, FAVOR+ constructs strictly positive random features by exponentiating random linear projections. This ensures entrywise nonnegativity in kernel and attention approximations and keeps row-wise renormalization stable, an essential property for scalable linear attention mechanisms in Transformer models.

1. Definition and Construction of FAVOR+

FAVOR+ is defined by the positive random feature map

$$f_{\mathrm{pos}}(\omega, x) = \exp(\omega^\top x - \|x\|^2/2)$$

where $\omega \sim \mathcal{N}(0, I_d)$ and $x \in \mathbb{R}^d$. Given $M$ independent samples $\omega^{(1)},\ldots,\omega^{(M)}$, the $M$-dimensional feature vector is

$$\varphi(x) = \frac{1}{\sqrt{M}} \left[ \exp\left((\omega^{(m)})^\top x - \|x\|^2/2\right) \right]_{m=1}^M.$$

The approximate kernel value between xx and yy is then computed as

$$\hat{K}(x, y) = \varphi(x)^\top \varphi(y) = \frac{1}{M} \sum_{m=1}^{M} f_{\mathrm{pos}}(\omega^{(m)}, x)\, f_{\mathrm{pos}}(\omega^{(m)}, y).$$

Unlike the original Rahimi–Recht RFFs, which use trigonometric functions and can generate negative values, FAVOR+'s features are always positive. This ensures the resulting Gram matrix is entrywise nonnegative, so the standard row-normalization step preserves the stochastic property required in softmax approximations.
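
The construction above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Performer implementation; the function names (`sample_projections`, `favor_plus_features`, `kernel_estimate`), the seed, and the example vectors are all hypothetical choices made for this sketch.

```python
import math
import random


def sample_projections(num_features, dim, rng):
    """Draw M i.i.d. projection vectors omega ~ N(0, I_d)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def favor_plus_features(x, omegas):
    """Return phi(x): M strictly positive features, scaled by 1/sqrt(M)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
        for omega in omegas
    ]


def kernel_estimate(x, y, omegas):
    """Approximate exp(x^T y) by the inner product phi(x)^T phi(y)."""
    phi_x = favor_plus_features(x, omegas)
    phi_y = favor_plus_features(y, omegas)
    return sum(a * b for a, b in zip(phi_x, phi_y))


# Illustrative inputs (assumed, not taken from the source).
rng = random.Random(0)
omegas = sample_projections(num_features=256, dim=4, rng=rng)
x = [0.1, -0.2, 0.3, 0.0]
y = [0.2, 0.1, -0.1, 0.4]

approx = kernel_estimate(x, y, omegas)
exact = math.exp(sum(a * b for a, b in zip(x, y)))
print(f"estimate={approx:.4f}  exact={exact:.4f}")
```

Because every feature is the exponential of a real number, each entry of `favor_plus_features` is positive regardless of the sampled projections, which is the property the row-normalization step relies on.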

2. Unbiasedness and Recovery of the Softmax Kernel

The construction of FAVOR+ yields an unbiased Monte Carlo estimator for the softmax kernel:

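$$\mathbb{E}\left[\varphi(x)^\top \varphi(y)\right] = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[f_{\mathrm{pos}}(\omega, x)\, f_{\mathrm{pos}}(\omega, y)\right] = \exp(x^\top y).$$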

This is established by noting:

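$$\mathbb{E}_{\omega}\left[\exp\left(\omega^\top (x + y) - \frac{\|x\|^2 + \|y\|^2}{2}\right)\right] = \exp\left(\frac{\|x + y\|^2}{2} - \frac{\|x\|^2 + \|y\|^2}{2}\right) = \exp(x^\top y)$$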

by direct computation of the Gaussian integral after completing the square in the exponent.
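
The key step in this argument is the Gaussian moment-generating identity $\mathbb{E}[\exp(\omega^\top z)] = \exp(\|z\|^2/2)$ with $z = x + y$. The sketch below checks it numerically under assumed inputs; `mgf_estimate`, the sample count, and the vectors are hypothetical and serve only as a sanity check of the algebra.

```python
import math
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(c * c for c in v)


def mgf_estimate(z, num_samples, rng):
    """Monte Carlo estimate of E[exp(omega^T z)] for omega ~ N(0, I_d)."""
    total = 0.0
    for _ in range(num_samples):
        projection = sum(rng.gauss(0.0, 1.0) * c for c in z)
        total += math.exp(projection)
    return total / num_samples


# Illustrative inputs (assumed).
rng = random.Random(1)
x = [0.3, -0.1]
y = [0.2, 0.4]
z = [a + b for a, b in zip(x, y)]

mgf_exact = math.exp(0.5 * sq_norm(z))   # exp(||x + y||^2 / 2)
mgf_mc = mgf_estimate(z, 20000, rng)

# Multiplying by the normalizers exp(-||x||^2/2) exp(-||y||^2/2) recovers the kernel.
normalizer = math.exp(-0.5 * (sq_norm(x) + sq_norm(y)))
kernel_from_identity = normalizer * mgf_exact
kernel_exact = math.exp(sum(a * b for a, b in zip(x, y)))
print(f"MC mgf={mgf_mc:.4f}  closed form={mgf_exact:.4f}")
print(f"kernel via identity={kernel_from_identity:.6f}  exp(x^T y)={kernel_exact:.6f}")
```

The second printed line is deterministic: it confirms that the normalizing factors cancel everything except the cross term $x^\top y$.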

The same methodology extends to the Gaussian kernel $\exp(-\|x - y\|^2/2)$ by a simple re-weighting of $f_{\mathrm{pos}}(\omega, x)$.
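
As a worked illustration of that re-weighting, assume the unit-bandwidth Gaussian kernel $\exp(-\|x - y\|^2/2)$ (the bandwidth and the name $f_{\mathrm{gauss}}$ are assumptions made here, not notation from the source). Expanding the squared distance gives:

```latex
\exp\!\left(-\tfrac{1}{2}\|x - y\|^2\right)
  = \exp\!\left(-\tfrac{1}{2}\|x\|^2\right)\,\exp(x^\top y)\,\exp\!\left(-\tfrac{1}{2}\|y\|^2\right)
\quad\Longrightarrow\quad
f_{\mathrm{gauss}}(\omega, x)
  = \exp\!\left(-\tfrac{1}{2}\|x\|^2\right) f_{\mathrm{pos}}(\omega, x)
  = \exp\!\left(\omega^\top x - \|x\|^2\right)
```

Since the extra factor depends on $x$ alone, it passes through the expectation over $\omega$, so the re-weighted features remain positive and the estimator remains unbiased.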

3. Variance Properties and Comparisons

FAVOR+ exhibits high variance compared to alternative random feature methods. The variance of the $M$-sample estimate $\varphi(x)^\top \varphi(y)$ is

$$\mathrm{Var}\left[\varphi(x)^\top \varphi(y)\right] = \frac{1}{M} \exp(2 x^\top y) \left(\exp(\|x + y\|^2) - 1\right).$$

There are no tunable parameters in $f_{\mathrm{pos}}$, so analytic variance reduction is not possible in this method. In empirical benchmarks (Section 5.1, Figure 1), the per-pair relative variance for FAVOR+ is in the range […] to […] on CIFAR-10-derived samples at […], orders of magnitude higher than the newly introduced SDERF, which achieves relative variance below […]. This large variance motivates richer, parameterized feature maps such as GERFs and DERFs, which admit closed-form variance minimization and achieve variance reduction by factors exceeding […].
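
To make the growth of the variance concrete, the sketch below evaluates the closed form $\frac{1}{M}\exp(2x^\top y)(\exp(\|x+y\|^2) - 1)$, which follows from the same Gaussian moment-generating identity used for unbiasedness. The function names are hypothetical, and normalizing by the squared kernel value is an assumed definition of relative variance chosen for illustration; the benchmark's exact definition may differ.

```python
import math


def favor_plus_variance(x, y, num_features):
    """Closed-form variance of the M-sample FAVOR+ estimate of exp(x^T y)."""
    dot = sum(a * b for a, b in zip(x, y))
    sum_sq_norm = sum((a + b) ** 2 for a, b in zip(x, y))  # ||x + y||^2
    return math.exp(2.0 * dot) * math.expm1(sum_sq_norm) / num_features


def relative_variance(x, y, num_features):
    """Variance divided by the squared kernel value exp(2 x^T y).

    This normalization is an assumption made for illustration.
    """
    sum_sq_norm = sum((a + b) ** 2 for a, b in zip(x, y))
    return math.expm1(sum_sq_norm) / num_features


unit = [0.6, 0.8]    # unit-norm direction, chosen arbitrarily
num_features = 128   # illustrative M
rv_by_scale = {}
for scale in (0.5, 1.0, 2.0):
    x = [scale * v for v in unit]
    rv_by_scale[scale] = relative_variance(x, x, num_features)
    print(f"scale={scale}: relative variance = {rv_by_scale[scale]:.3e}")
```

Under this normalization the relative variance depends only on $\|x + y\|^2$ and $M$, and it grows exponentially with the input norm; the only available lever is increasing $M$, which is the limitation the parameterized alternatives target.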

4. Self-Attention Approximation in Transformers via FAVOR+

In standard Transformers, the $L \times L$ self-attention matrix is

$$A = D^{-1} \exp(Q K^\top), \qquad D = \mathrm{diag}\left(\exp(Q K^\top)\, \mathbf{1}_L\right),$$

where $Q = [q_1, \ldots, q_L]^\top$ and $K = [k_1, \ldots, k_L]^\top$ in $\mathbb{R}^{L \times d}$ are the sequences of queries and keys, $\exp$ is applied entrywise, and any $1/\sqrt{d}$ scaling is absorbed into $Q$ and $K$. FAVOR+ replaces explicit softmax computation with a low-rank approximation:

  • For each query $q_i$ and key $k_j$, compute $\varphi(q_i)$ and $\varphi(k_j)$.
  • Construct matrices $\Phi_Q \in \mathbb{R}^{L \times M}$ and $\Phi_K \in \mathbb{R}^{L \times M}$ whose rows are these feature vectors.
  • Form the rank-$M$ estimate $\Phi_Q \Phi_K^\top \approx \exp(Q K^\top)$.
  • Compute the layer output as $\hat{D}^{-1} \Phi_Q (\Phi_K^\top V)$, where $V$ is the value matrix and $\hat{D} = \mathrm{diag}\left(\Phi_Q (\Phi_K^\top \mathbf{1}_L)\right)$.

This approach reduces self-attention complexity from $O(L^2 d)$ to $O(L M d)$ while guaranteeing a valid stochastic matrix after normalization due to entrywise positivity.
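
The steps above can be sketched as follows. This is a minimal, unoptimized illustration in plain Python rather than the Performer implementation: all function names are hypothetical, the sequence length, dimensions, and seed are arbitrary, and `explicit_weights` exists only as a quadratic-cost reference to check that reordering the matrix products leaves the result unchanged.

```python
import math
import random


def positive_features(rows, omegas):
    """Map each row vector x to phi(x), its M strictly positive FAVOR+ features."""
    scale = 1.0 / math.sqrt(len(omegas))
    features = []
    for x in rows:
        half_sq_norm = 0.5 * sum(v * v for v in x)
        features.append([
            scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
            for omega in omegas
        ])
    return features


def linear_attention(queries, keys, values, omegas):
    """Compute the normalized output without forming the L x L matrix."""
    phi_q = positive_features(queries, omegas)  # L x M
    phi_k = positive_features(keys, omegas)     # L x M
    num_features, d_v = len(omegas), len(values[0])
    kv = [[0.0] * d_v for _ in range(num_features)]  # Phi_K^T V, shape M x d_v
    k_sum = [0.0] * num_features                     # Phi_K^T 1_L, length M
    for feat, val in zip(phi_k, values):
        for m, f in enumerate(feat):
            k_sum[m] += f
            for c in range(d_v):
                kv[m][c] += f * val[c]
    outputs = []
    for feat in phi_q:
        denom = sum(f * s for f, s in zip(feat, k_sum))  # row sum, always > 0
        outputs.append([
            sum(feat[m] * kv[m][c] for m in range(num_features)) / denom
            for c in range(d_v)
        ])
    return outputs


def explicit_weights(queries, keys, omegas):
    """Reference path: build the row-normalized L x L matrix (quadratic cost)."""
    phi_q = positive_features(queries, omegas)
    phi_k = positive_features(keys, omegas)
    weights = []
    for fq in phi_q:
        scores = [sum(a * b for a, b in zip(fq, fk)) for fk in phi_k]
        total = sum(scores)
        weights.append([s / total for s in scores])
    return weights


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian test matrix as a list of rows (illustrative data only)."""
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, dim, num_features = 6, 3, 16
queries = random_matrix(seq_len, dim, rng, scale=0.5)
keys = random_matrix(seq_len, dim, rng, scale=0.5)
values = random_matrix(seq_len, dim, rng)
omegas = random_matrix(num_features, dim, rng)

outputs = linear_attention(queries, keys, values, omegas)
weights = explicit_weights(queries, keys, omegas)
print("row sums:", [round(sum(row), 6) for row in weights])
```

The same projections are shared between queries and keys, which is what makes the inner product of feature vectors an estimate of the kernel. A practical implementation would vectorize these loops with an array library; the nested lists here only keep the sketch dependency-free.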

5. Performance Benchmarks and Empirical Results

FAVOR+ demonstrates consistently higher empirical variance and lower accuracy compared to richer random feature classes such as GERFs, SADERF, and SDERF. In kernel regression tasks on UCI classification benchmarks (Section 5.2), FAVOR+ attains accuracy several percentage points below FAVOR#, which employs SDERF. In long-sequence Transformer applications:

  • On the LibriSpeech Conformer-Transducer, the normalized word error rate (NWER) for FAVOR+ is higher than FAVOR++ at small $M$, while FAVOR# achieves up to a […] absolute NWER reduction versus FAVOR++.
  • On GLUE natural language tasks, FAVOR+ underperforms other linear attention methods (ELU, ReLU, FAVOR++, FAVOR#) by 3–5 points on average, while FAVOR# recovers within 0.2–0.5 points of the full softmax baseline.

A summary of variance and accuracy comparisons across methods:

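| Method | Relative Variance (RV) | Test Accuracy (UCI) | GLUE Average |
|--------|------------------------|---------------------|--------------|
| FAVOR+ | […] to […] | Several % below FAVOR# | 3–5 pts below other variants |
| SDERF | […] | Highest | 0.2–0.5 pts from baseline |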

FAVOR# (using SDERF) achieves variance reductions by factors of […] and substantial performance gains relative to FAVOR+.

6. Limitations and Motivations for Generalized Random Features

Despite its unbiasedness and computational advantages, FAVOR+ is limited by excessive variance, a consequence of the lack of tunable parameters in $f_{\mathrm{pos}}$. This restricts its practical accuracy in both classical (kernel regression) and modern (Transformer) settings. Generalized Exponential RFs (GERFs) and Dense-Exponential RFs (DERFs) enable variance minimization by learning scalar or matrix parameters in the exponent; they address the principal shortcomings of FAVOR+ and underlie the advances realized in FAVOR# (Likhosherstov et al., 2023).
